Refactoring the Soar Lexer

After my last writeup on how Soar parses Soar code, I decided to dive in and try and refactor it, starting with the lexer (and here’s the resulting PR). I haven’t worked with C/C++ in a while, and a good part of the code I changed was close to my age, so it was an interesting learning experience for me.

The lexer used global state stored via an agent to keep track of the current input character and lexeme as well as to remember line and column numbers. The goal of the refactoring was of course to change that, but 30 commits later I have not yet completely separated the agent from the Lexer because all printing logic (warnings, etc.) requires an agent. The result of this is that after every change I make I have to do build Tests and watch the entire project be rebuilt because of the various interdependencies. Also, the lexer has none if its own tests, and so the full test suite must be run to find possible problems. The result here is that every time I make a change and need to see if it works, I start a long compile/test process running and watch more of some show while waiting for it to finish (you’ve probably seen this comic before). So the global state is not only difficult to work with, but it also slows development to a crawl.

The lexer had a lot of dead code which told me the story of how it used to work. Originally it would read Soar code straight from a file; later it was retrofitted to take production strings directly as an “alternate”. There must have been something off about the string input, however, because besides setting alternate_input_string, alternate_input_suffix was also required, and always needed to be set to ") ". Once the production string was finished, the lexer would read the suffix. The original reason for this was long gone, but the parser still expected productions to end with an R_PAREN lexeme, as did other lexer clients. Later developers copied the incantation to lex a string, not knowing why it was needed, and the refactored jSoar parser implementation has the check for the parenthesis commented and marked “this makes no sense to me”. So the current functionality started out as an alternative for testing or something and then became the only use case, leaving dead code everywhere.

Some of the updating work I did was already done in the [9.5] branch, but it was still interesting. There were workarounds for the old Think C and Symantech C++ compilers, between THINK_C and __SC__ #ifdefs. I had no idea what they were and found an excellent compiler macros list project on SourceForge. Apparently C had no boolean variable types, either, so programmers used typedef char Bool and defined TRUE and FALSE manually, and part of updating to C++ code is changing those to real bool types.

The value of making small, working commits was reinforced when my initial pull request broke the automated build when --no-scu was set. This setting causes each .cpp file to be compiled separately, ensuring that the build fails if a #include directive was forgotten somewhere. I was able to bisect the Git history and discover the problem pretty quickly.

Nuggets:

New TODO’s: