I recently took a course on assembler/compiler construction. In it we covered parsing algorithms such as LL(n), LR(n), LALR(n), and SLR(n).

I understand how these parsing algorithms can be used to determine whether an input string follows a context-free grammar (CFG). At some point I also understood how to tokenize the string using the grammar; however, my assignment code is not commented (in hindsight that was a mistake) and I do not understand it anymore.

Now I need a parser to convert lines of code into useful tokens, and I cannot for the life of me remember how to use a parsing algorithm and a context-free grammar to do lexical analysis on a string and extract tokens.

How can I take a generic CFG (with accompanying deterministic finite automaton) and an input string to generate an array of tokens?


my assignment code is not commented (in hindsight that was a mistake) and I do not understand it anymore.

I can't tell you the number of times I've heard that same story.

Sorry, but I can't help you with your question.


You do not use parsing algorithms to generate a list of tokens. The list of tokens is the input to the parsing algorithm. The output of the algorithm is generally an abstract syntax tree (AST) or some other intermediate representation of the code (or, in some cases, even the end product of the compilation or interpretation).

To generate the list of tokens you write a so-called lexer, which uses a totally different algorithm. This algorithm is usually based on regular expressions, not CFGs, and works by implementing a state machine that keeps track of which of the possible regular expressions can still match the input read so far.
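To make that concrete, here is a minimal sketch of a regex-driven lexer in Python. The token names and patterns (NUMBER, ID, OP, SKIP) are made up for the example; the trick of combining all rules into one master regex with named groups is a common way to let the regex engine do the "which rule still matches" bookkeeping:

```python
import re

# Illustrative token rules; order matters where patterns overlap.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"==|!=|<=|>=|[+\-*/=<>]"),
    ("SKIP",   r"\s+"),   # whitespace: matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def lex(source):
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup          # name of the rule that matched
        if kind != "SKIP":
            tokens.append((kind, match.group()))
    return tokens
```

One caveat of this sketch: `finditer` silently skips characters that no pattern matches, so a real lexer would also check for gaps between matches and report them as errors.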

A lexer can either be generated by tools like lex (or one of its many clones, such as flex) or written by hand. Since the tokens of most languages follow quite simple rules, hand-writing a lexer is often relatively simple. For example, a state machine for many common languages could follow rules like this:

Start-state: If the current character is a letter or underscore, switch to ID-state. If it's a digit, switch to Number-state. If it's a quote, switch to String-state. If it's an operator-start character, switch to Operator-state. If it's whitespace, discard it.

ID-state: If the current character is in [0-9a-zA-Z_], add it to the token. Otherwise: token is finished; if the characters read form a keyword, it's a keyword token; otherwise it's an identifier token. Return to Start-state.

Number-state: If the current character is a valid character for a number, add it to the token (possibly taking into account suffix characters like 'l' that may appear only at the end, or characters like 'E' that may appear only once, by introducing sub-states). Otherwise generate a number token from the characters read and return to Start-state.

String-state: (Remembering backslashes to know whether quotes are escaped, and possibly to parse escape sequences:) If the current character is not an unescaped quote, add it to the token. Otherwise generate a string token and return to Start-state.

Operator-state: Read the longest string of characters that forms a valid operator and then generate the corresponding token. This is the only state that may require backtracking when implemented naively, but since operators are usually no longer than three characters, that should not be very problematic.
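The states above translate almost directly into a hand-written loop. The following Python sketch implements them under some simplifying assumptions I'm making up for the example: a small keyword and operator set, integer-only numbers (no sub-states for suffixes or exponents), and operators at most two characters long:

```python
# Illustrative keyword and operator sets; a real language would have more.
KEYWORDS = {"if", "else", "while", "return"}
OPERATORS = {"+", "-", "*", "/", "=", "==", "!=", "<", "<=", ">", ">="}

def lex(source):
    tokens = []
    i, n = 0, len(source)
    while i < n:
        c = source[i]
        if c.isspace():                       # Start-state: discard whitespace
            i += 1
        elif c.isalpha() or c == "_":         # ID-state
            j = i
            while j < n and (source[j].isalnum() or source[j] == "_"):
                j += 1
            word = source[i:j]
            tokens.append(("KEYWORD" if word in KEYWORDS else "ID", word))
            i = j
        elif c.isdigit():                     # Number-state (integers only)
            j = i
            while j < n and source[j].isdigit():
                j += 1
            tokens.append(("NUMBER", source[i:j]))
            i = j
        elif c == '"':                        # String-state
            j = i + 1
            while j < n and source[j] != '"':
                j += 2 if source[j] == "\\" else 1   # skip escaped characters
            tokens.append(("STRING", source[i + 1:j]))
            i = j + 1
        else:                                 # Operator-state: longest match wins
            for length in (2, 1):
                if source[i:i + length] in OPERATORS:
                    tokens.append(("OP", source[i:i + length]))
                    i += length
                    break
            else:
                raise SyntaxError(f"unexpected character {c!r}")
    return tokens
```

Note how the Operator-state tries the two-character operators before the one-character ones, which is the "longest match" rule without any real backtracking.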


I knew I was forgetting a step. I do recall lexers and lexemes. Once I have the tokens, each one acts as a terminal in the context free grammar. Then I just create the derivation tree of the grammar and it should be identical to the syntax tree I am trying to create.
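That last step can be sketched quickly too. Here is a recursive-descent parser for a toy grammar I'm inventing for illustration (expr -> term (('+' | '-') term)*, term -> NUMBER | ID), which consumes a token list of the kind a lexer produces and builds the tree directly:

```python
def parse_expr(tokens):
    """expr -> term (('+' | '-') term)*"""
    node, pos = parse_term(tokens, 0)
    while pos < len(tokens) and tokens[pos] in (("OP", "+"), ("OP", "-")):
        op = tokens[pos][1]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)   # interior node of the derivation tree
    return node

def parse_term(tokens, pos):
    """term -> NUMBER | ID (a terminal becomes a leaf of the tree)"""
    kind, value = tokens[pos]
    if kind not in ("NUMBER", "ID"):
        raise SyntaxError(f"expected NUMBER or ID, got {kind}")
    return value, pos + 1
```

Each token from the lexer is indeed treated as a terminal here, and the nested tuples that come out are exactly the derivation tree of the grammar.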

Thank you!
