Consider the below statement

if(i==2)

The lexical analyzer reads the source program one character at a time.So if we divide the statement into characters it would divide if into i&f and it would say that i is an identifier.But it doesn't do this.So is there any look ahead mechanism which tells it that i is followed by f and if is classified as a keyword.

Recommended Answers

All 6 Replies

Homework, right? Sorry, but we don't do your work for you. So, what IS the question? :-)

Its not homework.I had a doubt which i posted...

There is no need for look-ahead in this case; rather, once the analyzer begins a token that could be an identifier, it continues collecting the token's characters until it finds one that isn't a possible character in an ientifier token (e.g., a whitespace, or some symbol that isn't used in identifiers such as a plus or an ampersand), and once it has the whole token string, it then checks to see if it is a keyword.

With your typical lexical analyzer, the analyzer is written as a Deterministic Finite Automaton, a conceptual 'device' which 'recognizes' a regular language by reading in the symbols of the language one at a time and changing its own state based on the symbols. Each state has a finite number of possible transitions, each of which represents a possible valid input in the regular language of the language. In the case of a lexical analyzer, the 'language' it recognizes is the formal specification of the tokens that the overall language is comprised of.

So suppose we have following 2 statements

if(i==2)
i=2;

So would the lexical analyzer first put "i" in the yytext and then "f" in yytext and then as it encounters "(" it would match yytext i.e "if" with the regular expression that we wrote for the keyword and ultimately know that if is a keyword?
In the second statement it would put "i" in the yytext and soon as it encounters "=" it would match yytext i.e "i" with the regular expression of an identifier and infer that "i" is an identifier?
But still i am not getting the fact that how would it know that "(" isn't a possible character in identifier and would not take it and then compare yytext with regular expression for keyword if we are talking about "if"

Ah, I see what you are getting at now. Yes, the DFA does read in the '(' character, which it recongnizes as not being a possible part of an indentifier, so (in most designs like this) it pushes the parenthesis back into the input stream, and changes state to the 'finished' state.

Now, I am talking about this as if the DFA were actually coded as an explicit state machine, but this isn't necessarily the case; there are a few approaches to implementing the lexer's DFA, some explict and some implict. Since you mention 'yytext', I assume you are using LEX or one of its descendants (e.g., flex), so you would be talking about an explicit DFA as opposed to an ad-hoc approach. Still, the lexical analyzer generator does a lot of these things automatically, and so may be obscuring things you are trying to grasp. You may want to look at the StateMachine and CharBuffer classes in my Suntiger Algol project as a comparison, since that uses an explicit DFA that actually is defined in terms of transitions. The code is n Python, but even if you don't kow the language it should be fairly clear.

Please correct me if i am wrong.
We write regular expressions in lex tool to divide the input stream into tokens and classify them by class if it belongs to the regular expressions we wrote.But lex tool can't understand whether the string belongs to the regular expression.For that purpose it internally constructs a DFA.If for the given input string the DFA moves to the final state then the lex tool recognizes that the string belongs to the regular expression and we classify the string on the class specified for that regular expression.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.