Say I have something similar to this: http://faculty.utpa.edu/chebotkoa/main/teaching/csci3336spring2013/slides/lexsyn/sample-lex-syn.html

Does anyone know how I would be able to make "\n" or "\t" recognized as a token? I would probably need to add a case for it in the switch in LexicalAnalysis.cpp, right? And add it to the state diagram? But how would I modify SyntaxAnalysis.cpp? Would I do something like:

if (nextToken == NEWLINE){
nextToken = LA.lex();
}

or would I have to do "\" and "n" separately?

    if (nextToken == "\"){
        nextToken = LA.lex();
        if (nextToken == "n"){
            nextToken = LA.lex();
        }
    }

Edited 2 Years Ago by munchlaxxx

I would probably need to add a case for it in the switch in LexicalAnalysis.cpp, right? And add it to the state diagram?

If you want them as separate token types, then yes to both, though most languages treat all whitespace as a single token type.

The answer to the last part depends on whether you are trying to represent the actual newline (which can be either a single byte or two bytes, depending on the operating system) or the escape representation of the newline. I assume you want the newline itself, in which case you would use the former version: a single check against a newline token, rather than matching "\" and "n" separately.

You would only need to parse the escape form when handling character and string literals, and since this mini-language doesn't have any quoted literal token types, it isn't relevant. Not yet anyway - you'll want to keep it in mind when you start working on more elaborate language forms later.

Thank you for responding! I've been working on this for a while now, changing different things around...

#include "Lexer.h"

// Main to test the Lexer
/*
int main(void) {
    Lexer lexer; // create a lexer object
    int token;
    if (lexer.openFile()) {
       lexer.getChar();
       do {
          token = lexer.lex();
       } while (token != EOF);
    }
}
*/

Lexer::Lexer(void)
{   
    fillSpecialWords();
}

Lexer::~Lexer(void)
{
    closeFile();
}


// Open File

bool Lexer::openFile() {
    file.open("lexerInput.txt");
    if (!file.is_open()) {
        printf("ERROR - cannot open lexerInput.txt \n");
        return false;
    }
    return true;
}


// CLOSE FILE

void Lexer::closeFile() { 
    file.close(); 
    system("pause"); ///debug testing purposes
}


// FILL SPECIAL WORDS

void Lexer::fillSpecialWords() {
    cout << "About to fill special words map\n";

    specialWord[string("If")] = 14;
    specialWord[string("then")] = 15;
    specialWord[string("otherwise")] = 16;
    specialWord[string("Do")] = 17;
    specialWord[string("until")] = 18;
    specialWord[string("While")]= 19;
    specialWord[string("Print")] = 20;
    specialWord[string("is")] = 21;
    specialWord[string("equals")] = 22;
    specialWord[string("does_not_equal")] = 23;
    specialWord[string("is_less_than")] = 30;
    specialWord[string("is_greater_than")] = 31;
    specialWord[string("plus")] = 32;
    specialWord[string("minus")] = 33;
    specialWord[string("times")] = 40;
    specialWord[string("divided_by")] = 41;
    specialWord[string("mod")] = 42;
    specialWord[string("and")] = 43;
    specialWord[string("or")] = 44;
    specialWord[string("End_loop")] = 51;
    specialWord[string("Define_function")] = 53;
    specialWord[string("End_function")] = 54;
    specialWord[string("Call")] = 55;
    specialWord[string("Prompt")] = 56;
    specialWord[string("Return")] = 57;
    specialWord[string("PLUS_CODE")]= 58;
    specialWord[string("MINUS_CODE")] = 59;
    specialWord[string("TIMES_CODE")] = 60;
    specialWord[string("MOD_CODE")] = 61;
    specialWord[string("DIVIDED_BY_CODE")] = 62;


}


// LOOKUP 
// returns the appropriate token to single-char operators (and defines it as its own lexeme)

int Lexer::lookup(char ch) { 
    switch (ch) { //calls file.unget() when looking ahead for more than one character but there isn't one, so everything stays consistent
        case '.':
            addChar();
            nextToken = PERIOD;
            break;
        case '_':
            addChar();
            nextToken = UNDERSCORE;
            break;  
        case '\n':
            addChar();
            nextToken = NEWLN_CODE;
            break;
        case '"':
            addChar();
            nextToken = QUOTE_CODE;
            break;
        case '\t':
            addChar();
            nextToken = INDENT_CODE;
            break;
        default:
            addChar();
            nextToken = EOF;
            break;
    }
    return nextToken;
}


// ADDCHAR
//    adds nextChar to the lexeme


void Lexer::addChar() { 
    if (lexLen <= 98) {
        //cout << "Character " << nextChar << " placed at index " << lexLen << "\n";
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = 0;
    }
    else 
        printf("Error - lexeme is too long \n");
}



// GETCHAR
//    gets the next char and sets its character class


void Lexer::getChar() { 

    if (file.good()) {
        nextChar = file.get();
        //cout << "Next character:" << nextChar << "|" << endl;

        if (isalpha(nextChar)) 
            charClass = LETTER;     
        else if (isdigit(nextChar)) 
            charClass = DIGIT;  

        else if (nextChar == '_')
            charClass = UNDERSCORE;

        else if (nextChar == '\n')
            charClass = SPACE;

        else if (nextChar == '\t')
            charClass = SPACE;

        else 
            charClass = UNKNOWN;            
    }
    else 
        charClass = EOF;
}


// just skips whitespace

void Lexer::getNonBlank() {
    while (isspace(nextChar))
        getChar();
}


// Reads in next lexemes and returns its associated token

int Lexer::lex() {
    cout << "LEX\n";

    lexLen = 0;
    getNonBlank();  // Eat up white space

    switch (charClass) {    
        //parse identifiers
        case LETTER:
            addChar();
            getChar();
            while (charClass == LETTER || charClass == DIGIT || charClass == UNDERSCORE) {
                addChar();
                getChar();
            }
            nextToken = specialWord[string(lexeme)];
            break;

        //parse ints
        case DIGIT:
            addChar();
            getChar();
            while (charClass == DIGIT) {
                addChar();
                getChar();
            }
            nextToken = INT_LIT;
            break;      


        // Single characters
        case UNKNOWN:
            lookup(nextChar);
            getChar();
            break;

        case EOF:
            nextToken = EOF;
            lexeme[0] = 'E';
            lexeme[1] = 'O';
            lexeme[2] = 'F';
            lexeme[3] = 0;
            break;
    } //end of switch

    printf("lexeme: %s \n",lexeme);
    return nextToken;
}

Is the way I added '\n' and '\t' okay (in lookup, along with the other symbols, and in getChar)? What about '"'?

Edit: Should I add the charClass SPACE to lex()? I'm getting confused...

Edited 2 Years Ago by munchlaxxx

Is the way I added '\n' and '\t' okay?

I'm not really sure, to be honest; it depends on whether you actually want them to carry some significance or not. In most (but not all) current languages, whitespace of all types (spaces, newlines, tabs) is treated as a delimiter, that is, it automatically ends the current token. Whitespace characters generally aren't treated as tokens in and of themselves, or if they are, they are lumped together in a single token type. However, there are languages where whitespace is significant (e.g., Python), so treating them as tokens might be reasonable in those cases. It depends on the language being recognized.

As for your handling of the double-quote, that seems reasonable, from what I have seen of this so far. Without knowing more about your intended design I can't say much more.

Edited 2 Years Ago by Schol-R-LEA
