I have to develop a simple parser to read "blocks" of text, for example:

/TEST
 {. text .}
/TEST_DATA
 {. infs .}

I need to read the information inside each label. The file containing this information has many labels, all with the same structure.

for example:

/TEST
 {. text .}
/TEST_DATA
 {. infs .}

/LBL1
 {. text .}
/LBL1_DATA
 {. infs .}

/LBL2
 {. text .}
/LBL2_DATA
 {. infs .}

/LBL3
 {. text .}
/LBL3_DATA
 {. infs .}

I need to read the block for a specific label, for example:

parseFile("FileName.txt", LBL1)

and have the function return the text inside the blocks LBL1 and LBL1_DATA, i.e. the content of LBL1 and LBL1_DATA.

I don't know how to do this, and I need some help.

Thanks.

The simplest solution, if you know the maximum line length and maximum text length, would be to use fgets(): read a line, compare it to the label you want, and if it matches, read the following data line(s) the same way. Alternatively you could use fread(), but then you have to parse raw chunks yourself, which is harder, or read the whole file into a buffer and parse it there, if the file size is limited.
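
Something along those lines, as a very rough sketch (the parseFile name and signature just mirror the call in the original post, with the label passed as a string; the buffer size and the prefix-matching rule are assumptions):

#include <stdio.h>
#include <string.h>

#define MAX_LINE 256   /* assumed maximum line length */

/* Rough sketch: print every line that belongs to the given label's
   blocks. Because "/LBL1_DATA" starts with "/LBL1", prefix matching
   picks up both the LBL1 and LBL1_DATA blocks. */
int parseFile(const char* filename, const char* label)
{
    char line[MAX_LINE];
    char wanted[MAX_LINE];
    int in_block = 0;
    FILE* fp = fopen(filename, "r");

    if (fp == NULL)
    {
        return -1;   /* could not open the file */
    }

    snprintf(wanted, sizeof wanted, "/%s", label);

    while (fgets(line, sizeof line, fp) != NULL)
    {
        if (line[0] == '/')   /* a label line: are we entering the block we want? */
        {
            in_block = (strncmp(line, wanted, strlen(wanted)) == 0);
        }
        else if (in_block)
        {
            fputs(line, stdout);   /* content of the requested block */
        }
    }

    fclose(fp);
    return 0;
}

/* usage: parseFile("FileName.txt", "LBL1"); */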

OK, let's start with the basics. In general, the first step in parsing a stream of text is to break the text up into lexemes (tokens), right? So, what you want to do is work out the basic components of the input format, write a simple lexical analyzer to tokenize the input stream, and write your parser to work off of that token stream.

Without more information about the particular file format or language you're parsing, it's hard to say exactly what your tokens would be, but from the example you gave, you might have /, {., .}, _ and strings of alphanumeric characters as your lexemes. There may be additional production rules that aren't quite so obvious (e.g., what to do with spaces, newlines, and other whitespace), but this seems a likely set of values for now. This would give us a set of tokens and a token structure type something like this:

#define MAX_TOKEN_SIZE 64   /* adjust to the longest lexeme you expect */

/* END_OF_FILE is used instead of EOF to avoid clashing with the EOF macro from <stdio.h> */
enum TOKEN_TYPE {INVALID = -1, END_OF_FILE, FWD_SLASH, OPEN_BRACE, CLOSE_BRACE, UNDERSCORE, STRING};

struct Token
{
    enum TOKEN_TYPE id;
    char value[MAX_TOKEN_SIZE];
};

Now you need to work out how to handle the text stream. As Sokurenko suggested, the fgets() function is a good place to start; however, it gets a bit more complicated than that, or at least it can, depending on what (if anything) a newline character means in the given source language or format. Also, while you want to read in the input stream a line at a time, you will almost certainly need to process the input character by character. You'll want to write a function that reads in a line of source, then returns that line one character at a time; once the line is used up, it reads another line:

#include <stdio.h>

#define MAX_LINE_SIZE 256   /* adjust to the longest line you expect */

/* Returns the next character of the input, reading a new line whenever
   the current one is used up; returns EOF at end of file (which is why
   the return type is int rather than char). */
int get_next_char(FILE* infile)
{
    static char buffer[MAX_LINE_SIZE] = "";
    static unsigned counter = 0;
    static unsigned line = 0;

    if (buffer[counter] == '\0')
    {
        if (fgets(buffer, MAX_LINE_SIZE, infile) == NULL)
        {
            return EOF;
        }
        counter = 0;
        line++;
    }
    return buffer[counter++];
}

You would then use this in the top-level lexical analyzer function to start each new token.

/* For brevity this sketch returns only the TOKEN_TYPE; a fuller version
   would fill in a struct Token with the lexeme text as well. */
enum TOKEN_TYPE get_token(FILE* infile)
{
    int first, next;

    first = get_next_char(infile);
    if (first == EOF)
    {
        return END_OF_FILE;   /* end of file token */
    }
    if (first == '{')
    {
        next = get_next_char(infile);
        if (next == '.')
        {
            return OPEN_BRACE;
        }
        else
        {
            return INVALID;   /* no production has '{' followed by anything other than '.' */
        }
    }
    else if (first == '.')
    {
        /* and so on for the other lexemes */
    }

    return INVALID;
}
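
As a quick way to exercise the lexer, a small test driver might look something like this (just a sketch; the file name FileName.txt comes from the original post, and it assumes the get_token() above, which returns an enum TOKEN_TYPE):

/* Minimal test driver: print each token's id until end of file or an
   invalid token is seen. */
int main(void)
{
    FILE* infile = fopen("FileName.txt", "r");
    enum TOKEN_TYPE tok;

    if (infile == NULL)
    {
        return 1;
    }

    do
    {
        tok = get_token(infile);
        printf("token id: %d\n", tok);
    } while (tok != END_OF_FILE && tok != INVALID);

    fclose(infile);
    return 0;
}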

This should hopefully get you started; it doesn't address the parsing, but it does take care of a necessary prerequisite. I'm doing this a bit more formally than is probably necessary, but that's to make it clearer what I mean. HTH.

Learn about finite state machines (FSM). This is how professionals create parsers. There are tools that help with this (Yacc/Lex et al) once you have written the rules needed by the tools. They then create C code that will do the heavy lifting for you. Remember, Google is your friend!

Also, fwiw, the modern version of Yacc (a Unix tool) is Bison (a GNU tool that is compatible with Yacc).

@rubberman
Aren't finite state machines just regular expressions?

Rubberman: To be more specific, for any regular grammar, there is a Deterministic Finite Automaton (DFA) capable of accepting the language described by the grammar. Because DFA are very easy to code, and can be generated automatically, almost every modern language is designed so that the lexemes form a regular language. Thus, it is typical to define the lexical analyzer by creating a regular grammar for the lexemes, and create a DFA to implement it. The approach I demonstrated above is a less formal version of this, but it still works by implementing a DFA; if I were being more formal, I (or, preferably, the OP) would have defined a grammar for the lexemes first, then worked from the production rules step-by-step. Since this is clearly a course assignment, I wouldn't have suggested using a lexer generator, though it is good to point that out.
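
To make that concrete, here is a small, self-contained sketch of a hand-written DFA that accepts the regular language of alphanumeric strings, roughly the STRING lexeme above; the state names and the helper functions are my own illustration:

#include <ctype.h>

/* A tiny hand-written DFA for the regular language [A-Za-z0-9]+ .
   REJECT is a dead state: once entered, the input can never be accepted. */
enum dfa_state { START, IN_STRING, REJECT };

static enum dfa_state step(enum dfa_state s, int c)
{
    switch (s)
    {
    case START:
    case IN_STRING:
        return isalnum(c) ? IN_STRING : REJECT;
    default:
        return REJECT;
    }
}

/* Accepts iff the DFA halts in the accepting state IN_STRING. */
int is_string_lexeme(const char* text)
{
    enum dfa_state s = START;

    while (*text != '\0')
    {
        s = step(s, (unsigned char)*text);
        text++;
    }
    return s == IN_STRING;
}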

DFA are not capable of recognizing context-free grammars, which includes pretty much all Turing-complete programming languages (those which aren't themselves context-sensitive, that is; note, however, that the fact that the language's semantics are Turing-complete does not require the language's grammar to be recursively-enumerable). Thus, a parser for most programming languages has to be more complicated than a DFA, requiring the equivalent of a Push-Down Automaton to recognize it. This is why parsers and lexical analyzers are different from each other.

Without more information (say, a pre-defined grammar written in EBNF), it is hard to say whether this language requires a full context-free grammar or not. As simple as it is, it may be possible to define it with a regular grammar, in which case the whole 'parser' would be rather simple.

Sokurenko: While the two are related, they are not the same. Both regular expressions and Deterministic Finite State Automata are capable of recognizing a regular language, but DFA are a more primitive form, and are often used in implementing REs. REs are useful mainly because they present a very compact notation for pattern matching, but they are actually not that easy to implement without using a DFA underneath.

commented: Oh, I see now: I can write a regular expression for every automaton, but not the other way around :) +2