Lexer- Tokenizer problem
Hi, I use the flex tool (http://www.gnu.org/software/flex/manual/) to generate a tokenizer, but I have the following problem (it has to do with the way flex tokenizes the input):
FILE : flex.l
%{
#define WEB 0
#define SPACE 1
#define STRING 2
%}
string_component [0-9a-zA-Z \t\.!#$%^&()*@_]
%%
"daniweb" {return WEB;}
[ \t\n] {return SPACE;}
{string_component}+ {return STRING;}
%%
#include <iostream>
using namespace std;
int main()
{
cout<<yylex()<<endl;
cout<<yylex()<<endl;
return 0;
}
int yywrap(void){return 1;}
Example file:
test_string daniweb
What I want is to have the above string tokenized as
STRING SPACE WEB
Instead, flex recognizes it as STRING, because it tries to match the longest input.
How can I fix this problem?
All ideas are welcome.
PS: to compile:
flex flex.l
g++ lex.yy.c
./a.out <example
Your string component matches spaces, and now you're complaining that you don't want to match spaces.
You can't have it both ways.
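To make the point concrete, here is a minimal C++ sketch. It is not flex itself, and the helper names longest_prefix and string_component are made up for illustration. It measures how far a pattern like {string_component}+ would run on the example line, once with the blank characters in the class and once without.

```cpp
#include <cstddef>
#include <string>

// Length of the longest prefix of `input` made only of characters from
// `char_class`, i.e. roughly what {string_component}+ would consume.
// Hypothetical helper for illustration; it is not part of flex.
std::size_t longest_prefix(const std::string& input, const std::string& char_class)
{
    std::size_t n = input.find_first_not_of(char_class);
    return n == std::string::npos ? input.size() : n;
}

// Spells out the character class from the question as a plain string.
// `with_blanks` adds the space and tab that the original class contains.
std::string string_component(bool with_blanks)
{
    std::string set = "0123456789"
                      "abcdefghijklmnopqrstuvwxyz"
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                      ".!#$%^&()*@_";
    if (with_blanks) set += " \t";
    return set;
}
```

With the space and tab in the class, the run covers all 19 characters of "test_string daniweb". Without them it stops after the 11 characters of "test_string", which leaves the space and "daniweb" for the other rules.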
Thank you for answering (apparently, few people have read the post...).
Yes, you are right, it seems that I can't have it both ways. But from where I stand, I want to use flex to do the following:
Recognize some specific keywords (in the simplified example I provided, the keyword was "daniweb") and recognize everything else as a string. Any ideas on how I can do that?
PS: maybe start conditions could help me solve the problem? (I haven't understood them very well...)
PS2: in the beginning I thought it wouldn't be that difficult, but I was wrong...
What is this Flex? Some kind of regular expression library or something? Do you even need it, or can your problem be simplified?
Flex
Flex (The Fast Lexical Analyzer)
Flex is a fast lexical analyser generator. It is a tool for generating programs that perform pattern-matching on text. Flex is a non-GNU free implementation of the well known Lex program.
http://www.gnu.org/software/flex/
http://flex.sourceforge.net/
There's a way to set the precedence of regexes in flex. I don't remember the exact syntax, but you should put it before the catch-all regex that you have defined there.
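For what it's worth, the flex manual describes the selection like this: the longest match wins, and when two rules match the same length, the one listed first in the file wins. Below is a rough C++ emulation of that selection for the three rules in the question, assuming the space and tab are dropped from string_component. The enum, the match_* helpers and tokenize are hypothetical names for this sketch, and unlike flex (which echoes unmatched characters by default) it simply skips them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Token names mirror the question; their order matches the rule order in flex.l.
enum Token { WEB, SPACE, STRING };

// Characters accepted by STRING: the question's class minus space and tab
// (an assumption of this sketch, not the original definition).
static const std::string kStringChars =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.!#$%^&()*@_";

// Rule 1: the literal keyword "daniweb".
static std::size_t match_web(const std::string& s, std::size_t pos)
{
    static const std::string kw = "daniweb";
    return s.compare(pos, kw.size(), kw) == 0 ? kw.size() : 0;
}

// Rule 2: [ \t\n] matches exactly one whitespace character.
static std::size_t match_space(const std::string& s, std::size_t pos)
{
    return (s[pos] == ' ' || s[pos] == '\t' || s[pos] == '\n') ? 1 : 0;
}

// Rule 3: one or more string characters.
static std::size_t match_string(const std::string& s, std::size_t pos)
{
    std::size_t end = s.find_first_not_of(kStringChars, pos);
    if (end == std::string::npos) end = s.size();
    return end - pos;
}

std::vector<Token> tokenize(const std::string& input)
{
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t len[3] = { match_web(input, pos),
                                     match_space(input, pos),
                                     match_string(input, pos) };
        int best = 0;
        for (int i = 1; i < 3; ++i)
            if (len[i] > len[best]) best = i;     // strict '>' keeps the earlier rule on a tie
        if (len[best] == 0) { ++pos; continue; }  // no rule matched: skip the character
        out.push_back(static_cast<Token>(best));
        pos += len[best];
    }
    return out;
}
```

Calling tokenize("test_string daniweb") should give STRING SPACE WEB: "daniweb" matches both the keyword rule and the string rule at length 7, and the keyword rule comes first. A longer word such as "daniwebtest" still comes out as one STRING, because the longer match beats rule order.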
Unfortunately I haven't found the solution. I worked around my problem by changing the grammar (i.e. the bison file), and finally I gave the project... Now, when I find the time, I will try to find a solution using start conditions.
Using Boost.Spirit may be much easier: http://www.boost.org/libs/spirit/doc/quick_start.html
#include <boost/spirit/core.hpp>
#include <iostream>
#include <iterator>   // ostream_iterator
#include <string>
#include <vector>
#include <algorithm>
#include <boost/assign.hpp>
using namespace std ;
using namespace boost ;
using namespace boost::spirit ;
using namespace boost::assign ;

struct parse_it
{
  void operator() ( const string& str ) const
  {
    vector<string> tokens ;
    const char* cstr = str.c_str() ;
    size_t n = 0 ;

    // parse one token per call; the alternatives are tried in the order written
    while( n < str.size() )
      n += parse( cstr + n,
            (+space_p) [ push_back_a( tokens, "SPACE" ) ]
          | str_p("daniweb") [ push_back_a( tokens, "WEB" ) ]
          | str_p("lexer") [ push_back_a( tokens, "LEX" ) ]
          | str_p("tokenizer") [ push_back_a( tokens, "TOK" ) ]
          | (+~space_p) [ push_back_a( tokens, "STRING" ) ]
          ).length ;

    cout << '\n' << "parsed: " << str << "\ntokens: " ;
    copy( tokens.begin(), tokens.end(),
          ostream_iterator<string>(cout," ") ) ;
    cout << '\n' ;
  }
};

int main()
{
  vector<string> test_cases = list_of
      ( "test daniweb lexer xyz tokenizer lexer" )
      ( "daniweblexer tokenizerlexer abcd lexerlexer" )
      ( "daniwebtest lexerdaniweblexertest tokenizerxxx" ) ;
  for_each( test_cases.begin(), test_cases.end(), parse_it() ) ;
}

/**
>g++ -Wall -std=c++98 -I/usr/local/include keyword.cpp && ./a.out

parsed: test daniweb lexer xyz tokenizer lexer
tokens: STRING SPACE WEB SPACE LEX SPACE STRING SPACE TOK SPACE LEX

parsed: daniweblexer tokenizerlexer abcd lexerlexer
tokens: WEB LEX SPACE TOK LEX SPACE STRING SPACE LEX LEX

parsed: daniwebtest lexerdaniweblexertest tokenizerxxx
tokens: WEB STRING SPACE LEX WEB LEX STRING SPACE TOK STRING
*/
Last edited by vijayan121 : Aug 30th, 2007 at 1:55 pm.