Lexer- Tokenizer problem

n.aggel
#1
Aug 28th, 2007
Hi, I use the flex tool (http://www.gnu.org/software/flex/manual/) to generate a tokenizer, but I have the following problem, which has to do with the way flex tokenizes the input:

File: flex.l

%{
#define WEB 0
#define SPACE 1
#define STRING 2
%}

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

%%

"daniweb" {return WEB;}
[ \t\n] {return SPACE;}
{string_component}+ {return STRING;}

%%

#include <iostream>

using namespace std;

int main()
{
    cout<<yylex()<<endl;
    cout<<yylex()<<endl;

    return 0;
}

int yywrap(void){return 1;}

Example file:
test_string daniweb

What I want is to have the above string tokenized as
STRING SPACE WEB
Instead, flex recognizes the whole line as a single STRING, because it tries to match the longest input.

How can I fix this problem?
All ideas are welcome.

PS: to compile:
flex flex.l
g++ lex.yy.c
./a.out < example

Salem
#2
Aug 28th, 2007
Your string component matches spaces, and now you're complaining that you don't want to match spaces.

You can't have it both ways.
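
To make the point above concrete, here is a minimal C++ sketch (not flex itself) of the matching rule flex applies: at each position it tries every pattern, keeps the longest match, and on a tie prefers the rule listed first. The names string_run and tokenize and the allow_blank flag are made up for this illustration; the three rules mirror the ones in flex.l from post #1.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Illustration only: these helpers imitate flex's matching rule, they are not
// part of flex. All names here are hypothetical.

// Length of the run of string_component characters starting at pos.
// allow_blank says whether space and tab belong to the character class.
std::size_t string_run(const std::string& in, std::size_t pos, bool allow_blank) {
    assert(pos <= in.size());
    const std::string punct = ".!#$%^&()*@_";
    std::size_t n = 0;
    while (pos + n < in.size()) {
        const char c = in[pos + n];
        const bool blank = (c == ' ' || c == '\t');
        const bool ok = std::isalnum(static_cast<unsigned char>(c)) != 0 ||
                        punct.find(c) != std::string::npos ||
                        (allow_blank && blank);
        if (!ok) break;
        ++n;
    }
    return n;
}

// Tokenize with the three rules of flex.l, in the order they are listed.
std::vector<std::string> tokenize(const std::string& in, bool allow_blank) {
    static const char* const names[3] = {"WEB", "SPACE", "STRING"};
    const std::string keyword = "daniweb";
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        const char c = in[pos];
        std::size_t len[3];
        len[0] = in.compare(pos, keyword.size(), keyword) == 0 ? keyword.size() : 0;
        len[1] = (c == ' ' || c == '\t' || c == '\n') ? 1 : 0;
        len[2] = string_run(in, pos, allow_blank);
        // Longest match wins; the strict '>' keeps the earlier rule on a tie.
        int best = 0;
        for (int r = 1; r < 3; ++r)
            if (len[r] > len[best]) best = r;
        // No rule matched: this sketch just skips the character
        // (flex's default rule would echo it instead).
        if (len[best] == 0) { ++pos; continue; }
        out.push_back(names[best]);
        pos += len[best];
    }
    return out;
}
```

With allow_blank set to true (space and tab inside the character class, as in the posted file), tokenize("test_string daniweb", true) yields a single STRING. With it set to false, the same input yields STRING SPACE WEB. In flex.l the corresponding change would be to drop the space and \t from string_component, assuming strings never need to contain blanks.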

n.aggel
#3
Aug 29th, 2007
Originally Posted by Salem:
Your string component matches spaces, and now you're complaining that you don't want to match spaces.

You can't have it both ways.

Thank you for answering (apparently, few people have read the post...).

Yes, you are right, it seems that I can't have it both ways. But from where I stand, I want to use flex to do the following:

Recognize some specific keywords (in the simplified example I provided, the keyword was "daniweb") and recognize everything else as a string. Any ideas on how I can do that?

PS: Maybe start conditions could help me solve the problem? (I haven't understood them very well.)
PS2: In the beginning I thought it wouldn't be that difficult, but I was wrong...

iamthwee
#4
Aug 29th, 2007
What is this Flex? some kinda regular expression library or something. Do you even need it or can your problem be simplified?
Last edited by iamthwee; Aug 29th, 2007 at 2:42 pm.

n.aggel
#5
Aug 29th, 2007
Originally Posted by iamthwee:
What is this Flex? some kinda regular expression library or something. Do you even need it or can your problem be simplified?
Flex

Flex (The Fast Lexical Analyzer)
Flex is a fast lexical analyser generator. It is a tool for generating programs that perform pattern-matching on text. Flex is a non-GNU free implementation of the well known Lex program.


http://www.gnu.org/software/flex/
http://flex.sourceforge.net/

iamthwee
#6
Aug 29th, 2007
Um ok, please explain this:

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

and what you think it does?
Last edited by iamthwee; Aug 29th, 2007 at 3:11 pm.

nedrocks
#7
Aug 29th, 2007
There's a way to set precedence of regex's in flex. I don't remember the exact syntax, but you should put it before your catchall regex that you have defined there.
Last edited by nedrocks; Aug 29th, 2007 at 5:40 pm.
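
For reference, the flex manual describes the selection rule like this: the rule matching the most text wins, and when several rules match the same amount of text, the one listed first in the .l file is chosen. Rule order therefore only acts as a tie-breaker. A small C++ sketch of that selection step follows; winning_rule is a hypothetical helper written for this illustration, not a flex function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: match_len[i] is how many characters rule i (in file
// order) can match at the current input position. Returns the index of the
// rule that wins, or -1 if no rule matches anything.
int winning_rule(const std::vector<std::size_t>& match_len) {
    assert(!match_len.empty());
    // std::max_element returns the FIRST of the largest elements, so the
    // longest match wins and equal lengths fall back to the earlier rule.
    const auto it = std::max_element(match_len.begin(), match_len.end());
    if (*it == 0) return -1;
    return static_cast<int>(it - match_len.begin());
}
```

Assuming the character class no longer contains blanks, the input "daniweb" followed by a space gives lengths {7, 0, 7} for the three rules of flex.l, so rule 0 (WEB) wins the tie. For "daniwebtest" the lengths are {7, 0, 11}, so rule 2 (STRING) wins even though the keyword rule comes first.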

n.aggel
#8
Aug 30th, 2007
Originally Posted by nedrocks:
There's a way to set precedence of regex's in flex. I don't remember the exact syntax, but you should put it before your catchall regex that you have defined there.
I haven't seen what you mention in the manual.

Unfortunately I haven't found the solution. I worked around my problem by changing the grammar (i.e. the bison file), and finally I handed in the project. When I find the time, I will try to find a solution using start conditions.

iamthwee
#9
Aug 30th, 2007
First you gotta know what your regular expressions are doing.

To me, string_component [0-9a-zA-Z \t\.!#$%^&()*@_] and the example you have given are contradictory, like Salem mentioned.

vijayan121
#10
Aug 30th, 2007
Using Boost.Spirit may be much easier: http://www.boost.org/libs/spirit/doc/quick_start.html
// Later Boost releases ship classic Spirit as
// <boost/spirit/include/classic_core.hpp> in namespace boost::spirit::classic.
#include <boost/spirit/core.hpp>
#include <iostream>
#include <iterator>   // ostream_iterator
#include <string>
#include <vector>
#include <algorithm>
#include <boost/assign.hpp>
using namespace std ;
using namespace boost ;
using namespace boost::spirit ;
using namespace boost::assign ;

struct parse_it
{
    void operator() ( const string& str ) const
    {
        vector<string> tokens ;
        const char* cstr = str.c_str() ;
        size_t n = 0 ;
        // parse one token per iteration; alternatives are tried in order
        while( n < str.size() )
            n += parse( cstr + n,
                    (+space_p) [ push_back_a( tokens, "SPACE" ) ] |
                    str_p("daniweb") [ push_back_a( tokens, "WEB" ) ] |
                    str_p("lexer") [ push_back_a( tokens, "LEX" ) ] |
                    str_p("tokenizer") [ push_back_a( tokens, "TOK" ) ] |
                    (+~space_p) [ push_back_a( tokens, "STRING" ) ]
                  ).length ;
        cout << '\n' << "parsed: " << str << "\ntokens: " ;
        copy( tokens.begin(), tokens.end(),
              ostream_iterator<string>(cout," ") ) ;
        cout << '\n' ;
    }
};

int main()
{
    vector<string> test_cases = list_of
        ( "test daniweb lexer xyz tokenizer lexer" )
        ( "daniweblexer tokenizerlexer abcd lexerlexer" )
        ( "daniwebtest lexerdaniweblexertest tokenizerxxx" ) ;
    for_each( test_cases.begin(), test_cases.end(), parse_it() ) ;
}
/**
>g++ -Wall -std=c++98 -I/usr/local/include keyword.cpp && ./a.out

parsed: test daniweb lexer xyz tokenizer lexer
tokens: STRING SPACE WEB SPACE LEX SPACE STRING SPACE TOK SPACE LEX

parsed: daniweblexer tokenizerlexer abcd lexerlexer
tokens: WEB LEX SPACE TOK LEX SPACE STRING SPACE LEX LEX

parsed: daniwebtest lexerdaniweblexertest tokenizerxxx
tokens: WEB STRING SPACE LEX WEB LEX STRING SPACE TOK STRING
*/
Last edited by vijayan121; Aug 30th, 2007 at 2:55 pm.
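
The Spirit grammar above tries its alternatives in order and takes the first one that matches, which is a different policy from flex's longest match. For readers without Boost, the same ordered-choice idea can be sketched with the standard library alone. This is a hypothetical stand-in written for illustration (tokenize_ordered is an invented name), not a translation of Spirit's internals.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Ordered choice: whitespace run, then keywords in listed order, then a run
// of non-whitespace characters. The first alternative that matches is taken.
std::vector<std::string> tokenize_ordered(const std::string& in) {
    static const std::pair<const char*, const char*> keywords[] = {
        {"daniweb", "WEB"}, {"lexer", "LEX"}, {"tokenizer", "TOK"}};
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::size_t start = pos;
        // 1. one or more whitespace characters
        while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos])))
            ++pos;
        if (pos > start) { tokens.push_back("SPACE"); continue; }
        // 2. keywords, first match wins
        bool matched = false;
        for (const auto& kw : keywords) {
            const std::string word = kw.first;
            if (in.compare(pos, word.size(), word) == 0) {
                tokens.push_back(kw.second);
                pos += word.size();
                matched = true;
                break;
            }
        }
        if (matched) continue;
        // 3. one or more non-whitespace characters
        while (pos < in.size() && !std::isspace(static_cast<unsigned char>(in[pos])))
            ++pos;
        assert(pos > start);  // in[start] is not whitespace, so step 3 always advances
        tokens.push_back("STRING");
    }
    return tokens;
}
```

Because keywords are tried before the catch-all, "daniwebtest" comes out as WEB STRING here, matching the output shown in the post above, whereas a longest-match lexer such as flex would treat the whole word as one STRING.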
©2003 - 2009 DaniWeb® LLC