
Lexer- Tokenizer problem

n.aggel

  #1  
Aug 28th, 2007
Hi, I use the flex tool (http://www.gnu.org/software/flex/manual/) to generate a tokenizer, but I have the following problem, which has to do with the way flex tokenizes the input:

File: flex.l

%{
		#define WEB 0
		#define SPACE 1
		#define STRING 2	
%}

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

%%

"daniweb"               {return WEB;}
[ \t\n]                 {return SPACE;}
{string_component}+     {return STRING;}

%%

#include <iostream>
			
using namespace std;
		
int main()
{	
	cout<<yylex()<<endl;
	cout<<yylex()<<endl;

	return 0;
}

int yywrap(void){return 1;}

Example file:
test_string daniweb

What I want is to have the above string tokenized as
STRING SPACE WEB
Instead, flex recognizes it as a single STRING, because it tries to match the longest input.

How can I fix this problem?
All ideas are welcome.

PS: To compile:
flex flex.l
g++ lex.yy.c
./a.out < example
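
The behaviour described above can be reproduced outside flex. The following is only a rough sketch: simulate_flex and Rule are hypothetical names, and std::regex stands in for flex's own matcher, so it imitates the selection rule (longest match, earlier rule on a tie) rather than flex itself.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not part of flex: tries the three patterns from
// flex.l at the current position and keeps the longest match.
std::vector<std::string> simulate_flex(const std::string& input)
{
    struct Rule { const char* name; std::regex pattern; };
    const std::vector<Rule> rules = {
        { "WEB",    std::regex("daniweb") },
        { "SPACE",  std::regex(R"re([ \t\n])re") },
        // Same character class as string_component: blank and tab included.
        { "STRING", std::regex(R"re([0-9a-zA-Z \t.!#$%^&()*@_]+)re") },
    };

    std::vector<std::string> tokens;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        std::size_t best_len = 0;
        const char* best_name = nullptr;
        for (const Rule& rule : rules) {
            std::smatch m;
            // match_continuous: the match has to start exactly at pos.
            if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                  std::regex_constants::match_continuous)) {
                const std::size_t len = static_cast<std::size_t>(m.length(0));
                // Only a strictly longer match wins, so a tie keeps the earlier rule.
                if (len > best_len) { best_len = len; best_name = rule.name; }
            }
        }
        if (best_name == nullptr) { ++pos; continue; }  // no rule matched: skip one character
        tokens.push_back(best_name);
        pos += static_cast<std::ptrdiff_t>(best_len);
    }
    return tokens;
}
```

Calling simulate_flex("test_string daniweb\n") returns STRING followed by SPACE: the STRING pattern swallows the blank and "daniweb" in one match, and the SPACE token is only the trailing newline.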

Salem

  #2  
Aug 28th, 2007
Your string component matches spaces, and now you're complaining that you don't want to match spaces.

You can't have it both ways.

n.aggel

  #3  
Aug 29th, 2007
Originally posted by Salem:
Your string component matches spaces, and now you're complaining that you don't want to match spaces.

You can't have it both ways.



Thank you for answering (apparently, few people have read the post).

Yes, you are right, it seems that I can't have it both ways. But what I want to use flex for is the following:

Recognize some specific keywords (in the simplified example I provided, the keyword was "daniweb") and recognize everything else as a string. Any ideas on how I can do that?

PS: Maybe start conditions could help me solve the problem? (I haven't understood them very well.)
PS2: In the beginning I thought it wouldn't be that difficult, but I was wrong.
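
One common way to get "keywords, and everything else as a string" is to scan a whole word first and only then check whether it is a keyword. The sketch below is plain C++, not flex; classify_words and the keyword table are hypothetical names, and it assumes that whitespace separates tokens, which the original string_component class does not.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace, then looks each word up in a
// keyword table. Anything that is not a keyword is reported as STRING.
std::vector<std::string> classify_words(const std::string& input)
{
    // "daniweb" -> WEB comes from the example; add further keywords here.
    static const std::map<std::string, std::string> keywords = {
        { "daniweb", "WEB" },
    };

    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (std::isspace(static_cast<unsigned char>(input[i]))) {
            tokens.push_back("SPACE");  // one token per whitespace character
            ++i;
            continue;
        }
        // Collect the run of non-whitespace characters.
        const std::size_t start = i;
        while (i < input.size() &&
               !std::isspace(static_cast<unsigned char>(input[i])))
            ++i;
        const std::string word = input.substr(start, i - start);
        const auto hit = keywords.find(word);
        tokens.push_back(hit != keywords.end() ? hit->second
                                               : std::string("STRING"));
    }
    return tokens;
}
```

With this sketch, classify_words("test_string daniweb") yields STRING SPACE WEB, while "daniwebtest" stays a single STRING because the whole word is looked up, not a prefix of it.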

iamthwee

  #4  
Aug 29th, 2007
What is this Flex? Some kind of regular expression library or something? Do you even need it, or can your problem be simplified?

n.aggel

  #5  
Aug 29th, 2007
Originally posted by iamthwee:
What is this Flex? some kinda regular expression library or something. Do you even need it or can your problem be simplified?


Flex (The Fast Lexical Analyzer)
Flex is a fast lexical analyser generator. It is a tool for generating programs that perform pattern-matching on text. Flex is a non-GNU free implementation of the well-known Lex program.


http://www.gnu.org/software/flex/
http://flex.sourceforge.net/

iamthwee

  #6  
Aug 29th, 2007
Um, OK, please explain this:

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

and what do you think it does?

nedrocks

  #7  
Aug 29th, 2007
There's a way to set precedence of regex's in flex. I don't remember the exact syntax, but you should put it before your catchall regex that you have defined there.
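
For reference, the selection rule described in the flex manual is: the longest match wins, and if two rules match the same amount of text, the rule listed first in the file is chosen. Rule order therefore only decides ties. A tiny sketch of that rule, with pick_rule as a hypothetical name and the match lengths supplied by hand:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: match_lengths[i] is how many characters rule i
// (in file order) matches at the current position. Returns the index of
// the winning rule, or -1 if no rule matches anything.
int pick_rule(const std::vector<std::size_t>& match_lengths)
{
    int winner = -1;
    std::size_t best = 0;
    for (std::size_t i = 0; i < match_lengths.size(); ++i) {
        // Only a strictly longer match replaces the current winner,
        // so equal lengths keep the rule that was listed first.
        if (match_lengths[i] > best) {
            best = match_lengths[i];
            winner = static_cast<int>(i);
        }
    }
    return winner;
}
```

With the rules in the order WEB, SPACE, STRING, pick_rule({7, 0, 7}) returns 0 (a tie on "daniweb" goes to WEB), but pick_rule({0, 0, 19}) returns 2: when the string pattern matches more text, moving the keyword rule earlier does not help.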

n.aggel

  #8  
Aug 30th, 2007
Originally posted by nedrocks:
There's a way to set precedence of regex's in flex. I don't remember the exact syntax, but you should put it before your catchall regex that you have defined there.

I haven't seen what you mention in the manual.

Unfortunately, I haven't found the solution. I worked around my problem by changing the grammar (i.e. the bison file), and finally I handed in the project. Now, when I find the time, I will try to find a solution using start conditions.

iamthwee

  #9  
Aug 30th, 2007
First you gotta know what your regular expressions are doing.

To me, string_component [0-9a-zA-Z \t\.!#$%^&()*@_] and the example you have given are contradictory, as Salem mentioned.
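
To make the contradiction concrete, here is a small membership test for that character class. It is only a sketch: in_string_component is a hypothetical name, and it assumes ASCII input in the default "C" locale.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper: true if c belongs to the class
// [0-9a-zA-Z \t\.!#$%^&()*@_] used for string_component.
bool in_string_component(char c)
{
    if (std::isalnum(static_cast<unsigned char>(c)))  // 0-9, a-z, A-Z
        return true;
    // Remaining members of the class. Note the leading blank and the tab.
    // The '\0' guard is needed because strchr also finds the terminator.
    return c != '\0' && std::strchr(" \t.!#$%^&()*@_", c) != nullptr;
}
```

in_string_component(' ') and in_string_component('\t') are both true, so a run of string_component characters carries on straight through the blank before "daniweb"; only characters outside the class, such as '\n' or '-', end it.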

vijayan121

  #10  
Aug 30th, 2007
Using Boost.Spirit may be much easier: http://www.boost.org/libs/spirit/doc/quick_start.html

#include <boost/spirit/core.hpp>
#include <iostream>
#include <iterator>   // std::ostream_iterator
#include <string>
#include <vector>
#include <algorithm>
#include <boost/assign.hpp>
using namespace std ;
using namespace boost ;
using namespace boost::spirit ;
using namespace boost::assign ;

struct parse_it
{
  void operator() ( const string& str ) const
  {
    vector<string> tokens ;
    const char* cstr = str.c_str() ;
    size_t n = 0 ;
    // parse one token at a time; alternatives are tried in the order listed,
    // so the keywords are checked before the catch-all STRING
    while( n < str.size() )
      n += parse( cstr + n,
                  (+space_p) [ push_back_a( tokens, "SPACE" ) ] |
                  str_p("daniweb") [ push_back_a( tokens, "WEB" ) ] |
                  str_p("lexer") [ push_back_a( tokens, "LEX" ) ] |
                  str_p("tokenizer") [ push_back_a( tokens, "TOK" ) ] |
                  (+~space_p) [ push_back_a( tokens, "STRING" ) ]
                ).length ;
    cout << '\n' << "parsed: " << str << "\ntokens: " ;
    copy( tokens.begin(), tokens.end(),
          ostream_iterator<string>(cout," ") ) ;
    cout << '\n' ;
  }
};

int main()
{
  vector<string> test_cases = list_of
      ( "test daniweb lexer xyz tokenizer lexer" )
      ( "daniweblexer tokenizerlexer abcd lexerlexer" )
      ( "daniwebtest lexerdaniweblexertest tokenizerxxx" ) ;
  for_each( test_cases.begin(), test_cases.end(), parse_it() ) ;
}
/**
>g++ -Wall -std=c++98 -I/usr/local/include keyword.cpp && ./a.out

parsed: test daniweb lexer xyz tokenizer lexer
tokens: STRING SPACE WEB SPACE LEX SPACE STRING SPACE TOK SPACE LEX

parsed: daniweblexer tokenizerlexer abcd lexerlexer
tokens: WEB LEX SPACE TOK LEX SPACE STRING SPACE LEX LEX

parsed: daniwebtest lexerdaniweblexertest tokenizerxxx
tokens: WEB STRING SPACE LEX WEB LEX STRING SPACE TOK STRING
*/