Parse or Tokenize String

Thread Solved

#1 Akilah712, Jul 18th, 2007
I am working on a project and I need to process a text file.

I have read in the text file.
What I want to do is break the text file up. The text file looks like this:

>Name 1
ABCDEF
GHIJKLM
>Name2
GHIJKLM

What I want is to store each name, and the sequence that follows it, separately. For instance, Name[0] = Name1 and Name[1] = Name2; Letters[0] = ABCDEF and Letters[1] = GHIJKLM.

I have done this in Java, where I used StringTokenizer, but from what I have read, there is no tokenizer in C++.
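
Standard C++ has no class named like Java's StringTokenizer, but std::getline with a custom delimiter on a std::istringstream covers the common case. A minimal sketch, assuming a single delimiter character is enough; the helper name split is made up for illustration:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): splits `text` at every
// occurrence of `delim` and skips empty tokens, much as Java's
// StringTokenizer does.
std::vector<std::string> split(const std::string& text, char delim)
{
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    // getline reads up to (and discards) the next delimiter character
    while (std::getline(stream, token, delim)) {
        if (!token.empty())
            tokens.push_back(token);
    }
    return tokens;
}
```

Calling split(contents, '>') on the sample file would give one token per name+sequence set, with the line breaks inside each set left untouched.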

So far I have read the entire contents of the text file into a buffer. From there I have split the contents into name+sequence sets. Now I need to separate each name from its set of letters. Here is what I have so far:

// Declared here because processFile() uses it before its definition.
void Tokenize(const string& str, vector<string>& tokens, string del);

void processFile (){
    string contents;
    string fileName;

    cout << "Enter the file name: ";
    getline (cin,fileName);

    //Open file
    ifstream file(fileName.c_str()); // might want to add binary mode here

    //Read contents of file into a string
    stringstream buffer;
    buffer << file.rdbuf();
    string str(buffer.str());
    contents = str.c_str();//entire file

    //close the file
    file.close();

    //Use tokenizer function to get name and sequence sets
    //will store the tokens of each name+sequence
    vector<string> sets;

    //get the sets - name+sequence
    Tokenize (contents, sets, ">");

    //stores the split names and sequences
    vector<string> dna;
    //split the sets
    for (int x = 0; x < sets.size(); x++){
        Tokenize (sets[x], dna, "\n");
    }

    //store the names (names and sequences are declared elsewhere, not shown)
    for (int i = 0; i < dna.size();){
        names.push_back(dna[i]);
        i = i + 2;
    }
    //store each sequence
    for (int j = 1; j < dna.size();){
        sequences.push_back(dna[j]);
        j = j + 2;
    }

}//End processFile


void Tokenize(const string& str, vector<string>& tokens, string del)
{
    string delimiters = del;

    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after that: the end of the first token.
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter: the end of the next token.
        pos = str.find_first_of(delimiters, lastPos);
    }
}


The problem is with getting the names and the sets of letters. Basically I split the string I made at each occurrence of ">". After that it does not work well.
Last edited by Akilah712; Jul 18th, 2007 at 7:56 pm.

#2 Bench, Jul 18th, 2007
Why are you copying the file contents out of a filestream, into a stringstream, then, into a string, and then, into another string?

You can extract each line of the file, one by one, straight into your vector without all that fuss:
#include <iostream>
#include <vector>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    ifstream fs("test.txt");
    string input;
    vector<string> sets;
    // read the file line by line; each line becomes one vector element
    while( getline(fs, input) )
        sets.push_back(input);
}
All that remains is to work out which elements in your vector are names (the ones which start with '>').
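
One way to pick out the name lines can be sketched as follows. This is only an illustration: the helper names is_name_line and name_positions are made up, and it assumes a name line is any non-empty line whose first character is '>'.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: a line counts as a name when it starts with '>'.
bool is_name_line(const std::string& line)
{
    return !line.empty() && line[0] == '>';
}

// Returns the indices of all name lines within `lines`.
std::vector<std::size_t> name_positions(const std::vector<std::string>& lines)
{
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (is_name_line(lines[i]))
            positions.push_back(i);
    }
    return positions;
}
```

Everything between two consecutive positions then belongs to the sequence of the earlier name.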
¿umop apisdn upside down?

#3 Akilah712, Jul 19th, 2007
I tried that. However it only made things more difficult.

It split the contents of the file into individual lines therefore splitting up the information that I need.

Example.
>Name
ABCDEFG
HIJKLMNO

I have to extract the name, and then the letters must be stored together?

#4 Jessehk, Jul 19th, 2007
I don't exactly understand the question, but have you considered using something like boost::tokenizer?

http://boost.org/libs/tokenizer/index.html
--Jessehk

#5 Bench, Jul 19th, 2007
Originally posted by Akilah712:
I tried that. However it only made things more difficult.

It split the contents of the file into individual lines therefore splitting up the information that I need.

Example.
>Name
ABCDEFG
HIJKLMNO

I have to extract the name, and then the letters must be stored together?
That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.

There are loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.
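
As one possible way of storing the 'records', here is a rough sketch that keeps each name together with its joined sequence. The Record struct and the group_records function are hypothetical names, and the sketch assumes the file has already been read into a vector of lines:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: one name plus its concatenated sequence.
struct Record {
    std::string name;
    std::string sequence;
};

// Groups lines into records: a line starting with '>' opens a new record,
// every other line is appended to the sequence of the current record.
std::vector<Record> group_records(const std::vector<std::string>& lines)
{
    std::vector<Record> records;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        if (!line.empty() && line[0] == '>') {
            Record r;
            r.name = line.substr(1);   // drop the leading '>'
            records.push_back(r);
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
        // lines that appear before the first name are ignored
    }
    return records;
}
```

A single vector of records avoids having to keep two parallel arrays in step, though two separate vectors would work just as well.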
¿umop apisdn upside down?

#6 Akilah712, Jul 19th, 2007
Originally posted by Bench:
That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.

There are loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.
No luck here.

I can isolate the names and that's it.

I want an array of names and an array of sequences.

Names[0] = name1

Names[1] = name2

Sequences[0] = ABC...
Sequences[1] = ABC....

I just want to parse the text file at the ">" symbol.
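
Parsing the whole text at the ">" symbol could look roughly like the sketch below. It assumes the file contents are already in one string, that the first line after each '>' is the name, and that the remaining lines of the record should be joined without line breaks. The function name parse_at_marker is made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: fills two parallel vectors from the whole file text.
void parse_at_marker(const std::string& contents,
                     std::vector<std::string>& names,
                     std::vector<std::string>& sequences)
{
    std::string::size_type start = contents.find('>');
    while (start != std::string::npos) {
        // The record runs up to the next '>' or to the end of the text.
        std::string::size_type next = contents.find('>', start + 1);
        std::string record = contents.substr(
            start + 1,
            next == std::string::npos ? std::string::npos : next - start - 1);

        // The first line of the record is the name.
        std::string::size_type eol = record.find('\n');
        names.push_back(record.substr(0, eol));

        // The rest is the sequence, joined with the line breaks removed.
        std::string sequence;
        if (eol != std::string::npos) {
            for (std::string::size_type i = eol + 1; i < record.size(); ++i) {
                if (record[i] != '\n' && record[i] != '\r')
                    sequence += record[i];
            }
        }
        sequences.push_back(sequence);
        start = next;
    }
}
```

After the call, names[i] and sequences[i] describe the same record, which matches the Names[...] / Sequences[...] layout described above.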

#7 vijayan121, Jul 20th, 2007
#include <fstream>
#include <string>
#include <vector>
#include <cassert>
using namespace std ;

int main()
{
    const char DELIM = '>' ;
    const char* const file_name = "whatever" ;
    ifstream file( file_name ) ; assert(file) ;
    vector<string> names, sequences ;
    string line ;

    // skip lines till we get one starting with DELIM
    while( getline(file, line) )
        if( !line.empty() && line[0]==DELIM ) break ;

    // no line starting with DELIM was found: nothing to parse
    if( !file ) return 0 ;

    names.push_back( line.substr(1) ) ;   // name without the leading DELIM
    string charseq ;
    while( getline(file, line) )
    {
        if( !line.empty() && line[0] == DELIM )
        {
            // a new record starts: store the sequence collected so far
            sequences.push_back(charseq) ;
            charseq.clear() ;
            names.push_back( line.substr(1) ) ;
        }
        else
            charseq += line + '\n' ;   // keeps the line breaks inside a sequence
    }
    sequences.push_back(charseq) ;   // sequence of the last record
}

#8 Akilah712, Jul 20th, 2007
Thanks.

It worked, but I had to change charseq.clear() to charseq.erase(0, charseq.length());

My compiler gave me an error for clear. Said it's not part of the basic string library.
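
For reference, clear() is a member of std::string in standard C++, so an error like this most likely points to an older, pre-standard library implementation (that is an assumption; the compiler is not named here). Calling erase() with no arguments has the same effect and tends to be accepted by such libraries. A small sketch, with the helper name reset_string made up for illustration:

```cpp
#include <string>

// Hypothetical helper: empties `s` without relying on clear().
void reset_string(std::string& s)
{
    // erase() with default arguments removes every character,
    // the same result as s.erase(0, s.length()) or s.clear()
    s.erase() ;
}
```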

Thanks again!!!
©2003 - 2009 DaniWeb® LLC