Jul 18th, 2007

Parse or Tokenize String

I am working on a project and I need to process a text file.

I have read in the text file.
What I want to do is break the textfile up. The textfile looks like this:

>Name 1
ABCDEF
GHIJKLM
>Name2
GHIJKLM

What I want is to store each name, and the sequence that follows it, separately. For instance Name[0] = Name1, Name[1] = Name2, Letters[0] = ABCDEF and Letters[1] = GHIJKLM.

I have done this in Java, where I used the StringTokenizer, but from what I have read, there is no tokenizer in C++.

So far I have read the entire contents of the text file into a buffer. From there I have split the contents into two parts. Now I need to separate each name from its set of letters. Here is what I have so far:

    void processFile()
    {
        string contents;
        string fileName;

        cout << "Enter the file name: ";
        getline(cin, fileName);

        // Open file
        ifstream file(fileName.c_str()); // might want to add binary mode here

        // Read contents of file into a string
        stringstream buffer;
        buffer << file.rdbuf();
        string str(buffer.str());
        contents = str.c_str(); // entire file

        // Close the file
        file.close();

        // Use tokenizer function to get name and sequence sets;
        // will store the tokens of each name+sequence
        vector<string> sets;

        // Get the sets - name+sequence
        Tokenize(contents, sets, ">");

        // Stores the split names and sequences
        vector<string> dna;
        // Split the sets
        for (int x = 0; x < sets.size(); x++) {
            Tokenize(sets[x], dna, "\n");
        }

        // Store the names
        for (int i = 0; i < dna.size();) {
            names.push_back(dna[i]);
            i = i + 2;
        }
        // Store each sequence
        for (int j = 1; j < dna.size();) {
            sequences.push_back(dna[j]);
            j = j + 2;
        }
    } // End processFile


    void Tokenize(const string& str, vector<string>& tokens, string del)
    {
        string delimiters = del;

        // Skip delimiters at beginning.
        string::size_type lastPos = str.find_first_not_of(delimiters, 0);
        // Find the first delimiter after that position.
        string::size_type pos = str.find_first_of(delimiters, lastPos);

        while (string::npos != pos || string::npos != lastPos)
        {
            // Found a token, add it to the vector.
            tokens.push_back(str.substr(lastPos, pos - lastPos));
            // Skip delimiters. Note the "not_of"
            lastPos = str.find_first_not_of(delimiters, pos);
            // Find the next delimiter.
            pos = str.find_first_of(delimiters, lastPos);
        }
    }


The problem is with getting the names and the sets of letters. Basically I split the string I made at each occurrence of ">". After that it does not work well.
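Judging from the code as posted, every call to Tokenize(sets[x], dna, "\n") appends all lines of a set to the same dna vector, so a name followed by two sequence lines contributes three tokens and the every-other-element loops fall out of step. A minimal sketch of handling each set on its own follows. The helper name split_set is made up for illustration, and it assumes that the first line of a set is the name and that line breaks inside a sequence can simply be dropped.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not from the original post): splits one chunk such as
// "Name 1\nABCDEF\nGHIJKLM\n" into its name and its joined sequence.
void split_set(const std::string& set, std::string& name, std::string& sequence)
{
    const std::string::size_type eol = set.find('\n');

    // Everything up to the first line break is the name
    // (the whole chunk if there is no line break at all).
    name = set.substr(0, eol);

    // The rest, if any, is the sequence.
    sequence = (eol == std::string::npos) ? std::string()
                                          : set.substr(eol + 1);

    // Remove the remaining line breaks so the letters are stored together.
    sequence.erase(std::remove(sequence.begin(), sequence.end(), '\n'),
                   sequence.end());
}
```

Called once per element of sets, this yields exactly one name and one sequence per set, however many lines the sequence spans.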
Last edited by Akilah712; Jul 18th, 2007 at 7:56 pm.
Posted by Akilah712

Jul 18th, 2007

Re: Parse or Tokenize String

Why are you copying the file contents out of a filestream into a stringstream, then into a string, and then into another string?

You can extract each line of the file one by one straight into your vector, without all that fuss:
    #include <iostream>
    #include <vector>
    #include <fstream>
    #include <string>

    using namespace std;

    int main()
    {
        ifstream fs("test.txt");
        string input;
        vector<string> sets;
        while( getline(fs, input) )
            sets.push_back(input);
    }
All that remains is to work out which elements in your vector are names (the ones which start with '>').
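A minimal sketch of that remaining step, assuming the lines are already in a vector and that names are the lines beginning with '>' as in the sample file. The function names is_name_line and name_positions are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: a line counts as a name if it starts with '>'.
bool is_name_line(const std::string& line)
{
    return !line.empty() && line[0] == '>';
}

// Returns the indices of the elements of `lines` that are names.
std::vector<std::size_t> name_positions(const std::vector<std::string>& lines)
{
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < lines.size(); ++i)
        if (is_name_line(lines[i]))
            positions.push_back(i);
    return positions;
}
```

Everything between one returned index and the next then belongs to the sequence of the earlier name.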
Posted by Bench

Jul 19th, 2007

Re: Parse or Tokenize String

I tried that. However it only made things more difficult.

It split the contents of the file into individual lines, therefore splitting up the information that I need.

Example.
>Name
ABCDEFG
HIJKLMNO

I have to extract the name, and then the letters must be stored together.
Posted by Akilah712

Jul 19th, 2007

Re: Parse or Tokenize String

I don't exactly understand the question, but have you considered using something like boost::tokenizer?

http://boost.org/libs/tokenizer/index.html
Posted by Jessehk

Jul 19th, 2007

Re: Parse or Tokenize String

Quote originally posted by Akilah712:
    I tried that. However it only made things more difficult.

    It split the contents of the file into individual lines, therefore splitting up the information that I need.

    Example.
    >Name
    ABCDEFG
    HIJKLMNO

    I have to extract the name, and then the letters must be stored together.
That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.

There are loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.
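One possible way to store the records, sketched under the assumption that the file has already been read line by line into a vector. The Record type and the group_records function are made up for illustration; sequence lines are joined with no separator, and any lines before the first name are ignored.

```cpp
#include <string>
#include <vector>

// Hypothetical record type: one name plus the sequence lines that follow it.
struct Record
{
    std::string name;
    std::string sequence;
};

// Groups already-read lines into records.
std::vector<Record> group_records(const std::vector<std::string>& lines)
{
    std::vector<Record> records;
    for (const std::string& line : lines)
    {
        if (!line.empty() && line[0] == '>')
        {
            // A name line starts a new record; drop the leading '>'.
            Record r;
            r.name = line.substr(1);
            records.push_back(r);
        }
        else if (!records.empty())
        {
            // Any other line is appended to the current record's sequence.
            records.back().sequence += line;
        }
    }
    return records;
}
```

Keeping the name and the sequence in one struct avoids having to keep two parallel vectors in step.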
Posted by Bench

Jul 19th, 2007

Re: Parse or Tokenize String

Quote originally posted by Bench:
    That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.

    There are loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.
No luck here.

I can isolate the names and that's it.

I want an array of names and an array of sequences.

Names[0] = name1

Names[1] = name2

Sequences[0] = ABC...
Sequences[1] = ABC....

I just want to parse the text file at the ">" symbol.
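Splitting at ">" is possible with the standard library alone, because std::getline accepts a delimiter character. The following is a minimal sketch, assuming the whole file is already in a string, that the text starts with '>', and that the lines of a sequence should be joined without a separator. The function name split_at_marker is made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits `contents` at every '>' and fills `names` and `sequences` in step.
// The first line of each chunk is the name; the remaining lines are joined.
void split_at_marker(const std::string& contents,
                     std::vector<std::string>& names,
                     std::vector<std::string>& sequences)
{
    std::istringstream in(contents);
    std::string chunk;
    while (std::getline(in, chunk, '>'))
    {
        if (chunk.empty())
            continue;                 // nothing before the first '>'

        std::istringstream record(chunk);
        std::string name, line, sequence;
        std::getline(record, name);   // first line of the chunk
        while (std::getline(record, line))
            sequence += line;         // join the remaining lines

        names.push_back(name);
        sequences.push_back(sequence);
    }
}
```

After the call, names[i] and sequences[i] belong together for every index i.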
Posted by Akilah712

Jul 20th, 2007

Re: Parse or Tokenize String

    #include <fstream>
    #include <string>
    #include <vector>
    #include <cassert>
    using namespace std ;

    int main()
    {
        const char DELIM = '>' ;
        const char* const file_name = "whatever" ;
        ifstream file( file_name ) ; assert(file) ;
        vector<string> names, sequences ;
        string line ;

        // skip lines till we get one starting with DELIM
        while( getline(file, line) )
            if( !line.empty() && line[0]==DELIM ) break ;

        // no line starting with DELIM was found: nothing to parse
        if( !file ) return 0 ;

        names.push_back( line.substr(1) ) ; // drop the leading DELIM
        string charseq ;
        while( getline(file, line) )
        {
            if( !line.empty() && line[0] == DELIM )
            {
                // a new name starts: store the sequence collected so far
                sequences.push_back(charseq) ;
                charseq.clear() ;
                names.push_back( line.substr(1) ) ;
            }
            else
                charseq += line + '\n' ; // keeps a '\n' after each line
        }
        sequences.push_back(charseq) ; // sequence of the last name
    }
Posted by vijayan121

Jul 20th, 2007

Re: Parse or Tokenize String

Thanks.

It worked, but I had to change charseq.clear() to charseq.erase(0, charseq.length());

My compiler gave me an error for clear(); it said clear is not part of the basic string library.

Thanks again!!!
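For reference, std::string::clear() is part of the C++ standard, but some older library implementations did not provide it. Whether that is what happened here is an assumption, since the thread does not name the compiler. The calls below are interchangeable ways of emptying a string; the wrapper function names are made up for illustration.

```cpp
#include <string>

// Three ways to empty a std::string; each leaves s.empty() true.
void empty_with_clear(std::string& s)  { s.clear(); }
void empty_with_erase(std::string& s)  { s.erase(0, s.length()); }
void empty_with_assign(std::string& s) { s = ""; }
```

The erase and assignment forms are the ones to reach for when clear() is not available.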
Posted by Akilah712

This thread is solved
