Parse or Tokenize String

I am working on a project and I need to process a text file.

I have read in the text file.
What I want to do is break the textfile up. The textfile looks like this:

>Name 1
ABCDEF
GHIJKLM
>Name2
GHIJKLM

What I want is to store each name, and the sequence that follows it, separately. For instance, Name[0] = Name1 and Name[1] = Name2; Letters[0] = ABCDEF and Letters[1] = GHIJKLM.

I have done this in Java, where I used StringTokenizer, but from what I have read, there is no tokenizer in C++.

So far I have read the entire contents of the text file into a buffer. From there I have split the file into two parts. Now I need to separate each name from its set of letters. Here is what I have so far:

void processFile (){
	string contents;
	string fileName;

	cout << "Enter the file name: ";
	getline (cin,fileName);
	
	//Open file
	ifstream file(fileName.c_str());    // might want to add binary mode here
	
	//Read contents of file into a string
	stringstream buffer;
	buffer << file.rdbuf();	
	string str(buffer.str()); 
	contents = str.c_str();//entire file
	
       //close the file	
	file.close();
	
	
	// Use the Tokenize function to get the name+sequence sets.
	// sets will store one token per name+sequence record.
	vector<string> sets;

	// Get the sets - name+sequence
	Tokenize (contents, sets, ">");

	// Stores the names and sequences after splitting
	vector<string> dna;
	// Split the sets
	for (int x = 0; x < sets.size(); x++){
		Tokenize (sets[x], dna, "\n");
	}
	

	//store the names
	for (int i = 0; i<dna.size();){
	
		names.push_back(dna[i]);
		i = i + 2;
		
	}
	//store each sequence
	for (int j = 1; j<dna.size();){
	
		sequences.push_back(dna[j]);
		j = j + 2;
	}

}//End processFile


void Tokenize(const string& str,vector<string>& tokens, string del)
{
    string delimiters = del;

    // Skip delimiters at the beginning; lastPos is the start of the first token.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after that; pos is the end of the first token.
    string::size_type pos     = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters to reach the start of the next token. Note the "not_of".
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the delimiter that ends the next token.
        pos = str.find_first_of(delimiters, lastPos);
    }
}

The problem is with getting the names and the sets of letters. Basically, I split the string I made at each occurrence of ">". After that it does not work well.

Akilah712
Newbie Poster
21 posts since Jun 2007
Reputation Points: 10
Solved Threads: 0
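
On the sample file, the even/odd indexing in processFile goes wrong because Tokenize(sets[x], dna, "\n") appends every line of every record to the same dna vector: a name followed by two sequence lines contributes three entries, so names and sequences no longer alternate. Standard C++ has no StringTokenizer class, but std::getline accepts a delimiter character and can do a similar job. The following is a minimal sketch of that idea, not the poster's code: splitRecords is a hypothetical helper, and joining the sequence lines by dropping the line breaks is an assumption about the desired output.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): splits text of the form
//   >Name\nLINE\nLINE\n>Name\nLINE\n
// into parallel vectors of names and sequences.
void splitRecords(const std::string& contents,
                  std::vector<std::string>& names,
                  std::vector<std::string>& sequences)
{
    std::istringstream in(contents);
    std::string record;

    // Discard anything before the first '>' (usually nothing).
    std::getline(in, record, '>');

    // Each further call yields the text of one record, without its '>'.
    while (std::getline(in, record, '>'))
    {
        // The name is everything up to the first newline of the record.
        const std::string::size_type eol = record.find('\n');
        names.push_back(record.substr(0, eol));

        // The rest is the sequence; join its lines by dropping line breaks
        // (an assumption about the desired output).
        std::string sequence;
        if (eol != std::string::npos)
        {
            for (std::string::size_type i = eol + 1; i < record.size(); ++i)
                if (record[i] != '\n' && record[i] != '\r')
                    sequence += record[i];
        }
        sequences.push_back(sequence);
    }

    // The two vectors are filled in lockstep, one entry per record.
    assert(names.size() == sequences.size());
}
```

Called with the whole file contents in a string, this would leave one name per record in names and the joined letters for that record at the same index in sequences.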
 

Why are you copying the file contents out of a filestream, into a stringstream, then, into a string, and then, into another string?

You can extract each line of the file, one by one, straight into your vector without all that fuss:

#include <iostream>
#include <vector>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    ifstream fs("test.txt");
    string input;
    vector<string> sets;
    while( getline(fs, input) )
        sets.push_back(input);
}


All that remains is to work out which elements in your vector are names (the ones which start with '>').

Bench
Posting Pro
577 posts since Feb 2006
Reputation Points: 307
Solved Threads: 63
 

I tried that. However it only made things more difficult.

It split the contents of the file into individual lines, which breaks up the information that I need to keep together.

Example:
>Name
ABCDEFG
HIJKLMNO

I have to extract the name, and then the letters must be stored together????

Akilah712
Newbie Poster
21 posts since Jun 2007
Reputation Points: 10
Solved Threads: 0
 

I don't exactly understand the question, but have you considered using something like boost::tokenizer?

http://boost.org/libs/tokenizer/index.html

Jessehk
Newbie Poster
20 posts since May 2006
Reputation Points: 33
Solved Threads: 2
 

That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.

There are loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.

Bench
Posting Pro
577 posts since Feb 2006
Reputation Points: 307
Solved Threads: 63
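
The suggestion above (isolate the name lines, then concatenate the others) could look roughly like this. It is only a sketch, assuming the file has already been read into a vector with one line per element and that the sequence lines should be joined without any separator; groupLines is a hypothetical name, not something from the thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: 'lines' holds the file, one line per element,
// as produced by a getline loop.
void groupLines(const std::vector<std::string>& lines,
                std::vector<std::string>& names,
                std::vector<std::string>& sequences)
{
    for (std::vector<std::string>::size_type i = 0; i < lines.size(); ++i)
    {
        const std::string& line = lines[i];
        if (!line.empty() && line[0] == '>')
        {
            // A name line starts a new record with an empty sequence.
            names.push_back(line.substr(1));
            sequences.push_back(std::string());
        }
        else if (!sequences.empty())
        {
            // Any other line belongs to the most recent name.
            sequences.back() += line;
        }
        // Lines that appear before the first name are ignored.
    }

    // One sequence entry is created for every name.
    assert(names.size() == sequences.size());
}
```

Because a sequence entry is pushed at the same moment as its name, names[i] and sequences[i] always refer to the same record, however many lines the sequence spans.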
 

No luck here.

I can isolate the names and that's it.

I want an array of names and an array of sequences.

Names[0] = name1

Names[1] = name2

Sequences[0] = ABC...
Sequences[1] = ABC....

I just want to parse the text file at the ">" symbol.

Akilah712
Newbie Poster
21 posts since Jun 2007
Reputation Points: 10
Solved Threads: 0
 
#include <fstream>
#include <string>
#include <vector>
#include <cassert>
using namespace std ;

int main()
{
  const char DELIM = '>' ;
  const char* const file_name = "whatever" ;
  ifstream file( file_name ) ; assert(file) ;
  vector<string> names, sequences ;
  string line ; 

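  // skip lines till we get one starting with DELIM
  bool found = false ;
  while( getline(file, line) ) 
    if( !line.empty() && line[0]==DELIM ) { found = true ; break ; }
  if( !found ) return 1 ; // no line starts with DELIM: nothing to parse

  names.push_back( line.substr(1) ) ;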
  string charseq ;
  while( getline(file, line) )
  {
    if( !line.empty() && line[0] == DELIM )
    {
      sequences.push_back(charseq) ;
      charseq.clear() ;
      names.push_back( line.substr(1) ) ;
    }
    else
     charseq += line + '\n' ;
  }
  sequences.push_back(charseq) ;
}
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
 

Thanks.

It worked, but I had to change charseq.clear() to charseq.erase(0, charseq.length());

My compiler gave me an error for clear. Said it's not part of the basic string library.

Thanks again!!!

Akilah712
Newbie Poster
21 posts since Jun 2007
Reputation Points: 10
Solved Threads: 0
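
On the clear() error: std::string::clear() is part of standard C++, but some older compilers reportedly shipped string libraries without it, which would be consistent with the message described above. The erase call is an equivalent replacement. As a small sketch (emptyString is a hypothetical helper used only for illustration), erase() with no arguments does the same thing as erase(0, length()):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: empties a string without calling clear().
void emptyString(std::string& s)
{
    // erase() defaults to position 0 and "all remaining characters",
    // so it removes the whole content, like erase(0, s.length()).
    s.erase();
    assert(s.empty());
}
```

Either form leaves the string empty and ready to accumulate the next sequence.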
 

This question has already been solved
