Parse or Tokenize String
Thread Solved
I am working on a project and I need to process a text file.
I have read in the text file.
What I want to do is break the textfile up. The textfile looks like this:
>Name 1
ABCDEF
GHIJKLM
>Name2
GHIJKLM
What I want is to store each name, and the sequence that follows it, separately. For instance, Name[0] = Name1, Name[1] = Name2, Letters[0] = ABCDEF and Letters[1] = GHIJKLM.
I have done this in Java, where I used the string tokenizer, but from what I have read, there is no tokenizer in C++.
So far I have read the entire contents of the text file into a buffer. From there I have split the file into two parts. Now I need to separate each name from its set of letters. Here is what I have so far:
C++ Syntax (Toggle Plain Text)
void processFile (){
    string contents;
    string fileName;
    cout << "Enter the file name: ";
    getline (cin,fileName);

    //Open file
    ifstream file(fileName.c_str()); // might want to add binary mode here

    //Read contents of file into a string
    stringstream buffer;
    buffer << file.rdbuf();
    string str(buffer.str());
    contents = str.c_str();//entire file

    //close the file
    file.close();

    //Use tokenizer function to get name and sequence sets
    //will store the tokens of each name+sequence
    vector<string> sets;

    //get the sets - name+sequence
    Tokenize (contents, sets, ">");

    //stores the splitted names and sequences
    vector<string>dna;

    //split the sets
    for (int x = 0; x < sets.size(); x++){
        Tokenize (sets[x], dna, "\n");
    }

    //store the names
    for (int i = 0; i<dna.size();){
        names.push_back(dna[i]);
        i = i + 2;
    }

    //store each sequence
    for (int j = 1; j<dna.size();){
        sequences.push_back(dna[j]);
        j = j + 2;
    }
}//End processFile

void Tokenize(const string& str,vector<string>& tokens, string del)
{
    string delimiters = del;
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}
The problem is with getting the names and the sets of letters. Basically, I split the string I made at each occurrence of ">". After that it does not work well.
Last edited by Akilah712; Jul 18th, 2007 at 7:56 pm.
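One likely reason the names and sequences get mixed up: every set is tokenized into the same dna vector, and the loops then assume a strict name, sequence, name, sequence pattern. A sequence that spans two lines breaks that pattern. A minimal sketch of one adjustment, reusing the Tokenize idea from the post; splitRecords is a hypothetical helper name, and joining the sequence lines without a separator is an assumption about the desired result:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Same splitting idea as the Tokenize function in the post.
void Tokenize(const std::string& str, std::vector<std::string>& tokens,
              const std::string& delimiters)
{
    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (pos != std::string::npos || lastPos != std::string::npos) {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
}

// Hypothetical helper: fills two parallel vectors from the file contents.
// Assumes nothing but blank lines comes before the first '>'.
void splitRecords(const std::string& contents,
                  std::vector<std::string>& names,
                  std::vector<std::string>& sequences)
{
    std::vector<std::string> sets;
    Tokenize(contents, sets, ">");

    for (std::size_t x = 0; x < sets.size(); ++x) {
        // Tokenize each set into its own vector, not a shared one.
        std::vector<std::string> lines;
        Tokenize(sets[x], lines, "\r\n");
        if (lines.empty())
            continue;

        // The first line of a set is the name; the rest is the sequence.
        names.push_back(lines[0]);
        std::string sequence;
        for (std::size_t i = 1; i < lines.size(); ++i)
            sequence += lines[i];
        sequences.push_back(sequence);
    }

    // Every name has exactly one (possibly empty) sequence.
    assert(names.size() == sequences.size());
}
```

Called on the file contents string, this leaves names[i] and sequences[i] describing the same record.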
Why are you copying the file contents out of a filestream into a stringstream, then into a string, and then into another string?
You can extract each line of the file one by one straight into your vector, without all that fuss. All that remains is to work out which elements in your vector are names (the ones which start with '>').
CPP Syntax (Toggle Plain Text)
#include <iostream>
#include <vector>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    ifstream fs("test.txt");
    string input;
    vector<string> sets;
    while( getline(fs, input) )
        sets.push_back(input);
}
I don't exactly understand the question, but have you considered using something like boost::tokenizer?
http://boost.org/libs/tokenizer/index.html
--Jessehk
I tried that. However it only made things more difficult.
It split the contents of the file into individual lines therefore splitting up the information that I need.
Example.
>Name
ABCDEFG
HIJKLMNO
I have to extract the name, and then the letters must be stored together?
That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.
There's loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.
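A rough sketch of that idea, assuming the file has already been read one line per vector element as in the earlier snippet. The helper name groupLines and the choice to join sequence lines without a separator are assumptions, not something stated in the thread:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: 'lines' holds the file, one line per element.
void groupLines(const std::vector<std::string>& lines,
                std::vector<std::string>& names,
                std::vector<std::string>& sequences)
{
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        if (!line.empty() && line[0] == '>') {
            // A name line: store it without the '>' and open a new sequence.
            names.push_back(line.substr(1));
            sequences.push_back(std::string());
        } else if (!sequences.empty()) {
            // Any other line belongs to the most recent name.
            sequences.back() += line;
        }
        // Lines that appear before the first name are ignored.
    }

    // Every name has exactly one (possibly empty) sequence.
    assert(names.size() == sequences.size());
}
```

Opening an empty sequence whenever a name is found keeps the two vectors the same length, so names[i] always pairs with sequences[i].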
I can isolate the names and that's it.
I want an array of names and an array of sequences.
Names[0] = name1
Names[1] = name2
Sequences[0] = ABC...
Sequences[1] = ABC....
I just want to parse the text file at the ">" symbol.
C++ Syntax (Toggle Plain Text)
#include <fstream>
#include <string>
#include <vector>
#include <cassert>

using namespace std ;

int main()
{
  const char DELIM = '>' ;
  const char* const file_name = "whatever" ;
  ifstream file( file_name ) ;
  assert(file) ;

  vector<string> names, sequences ;
  string line ;

  // skip lines till we get one starting with DELIM
  while( getline(file, line) )
    if( !line.empty() && line[0]==DELIM ) break ;
  names.push_back( line.substr(1) ) ;

  string charseq ;
  while( getline(file, line) )
  {
    if( !line.empty() && line[0] == DELIM )
    {
      sequences.push_back(charseq) ;
      charseq.clear() ;
      names.push_back( line.substr(1) ) ;
    }
    else charseq += line + '\n' ;
  }
  sequences.push_back(charseq) ;
}
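Since the request was to parse at the ">" symbol, another option is to let std::getline do the splitting by passing '>' as its delimiter. This is only a sketch: readRecords is a hypothetical name, it takes any std::istream so it can be tried on a string first, and unlike the code above it joins the sequence lines without a '\n'. It also splits on a '>' that appears in the middle of a line, which may or may not matter for the data.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads records separated by '>' from any input stream.
void readRecords(std::istream& in,
                 std::vector<std::string>& names,
                 std::vector<std::string>& sequences)
{
    std::string record;

    // Discard whatever comes before the first '>'.
    std::getline(in, record, '>');

    // Each further call yields the text between two '>' characters.
    while (std::getline(in, record, '>')) {
        std::istringstream lines(record);
        std::string name, line, sequence;

        // The first line of a record is the name.
        std::getline(lines, name);

        // The remaining lines are joined into one sequence.
        while (std::getline(lines, line))
            sequence += line;

        names.push_back(name);
        sequences.push_back(sequence);
    }

    // Every name has exactly one (possibly empty) sequence.
    assert(names.size() == sequences.size());
}
```

With a file, the same function can be called on an std::ifstream, since that is also an std::istream.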