Splitting a string into tokens

AdRock (#1, 33 Days Ago)
I have a string and I need to split it into tokens.

The string can contain any letter or number, but it can also contain punctuation such as brackets.

I need to find each occurrence of punctuation, copy the string up to that point as a token, copy the punctuation mark as a token, and so on until the end of the string is reached.

Here is my output for the string dressed(with):
vector 0: dressed
vector 1: (with)
vector 2: dressed(with
vector 3: )
Vectors 0 and 3 are correct, but the output should be:
vector 0: dressed
vector 1: (
vector 2: with
vector 3: )
Here is my code:
    #include <string>
    #include <iostream>
    #include <vector>

    using namespace std;

    int main()
    {
        vector <string> vec;
        string str = "dressed(with)";
        string tmp;

        char punct[] = {'+','-','*','/','<','=','!','>','{','(',')','}',';',','};

        for (int i=0; i < sizeof(punct); i++)
        {
            // size_type rather than unsigned int, so the npos test is reliable
            string::size_type pos = str.find(punct[i], 0);

            if(pos != string::npos)
            {
                tmp.assign(str, 0, pos);
                vec.push_back(tmp);
                tmp.assign(str, pos, pos);
                vec.push_back(tmp);
            }
        }

        for(int a=0; a < vec.size(); a++)
        {
            cout << "vector " << a << ": " << vec.at(a) << endl;
        }

        return 0;
    }

Tom Gunn (#2, 33 Days Ago)
Your search always starts from position 0, which is why tokens are being duplicated. You need to skip over the extracted tokens as you go. Something like this:
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        using namespace std;

        string const punct = "+-*/<=!>{()};,";

        string str = "dressed(with)";
        string::size_type pos = 0;
        vector<string> vec;

        while (pos != string::npos)
        {
            string::size_type end = str.find_first_of(punct, pos);

            if (end == pos) end = str.find_first_not_of(punct, pos);

            vec.push_back(str.substr(pos, end - pos));
            pos = end;
        }

        for (int a = 0; a < vec.size(); ++a)
        {
            cout << "vector " << a << ": " << vec.at(a) << '\n';
        }

        return 0;
    }
-Tommy (For Great Justice!) Gunn
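
A second detail in the original post's code is worth a separate look. The third argument of std::string::assign(str, pos, n), like the second argument of substr, is a character count and not an end position. That is why tmp.assign(str, pos, pos) produced "(with)" rather than "(". The sketch below illustrates the difference; TakeOne and TakeRange are hypothetical helper names used only for this illustration and do not appear anywhere in the thread.

```cpp
#include <string>

// Hypothetical helper: copy exactly one character starting at pos.
std::string TakeOne(std::string const& str, std::string::size_type pos)
{
    return str.substr(pos, 1);
}

// Hypothetical helper: copy the half-open range [begin, end).
// substr takes a count, so the end position is converted to a length.
std::string TakeRange(std::string const& str,
                      std::string::size_type begin,
                      std::string::size_type end)
{
    return str.substr(begin, end - begin);
}
```

With str = "dressed(with)", TakeOne(str, 7) yields "(" and TakeRange(str, 8, 12) yields "with", whereas str.substr(7, 7) asks for seven characters and is clipped to the remaining six, "(with)".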

VernonDozier (#3, 33 Days Ago)
If I recall correctly (from a previous thread), the OP also wants each punctuation character to be its own element of the vector (OP, please clarify), so if the string were "dressed(with))", the OP wants this:
    dressed
    (
    with
    )
    )

rather than
    dressed
    (
    with
    ))

so the OP would have to take it another step or two and further split strings with multiple consecutive punctuation characters into separate strings (again, that's if I was interpreting correctly from earlier threads). But Tom Gunn's code gets you one step closer regardless!

AdRock (#4, 33 Days Ago)
Yes, that is a good point and you are right. There may be an occurrence where that would be needed. Every punctuation mark should be in its own vector element.

How would I rewrite that?

Tom Gunn (#5, 33 Days Ago)
AdRock wrote: "How would I rewrite that?"
Give it a try before asking for help. Depending on your experience solving problems, an hour to several hours of solid work should be the minimum. Honestly, if somebody tells you how to write the code every time, you will not learn anything substantial and you will end up asking for help with everything.

AdRock (#6, 33 Days Ago)
Thanks.

I've just solved another problem I've had for ages, which was splitting the strings of each line into tokens.

That leads on to my current problem, where I need to split each single string into tokens.

AdRock (#7, 32 Days Ago)
I am struggling to come up with a solution for this, as everything I have tried either gets the same output or crashes the program.

This is how I understand it:

    string::size_type end = str.find_first_of(punct, pos);
Assign to the variable end the position of the first occurrence of any of the punctuation characters, starting from the first char.

    if (end == pos)
If the end variable is 0, then

    end = str.find_first_not_of(punct, pos);
assign to the variable end the position of the first occurrence of any non-punctuation character, starting from the first char.

It then loops around, starting at the new pos. Using this string
(dressed(with))
the positions would be 0, 1, 8, 9, 13, but it stops at 13.

Do I have to perform another loop inside the while loop and have the vec.push_back inside it?

The way I thought it would be done is that it finds a punctuation character at some pos and then goes through the while loop again.
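
To check a walkthrough like the one above, it can help to record the positions the loop actually visits. The sketch below reuses the loop from post #2 but stores pos instead of building tokens; TracePositions is a hypothetical name introduced only for this illustration. Both searches start at pos rather than at the first character, and the test end == pos asks whether the character at pos is itself punctuation, not whether end is 0.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: record the value of pos at the top of each iteration
// of the loop from post #2, without building any tokens.
std::vector<std::string::size_type> TracePositions(std::string const& str,
                                                   std::string const& punct)
{
    std::vector<std::string::size_type> visited;
    std::string::size_type pos = 0;

    while (pos != std::string::npos)
    {
        visited.push_back(pos);

        // First punctuation character at or after pos.
        std::string::size_type end = str.find_first_of(punct, pos);

        // If pos itself is punctuation, skip to the end of the punctuation run.
        if (end == pos) end = str.find_first_not_of(punct, pos);

        pos = end;
    }

    return visited;
}
```

For "(dressed(with))" this visits 0, 1, 8, 9 and 13. At 13 the run "))" extends to the end of the string, so find_first_not_of returns npos and both characters end up in one token.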

VernonDozier (#8, 32 Days Ago)
Here's what I think you should do. Don't worry for now about why/how the code did what it did (obviously, you can and probably should go over it later and figure out what it does and why for your own personal knowledge). Test it out with all sorts of input and make sure it does what it's supposed to (break strings into "all non-punctuation" and "all punctuation" strings). If it does work for all possible test cases, go to the next step.

  1. while (pos != string::npos)
  2. {
  3. string::size_type end = str.find_first_of(punct, pos);
  4.  
  5. if (end == pos) end = str.find_first_not_of(punct, pos);
  6.  
  7. vec.push_back(str.substr(pos, end - pos));
  8. pos = end;
  9. }

Break line 7 into two lines:

  1. string newString = str.substr (pos, end - pos);
  2. vec.push_back (newString);

So you end up with this:

  1. while (pos != string::npos)
  2. {
  3. string::size_type end = str.find_first_of(punct, pos);
  4.  
  5. if (end == pos) end = str.find_first_not_of(punct, pos);
  6.  
  7. string newString = str.substr (pos, end - pos);
  8. vec.push_back (newString);
  9. pos = end;
  10. }

Now, if newString contains punctuation, you need to change it into one string for every character. If it doesn't, push the whole string as you do in line 8 above. Test whether it has any punctuation in it, as before, and act accordingly (push it onto the vector if it's all non-punctuation, split it further if it is punctuation):

  1. if (newString.find_first_of (punct) == string::npos)
  2. {
  3. // newString doesn't contain punctuation. Push it.
  4. vec.push_back (newString);
  5. }
  6. else
  7. {
  8. // newString is punctuation. Break newString into one-character strings and push each of them onto vec.
  9. }


So your job is:
  1. Try Tom Gunn's code out. Make sure it "works" for all possible test cases (i.e. change line 11 below for every test case you can think of and make sure the code "behaves"). I imagine it does. Tom Gunn's code generally does. But you need to verify that.
  2. If it does, look at my revised code below. Run it and see what it does. Now delete my line 31 and change line 30 so it does what you need it to do, which is to take a string like "****))" stored in newString, break it into six separate strings, and push them all onto the vec vector.

  1. #include <iostream>
  2. #include <string>
  3. #include <vector>
  4.  
  5. int main()
  6. {
  7. using namespace std;
  8.  
  9. string const punct = "+-*/<=!>{()};,";
  10.  
  11. string str = "dressed(to**!impress{{)";
  12. string::size_type pos = 0;
  13. vector<string> vec;
  14.  
  15. while (pos != string::npos)
  16. {
  17. string::size_type end = str.find_first_of(punct, pos);
  18.  
  19. if (end == pos) end = str.find_first_not_of(punct, pos);
  20.  
  21. string newString = str.substr (pos, end - pos);
  22.  
  23. if (newString.find_first_of (punct) == string::npos)
  24. {
  25. // newString doesn't contain punctuation. Push it.
  26. vec.push_back (newString);
  27. }
  28. else
  29. {
  30. // newString is punctuation. Break newString into one-character strings and push each of them onto vec.
  31. vec.push_back ("PUNCTUATION");
  32. }
  33.  
  34. pos = end;
  35. }
  36.  
  37. for (int a = 0; a < vec.size(); ++a)
  38. {
  39. cout << "vector " << a << ": " << vec.at(a) << '\n';
  40. }
  41.  
  42. return 0;
  43. }
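
One possible shape for the step left open at lines 30 and 31 is sketched here, assuming that newString holds only punctuation characters when the else branch runs. PushEachChar is a hypothetical helper name that does not appear in the thread, and this is only one of several ways to do it.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: append each character of run to vec as its own
// one-character string.
void PushEachChar(std::string const& run, std::vector<std::string>& vec)
{
    for (std::string::size_type i = 0; i < run.size(); ++i)
    {
        // string(1, c) builds a string holding a single copy of c.
        vec.push_back(std::string(1, run[i]));
    }
}
```

Called as PushEachChar(newString, vec) in place of the placeholder push, a run such as "****))" would add six separate elements to vec.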

Tom Gunn (#9, 32 Days Ago)
VernonDozier wrote: Try Tom Gunn's code out. Make sure it "works" for all possible test cases (i.e. change line 11 below for every possible test case you can think of and make sure the code "behaves").
It does not, as you proved. I did not consider adjacent punctuation in my haste to get my post out the door, and I ended up over-engineering the whole thing. Since punctuation is always a single character in this case, the simpler solution for matching punctuation works better:
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<std::string> SplitOnPunct(std::string const& str,
                                          std::string const& punct)
    {
        std::vector<std::string> vec;

        if (str.length() == 0) return vec;

        std::string::size_type pos, end;

        for (pos = 0; pos != std::string::npos; pos = end)
        {
            // Next punctuation character at or after pos (npos if none).
            end = str.find_first_of(punct, pos);

            // pos is punctuation: take exactly one character, and stop the
            // loop if that character was the last one in the string.
            if (end == pos && ++end == str.size()) end = std::string::npos;

            vec.push_back(str.substr(pos, end - pos));
        }

        return vec;
    }

    int main()
    {
        std::vector<std::string> vec = SplitOnPunct("dressed(with)", "+-*/<=!>{()};,");

        for (std::vector<std::string>::size_type a = 0; a < vec.size(); ++a)
        {
            std::cout << "vector " << a << ": " << vec.at(a) << '\n';
        }
    }
I still do not guarantee 100% correctness because it is hard to find my own mistakes. All of the basic test cases seem to work though.
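
For cross-checking test cases, a second implementation written in a different style can be useful. The sketch below is not Tom Gunn's method. It is an alternative that walks the string one character at a time, and SplitByChar is a hypothetical name chosen for this illustration. It assumes, as the thread does, that every punctuation token is a single character.

```cpp
#include <string>
#include <vector>

// Hypothetical alternative: walk the string once, collecting non-punctuation
// characters into word and emitting each punctuation character on its own.
std::vector<std::string> SplitByChar(std::string const& str,
                                     std::string const& punct)
{
    std::vector<std::string> vec;
    std::string word;

    for (std::string::size_type i = 0; i < str.size(); ++i)
    {
        if (punct.find(str[i]) != std::string::npos)
        {
            // Flush the pending word, if any, before the punctuation token.
            if (!word.empty())
            {
                vec.push_back(word);
                word.clear();
            }
            vec.push_back(std::string(1, str[i]));
        }
        else
        {
            word += str[i];
        }
    }

    // A trailing word has no punctuation after it to trigger the flush.
    if (!word.empty()) vec.push_back(word);

    return vec;
}
```

Feeding the same inputs to both functions and comparing the resulting vectors is one way to look for cases where they disagree, such as empty strings, strings that start or end with punctuation, and runs of adjacent punctuation.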

©2003 - 2009 DaniWeb® LLC