I have a string and I need to split it into tokens.

A word can contain any letter or number, but it can also contain punctuation such as brackets.

I need to find each occurrence of punctuation, copy the string up to that point as a token, copy the punctuation mark as its own token, and so on until the end of the string is reached.

Here is my output for the string dressed(with):

vector 0: dressed
vector 1: (with)
vector 2: dressed(with
vector 3: )

Vectors 0 and 3 are correct, but the output should be:

vector 0: dressed
vector 1: (
vector 2: with
vector 3: )

Here is my code:

#include <string>
#include <iostream>
#include <vector>

using namespace std;

int main()
{
	vector <string> vec;
	string str = "dressed(with)";
	string tmp;
	
	char punct[] = {'+','-','*','/','<','=','!','>','{','(',')','}',';',','};

	for (int i=0; i < sizeof(punct); i++)
	{
		unsigned int pos = str.find(punct[i], 0);

		if(pos != string::npos)
		{
			tmp.assign(str, 0, pos);
			vec.push_back(tmp);
			tmp.assign(str, pos, pos);
			vec.push_back(tmp);
		}	
	}
	
	for(int a=0; a < vec.size(); a++)
	{
		cout << "vector " << a << ": " << vec.at(a) << endl; 
	}

	return 0;
}

Your search always starts from position 0, which is why tokens are being duplicated. You need to skip over the extracted tokens as you go. Something like this:

#include <iostream>
#include <string>
#include <vector>

int main()
{
    using namespace std;

    string const punct = "+-*/<=!>{()};,";

    string str = "dressed(with)";
    string::size_type pos = 0;
    vector<string> vec;

    while (pos != string::npos)
    {
        string::size_type end = str.find_first_of(punct, pos);

        if (end == pos) end = str.find_first_not_of(punct, pos);

        vec.push_back(str.substr(pos, end - pos));
        pos = end;
    }

    for (vector<string>::size_type a = 0; a < vec.size(); ++a)
    {
        cout << "vector " << a << ": " << vec.at(a) << '\n'; 
    }

    return 0;
}
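
A side note on the odd "(with)" token in the original output, as a minimal sketch rather than a full fix: the three-argument form of std::string::assign takes a start position and a length, not a start and an end. In the original code, tmp.assign(str, pos, pos) with pos equal to 7 therefore copies up to 7 characters starting at index 7, which is "(with)". The helper name CharAt below is hypothetical and only illustrates passing a length of 1.

```cpp
#include <string>

// Hypothetical helper (not from the original posts): copy exactly one
// character of str, starting at index pos.
std::string CharAt(std::string const& str, std::string::size_type pos)
{
    std::string tmp;

    // assign(str, subpos, sublen): the third argument is a length,
    // so 1 means "one character", not "up to index 1".
    tmp.assign(str, pos, 1);

    return tmp;
}
```

Calling CharAt("dressed(with)", 7) yields "(" where the original call produced "(with)".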
Comments
thank you for your help
Nice snippet.

If I recall correctly (from a previous thread), the OP also wants each punctuation element to be in its own vector (OP, please clarify), so if the string was "dressed(with))", the OP wants this:

  1. dressed
  2. (
  3. with
  4. )
  5. )

rather than

  1. dressed
  2. (
  3. with
  4. ))

so the OP would have to take it another step or two and further split strings with multiple consecutive punctuation characters into separate strings (again, that's if I was interpreting correctly from earlier threads). But Tom Gunn's code gets you one step closer regardless!

Yes, that is a good point and you are right. There may be an occurrence where that would be needed. Every punctuation mark should be in its own vector element.

How would I rewrite that?

Edited 7 Years Ago by AdRock: n/a

How would I rewrite that?

Give it a try before asking for help. Depending on your experience solving problems, at least an hour to several hours of solid work at it should be the minimum. Honestly, if somebody tells you how to write the code every time, you will not learn anything substantial and you will end up asking for help with everything.

Thanks

I've just solved another problem I've had for ages, which was splitting each line into strings.

That leads on to my current problem, where I need to split each single string into tokens.

I am struggling to come up with a solution, as everything I have tried either gives the same output or crashes the program.

This is how I understand it:

string::size_type end = str.find_first_of(punct, pos);

Assign to the variable end the position of the first occurrence of any of the punctuation characters, starting from the first char.

if (end == pos)

If the end variable is 0, then

end = str.find_first_not_of(punct, pos);

assign to the variable end the position of the first occurrence of any non-punctuation character, starting from the first char.

It then loops around, starting at the new pos each time. Using this string

(dressed(with))

the pos values would be 0, 1, 8, 9, 13, but it stops at 13.

Do I have to perform another loop inside the while loop and put the vec.push_back inside it?

The way I thought it would work is that it finds a punctuation character at some pos and then goes through the while loop again.
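
To check that reading of the loop, here is a minimal sketch that records pos at the start of each pass. The helper name TracePositions is hypothetical; the loop body is the one from the earlier answer. Two details are worth noting: both searches start at pos, not at the first character, and the test end == pos asks whether the character at pos is itself punctuation, not whether end is 0.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original posts): returns the value of
// pos at the start of every pass through the tokenizing loop.
std::vector<std::string::size_type> TracePositions(std::string const& str,
                                                   std::string const& punct)
{
    std::vector<std::string::size_type> starts;
    std::string::size_type pos = 0;

    while (pos != std::string::npos)
    {
        starts.push_back(pos);

        // Next punctuation character at or after pos.
        std::string::size_type end = str.find_first_of(punct, pos);

        // If pos is already on punctuation, skip the whole run of it.
        if (end == pos) end = str.find_first_not_of(punct, pos);

        pos = end;
    }

    return starts;
}
```

For "(dressed(with))" this gives 0, 1, 8, 9, 13. The loop stops after 13 because find_first_not_of finds no non-punctuation character after the final "))" and returns npos, so both closing brackets end up in one token.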

Here's what I think you should do. Don't worry for now about why/how the code did what it did (obviously, you can and probably should go over it later and figure out what it does and why, for your own personal knowledge). Test it out with all sorts of input and make sure it does what it's supposed to (break strings into "all non-punctuation" and "all punctuation" strings). If it does work for all possible test cases, go to the next step.

while (pos != string::npos)
    {
        string::size_type end = str.find_first_of(punct, pos);

        if (end == pos) end = str.find_first_not_of(punct, pos);

        vec.push_back(str.substr(pos, end - pos));
        pos = end;
    }

Break the vec.push_back line into two lines:

string newString = str.substr (pos, end - pos);
vec.push_back (newString);

So you end up with this:

while (pos != string::npos)
    {
        string::size_type end = str.find_first_of(punct, pos);

        if (end == pos) end = str.find_first_not_of(punct, pos);

        string newString = str.substr (pos, end - pos);
        vec.push_back (newString);
        pos = end;
    }

Now, if newString contains punctuation, you need to change it into one string for every character. If it doesn't, push the whole string as the vec.push_back (newString); line above does. Test whether it has any punctuation in it, as before, and act accordingly (push it onto the vector if it's all non-punctuation, split it further if it is punctuation):

if (newString.find_first_of (punct) == string::npos)
{
   // newString doesn't contain punctuation. Push it.
   vec.push_back (newString);
}
else
{
   // newString is punctuation.  Break newString into one-character strings and push each of them onto vec.
}

So your job is:

  1. Try Tom Gunn's code out. Make sure it "works" for all possible test cases (i.e. change the line that initializes str in the program below for every test case you can think of and make sure the code "behaves"). I imagine it does; Tom Gunn's code generally does. :) But you need to verify that.
  2. If it does, look at my revised code below. Run it and see what it does. Now delete my vec.push_back ("PUNCTUATION"); line and replace the comment above it with code that does what you need, which is to take a string like "****))" stored in newString, break it into six separate strings, and push them all onto the vec vector.
#include <iostream>
#include <string>
#include <vector>

int main()
{
    using namespace std;

    string const punct = "+-*/<=!>{()};,";

    string str = "dressed(to**!impress{{)";
    string::size_type pos = 0;
    vector<string> vec;

    while (pos != string::npos)
    {
        string::size_type end = str.find_first_of(punct, pos);

        if (end == pos) end = str.find_first_not_of(punct, pos);

        string newString = str.substr (pos, end - pos);

        if (newString.find_first_of (punct) == string::npos)
        {
            // newString doesn't contain punctuation. Push it.
            vec.push_back (newString);
        }
        else
        {
            // newString is punctuation.  Break newString into one-character strings and push each of them onto vec.
            vec.push_back ("PUNCTUATION");
        }

        pos = end;
    }

    for (vector<string>::size_type a = 0; a < vec.size(); ++a)
    {
        cout << "vector " << a << ": " << vec.at(a) << '\n';
    }

    return 0;
}
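
For anyone who gets stuck on the last step, one possible way to break a run of punctuation into one-character strings is sketched below. The helper name PushEachChar is hypothetical and this is only one of several reasonable approaches; it relies on the std::string constructor that takes a count and a character.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original posts): append every character
// of run to vec as its own one-character string.
void PushEachChar(std::vector<std::string>& vec, std::string const& run)
{
    for (std::string::size_type i = 0; i < run.size(); ++i)
    {
        // std::string(count, ch) builds a string of count copies of ch.
        vec.push_back(std::string(1, run[i]));
    }
}
```

Under that assumption, the placeholder push in the else branch could become a call such as PushEachChar(vec, newString), so "****))" would add six elements to vec.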
Comments
I really appreciate all your help

Try Tom Gunn's code out. Make sure it "works" for all possible test cases (i.e. change the line that initializes str in the program below for every test case you can think of and make sure the code "behaves").

It does not, as you proved. I did not consider adjacent punctuation in my haste to get my post out the door and ended up over-engineering the whole thing. Since punctuation is always a single character in this case, the simpler solution for matching punctuation works better:

#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> SplitOnPunct(std::string const& str,
                                      std::string const& punct)
{
    std::vector<std::string> vec;

    if (str.length() == 0) return vec;

    std::string::size_type pos, end;

    for (pos = 0; pos != std::string::npos; pos = end)
    {
        end = str.find_first_of(punct, pos);

        if (end == pos && ++end == str.size()) end = std::string::npos;

        vec.push_back(str.substr(pos, end - pos));
    }

    return vec;
}

int main()
{
    std::vector<std::string> vec = SplitOnPunct("dressed(with)", "+-*/<=!>{()};,");

    for (std::vector<std::string>::size_type a = 0; a < vec.size(); ++a)
    {
        std::cout << "vector " << a << ": " << vec.at(a) << '\n'; 
    }
}

I still do not guarantee 100% correctness because it is hard to find my own mistakes. ;) All of the basic test cases seem to work though.
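
For anyone who wants to try more cases, here is a minimal sketch of a way to compare results at a glance. SplitOnPunct is copied from the post above so the block compiles on its own; JoinTokens is a hypothetical helper, and '|' is used as the separator only because it is not in the punctuation set used in this thread.

```cpp
#include <string>
#include <vector>

// Copied unchanged from the post above.
std::vector<std::string> SplitOnPunct(std::string const& str,
                                      std::string const& punct)
{
    std::vector<std::string> vec;

    if (str.length() == 0) return vec;

    std::string::size_type pos, end;

    for (pos = 0; pos != std::string::npos; pos = end)
    {
        end = str.find_first_of(punct, pos);

        // A punctuation character is a one-character token; if it is the
        // last character, signal the end of the string with npos.
        if (end == pos && ++end == str.size()) end = std::string::npos;

        vec.push_back(str.substr(pos, end - pos));
    }

    return vec;
}

// Hypothetical helper (not from the original posts): join tokens with '|'
// so a whole result can be compared as a single string.
std::string JoinTokens(std::vector<std::string> const& vec)
{
    std::string out;

    for (std::vector<std::string>::size_type i = 0; i < vec.size(); ++i)
    {
        if (i != 0) out += '|';
        out += vec[i];
    }

    return out;
}
```

With the punctuation set "+-*/<=!>{()};,", joining the tokens of "dressed(with))" gives "dressed|(|with|)|)", and "dressed(to**!impress{{)" gives "dressed|(|to|*|*|!|impress|{|{|)".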

Comments
works a treat....exactly what i needed