Just to introduce myself again, My name is Shaun and I'm a 21 year old engineering student with about... six weeks of c++ experience.


I main concern isn't my ability to get the job done. I have enough Matlab experience and logical thinking to accomplish this task. My worry is that with all the amazing libraries available for C++, there's probably a MUCH more efficient way of doing this.

I have a text file. The data is arranged in the following format.

<spaces> "Column 1 Data" <spaces> "Column 2 Data" <spaces> "Column 3 Data" <new line>

The number of spaces is not consistent. I need to extract the data from column three and do n-gram analysis on it. If anyone is interested as to what exactly n-grams are and how they're used, I'll be happy to explain. However, for the sake of brevity, I'll just provide an example; this should be sufficient.


For the string "Shaun" I would need to produce

S
h
a
u
n
Sh
ha
au
un
Sha
hau
aun
Shau
haun
Shaun


I should point out that I did NOT stop there because I had reached the length of the word. No matter what the size of the string, I will only break it up into a maximum string length of 5.


So, using a column based approach, I was able to accomplish this using a combination of Matlab and Excel. However, I'd like to do it in Visual Studio C++ 7.1.

My idea is to first use regular expressions to look for a space followed by any number of optional spaces. I'd replace every match with a comma, thus giving me a file delimited by commas and not a varying number of spaces.

Next, I can use the ifstream.get() function to break up the columns, discarding the first and second column and writing the characters in the this column to an object str of the class string, while looking for a \n to stop on.

Once I have str, I can break it up using... some function. This is the part I really need your help on.

Once I have broken it up, I'll store the pieces somewhere (I can do this part later, it's more complicated and is my task for next week) and then loop through again, discarding columns 1 and 2 from the next line and so on.


That's where I stand, I'm installing Boost right now and I'm reading up on the regular expression capabilities.

Thanks!

Recommended Answers

All 23 Replies

Apparently, the regex part is completely unnecessary, as standard cin >> will ignore whitespaces. So, off to part II.

The >> will not work if the string within a given column has any whitespace in it. For example, column 1 is a combination of first name and last name separated by a space, column 2 is an address and column 3 is an ID number. If non of the strings in the colum can have embedded white space then the >> operator will work well.

To break up each string into individual characters you can just access the index number of each char in the string. If the string is Hi! then
str[0] = = 'H';
str[1] == 'i';
str[2] == '!';
str[3] == '\0';

The >> will not work if the string within a given column has any whitespace in it. For example, column 1 is a combination of first name and last name separated by a space, column 2 is an address and column 3 is an ID number. If none of the strings in the colum can have embedded white space then the >> operator will work well.

Yeah, I thought of that. There's no whitespace within the column data. thanks for pointing it out though.

To break up each string into individual characters you can just access the index number of each char in the string. If the string is Hi! then
str[0] = = 'H';
str[1] == 'i';
str[2] == '!';
str[3] == '\0';

I thought of this, it's basically how I did it in Matlab. It seems like this is more of a C based approach, I was wondering if any of the C++ utilities could streamline the process. For instance, if str is an object of the class string, str.remove(pos, length) will remove a segment of a set length from a set starting point. If I could find a function that instead of removing the section, copies the section and assigns it to another string (maybe using a combination of insert and remove?), that would be ideal. Admittedly, I haven't started my own serious exploration of this section yet, I've been writing the rest of the code up to this point, so there might be an easy answer that I'll quickly find. Thanks for the suggestion though.

There is no particularly 'easy' answer, but you can use the STL to do it:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <deque>

// Our dummy class
template <int n>
struct column_n_t: public std::string { };

// Our dummy class input extractor
template <int n>
std::istream& operator >> ( std::istream& ins, column_n_t <n> & c )
  {
  std::string s;
  std::getline( ins, s );
  std::stringstream ss( s );
  for (int i = 0; i < n; i++)
    ss >> s;
  if (!ss) s.clear();
  c.assign( s );
  return ins;
  }

// A convenient function
template <int n>
std::deque <std::string> extract_column_n( std::istream& ins )
  {
  std::deque <std::string> result;

  std::copy(
    std::istream_iterator <column_n_t <n> > ( ins ),
    std::istream_iterator <column_n_t <n> > (),
    std::back_insert_iterator <std::deque <std::string> > ( result )
    );

  return result;
  }

Now to extract column three from the file:

std::deque <std::string> column_3 = extract_column_n <3> ( myfile );

This only works if you know which column you want to extract (3 is a constant) and if you know for sure that whitespace only appears between columns.

If either one of those preconditions are false some slight modifications will need to be made.

Hope this helps.

The closest standard way I can get you is the next_permutation function in the algorhithms header of the STL. I'm sure somebody has written a function to do at least something like this before, and it may be in some third party library (Boost?) but if so, I'm not aware of it. Good luck.

In my head part II can be done with a nested loop using three layers. The outer controls the length of the permutation to be generated. The middle loop controls the starting char of a given permutation. The inner loop actually gets the individual char from the original string to generate each string permutation.

In my head part II can be done with a nested loop using three layers. The outer controls the length of the permutation to be generated. The middle loop controls the starting char of a given permutation. The inner loop actually gets the individual char from the original string to generate each string permutation.

That's sort of what I was thinking, though I think I may be able to combine the second and third loop using string::copy.

Do you think that would work?

Duoas,

That's a long replay, thanks for taking all the time to help me.

However, it'll take me a while to pick through it since I'm very new to this language. Still, I think that understanding that code will be a good measure of progress for a days work. :) I'll get back to you on that after I read a few more chapters.

Once again, thanks for your time.

Hello Shaun32887

following is a small class what has been posted by a programming virtuoso here in forum on daniweb.com. Unfortunately, I have not recorded his name.

//
// String passing
#include <string>
#include <sstream>
#include <iostream>
using namespace std;

class Pars
{
public:
  double bar ( string line )
  {
    stringstream ss ( line );
    string word;
	double x;
    int n = 0;
    while ( ss >> word )
	{
	  istringstream ins;
	  ins.str ( word );
	  ins >> x;
	  cout << n << " " << x << endl;
	  n++;
    }
    return x;
  }
};

int main()
{
  Pars stri;
  string s =  "100 1.745329252 0.984807753 -0.173648178 -5.67128182";
  cout << "Parsing: "<< s << endl;
  stri.bar (s);
  return 0;
}
/* Result
Parsing: 100 1.745329252 0.984807753 -0.173648178 -5.67128182
0 100
1 1.74533
2 0.984808
3 -0.173648
4 -5.67128
*/

As for your permutation problem, I will search my collections. If I make a find, I will send you the code.

krs,
tesu

I find the STL string methods to be a bit cryptic so I don't use them much (they'd probably get less cryptic if I used them more!). You might want to compare substr() and copy() to see which makes the most sense to you.

That's sort of what I was thinking, though I think I may be able to combine the second and third loop using string::copy.

Do you think that would work?

I think that a permutation problem is crying out for recursion.

void Permutations (vector <string>& thePermutations, 
     string startOfString, string stringToPermute)
{  
     // base case. stringToPermute's length = 1
     if (stringToPermute.length() == 1)
     {
          string aString = startOfString + stringToPermute;
          thePermutations.push_back(aString);
     }
     else
     {
          for (int i = 0; i < stringToPermute.length(); i++)
          {
              char aChar = stringToPermute[i];
              string aString = startOfString + aChar;
              string anotherString;
              // make anotherString be stringToPermute
              // minus aChar
              Permutations(thePermutations, aString, anotherString);
          }
     }
}

Didn't mean to write almost the whole function, but I couldn't think of how else to post. If you have a three character string ("dog") and call this function with

vector <string> thePerms;
Permutations(thePerms, "", "dog");

you'll get all the three letter permutations of "dog" (six in all).

Whoops! My bad! You're not really doing permutations, are you? Guess I need to read a little closer next time before posting!.

I was about to :-O if you had actually solved permutations recursively. To iterate is human to recurse is devine as they say...

Shaun, it's likely you've got the parsing issue down by now (I think the CPP String class has built in parsing with substr()) so I'll make the following suggestions:

perhaps there is a string permutator class somewhere out there in the void... mayhaps the google will be able to find it.

elsewise, do you have an algorithm in MATLAB that permutes strings? If so your problem is one merely of translation, and they make robots (MATLAB-Compiler and Schlumberger employees) that can do that now...

elsewise, do you have an algorithm in MATLAB that permutes strings? If so your problem is one merely of translation, and they make robots (MATLAB-Compiler and Schlumberger employees) that can do that now...

Actually, now that you mention it, I DO like the n-gram generating section of my code. The rest of it was very inefficient, but this part was pretty nice.

I'll call my favorite Schlumberger employee later and talk about getting that translator.

You've got two options, you can just re-write the code you used to make n-grams in MATLAB in C++ directly, but if your company provides you with the MATLAB compiler you could also translate your entire n-gram generator into a stand-alone (well not quite) DLL.

I'm sure beers and dinner would be a nice end to this week... you should perhaps suggest that to your friend, he's likely writing technical reports today.

I was about to :-O if you had actually solved .... and they make robots (MATLAB-Compiler and Schlumberger employees) that can do that now...

seeing as my time limit for edits has expired, I just wanted to clarify: I meant /n-gramerator/ and "solved n-gramming recursively" and so on and so forth. Yes, n-grams can become verbs.. anything can become a verb.

I was about to :-O if you had actually solved permutations recursively. To iterate is human to recurse is devine as they say...

Here's the full code for regressive permutations. I wasn't sure what to do for words like "tree", with duplicate letters ('e' in this case). Do I "swap" the e's in positions 3 and 4 count "tree" and "tree" as two separate permutations? Or does each permutation have to have a unique spelling? I'm not sure what the definition of "permutation" is on that one. If all the permutations have to have unique spelling, a little code to search the vector before adding the "new" permutation and only adding the "new" permutation if it's not already in the vector would solve that. This code doesn't do that (i.e. doesn't search the vector for duplicates), so it adds duplicate spellings. That's only an issue when one or more of the letters in a word is the same.

#include <vector>
#include <string>
#include <iostream>
using namespace std;

void Permutations (vector <string>& thePermutations, 
     string startOfString, string stringToPermute);

int main ()
{
    cout << "Enter a word: ";
    string aWord;
    cin >> aWord;
    
    vector <string> thePerms;
    Permutations(thePerms, "", aWord);
    for (int i = 0; i < thePerms.size(); i++)
    {
        cout << thePerms.at(i) << endl;
    }

    return 0;
}


void Permutations (vector <string>& thePermutations, 
     string startOfString, string stringToPermute)
{  
     // base case. stringToPermute's length = 1
     if (stringToPermute.length() == 1)
     {
          string aString = startOfString + stringToPermute;
          thePermutations.push_back(aString);
     }
     else
     {
          for (int i = 0; i < stringToPermute.length(); i++)
          {
              char aChar = stringToPermute[i];
              string aString = startOfString + aChar;
              string anotherString(stringToPermute);
              anotherString.erase(i,1);
              Permutations(thePermutations, aString, anotherString);
          }
     }
}

Neat trick. Recursion still gives me a headache at this point, but I suspect that'll change after I take a class or two on data structures this upcoming semester... either that or the headache gets worse and my head implodes.

Recursion can be useful/fun/illuminating/confusing all at the same time. Here's another tack on a possible recursive solution:

void perm(string s, int permLen)
{
  int strLen = s.length();
  if(permLen == strLen)
    cout << s;
  else
  {
    for(int i = 0; i + permLen <= strLen; ++i)
    {
      for(int j = i; j < permLen + i; ++j)
        cout << s[j];

      cout << endl;
    }
    perm(s, ++permLen);
  }
}

But iterative solutions are fine in most circumstances. Here's a possible iterative tack:

int strLen = s.length();
for(int permLen = 1; permLen <= strLen; ++permLen)
  for(int i = 0; i + permLen <= strLen; ++i)
     cout << s.substr(i, permLen) << endl;

iteration would be more suitable for our friend Shaun. If you remember from his question he's doing biomedical engineering work, likely his stack will overflow if he tried to n-gram 20k+ lines of input recursively...

:) Yeah.

I should mention, I have more than 200k lines.

Furthermore, the length of the strings can be anywhere from four character to over 500 characters, so speed is a pretty big consideration.

This is the function I ended up using. Pretty straightforward, and it gets the job done.

Now I need to enter the grams I generate into a hash table...
Which means I should probably learn something about hash tables.

void ngram(string SMILES, int strLen, ofstream *osptr)
{
    int pos = 1;
    for (int i = 1; i <= MAXGRAM; i++)
        for (int j = 0; j < (strLen - (i - 1)); j++)
            *osptr << SMILES.substr(j, i) << endl;
}

Thanks for all the help everyone, that substr function really made the job easy. I hope I provoked some fun discussion while I was at it.

This is cool; as it stands, I can isolate and parse an average of 723 strings per second with that.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.