I have written the following function to accept a string from a csv text file that presumably contains a series of comma separated numbers.

double comma_read(int &index, const string &input)
  // Reads a number from a comma delimited (csv) file
  // Returns first number after index-th comma
  // Then updates the index value to prepare for the next read
  {
  string accum="";
  int commacount=0;
  int position=0;
  double out;
  const int inputlength=int(input.length());
  while (index!=commacount && position<inputlength) // stop at the end of the line if it has fewer than index commas
   {
    if (input[position]==',') commacount++;
    position++;
   }
  if (input[position]==',') // data is blank
    out=MISSING;
  else if (position==inputlength) // data is blank at the end of the data line
    out=MISSING;
  else // data is not blank
   {
    while (input[position]!=',')
     {
      accum+=input[position];
      position++;
      if (position==inputlength) break;
     }
    char *p; // needed for strtod function 
    out=strtod(accum.c_str(), &p);
   }
  if (fabs(out - -3.70e28) <= ZERO) out=MISSING;  // checking for old_csv missing data
  index=commacount+1;
  return out;
  }

The function returns the double that is located after the index-th comma in the string.
You can see that I approach this problem by parsing the string, first looking for the index-th comma. Once the index-th comma is found, I accumulate the contents of the string character by character into the string accum until another comma is found or the end of the line is reached. Then the accum string is converted to a double using the strtod function. The index is updated so that I can use this function in a loop to read all the numbers in the line.

The question I have is: is this the most efficient way to accomplish this task?
Any ideas for improvement?

Thanks.

Well, I see the keyword "string", so you are using C++. So don't limit yourself to C, which is what you are doing. Stringstreams, containers, and algorithms are available in C++. If this is a bunch of numbers separated by commas, go through your string and replace the commas with spaces, create a stringstream from the revised string, and read your integers from that stringstream. Easier to code and easier to read.

#include <string>
#include <sstream>
#include <algorithm>
#include <vector>
#include <iostream>
using namespace std;

int main(int argc, char** argv) 
{
    string str = "5,8,4,9,124,23,17";
    replace(str.begin(), str.end(), ',', ' ');
    stringstream ss(str);
    vector<int> intVector;
    int x;
    while(ss >> x)
    {
        intVector.push_back(x);
    }
    int numInts = intVector.size();
    for(int i = 0; i < numInts; i++)
        cout << intVector[i] << endl;
    return 0;
}

Note: Easy enough to change the above from int to double. Another benefit is that you don't have to worry about weird strings with spaces and tabs in there. Stringstream takes care of all that for you.

Thanks. I'll have to read up on stringstream.

But what happens if the data string is something like:
string str = "5,8,4,9,124,,23,17";
That is, there are two commas between the values 124 and 23, which the machine generating these data often uses to signify a missing value. My function checks for this and assigns MISSING if that is the case. Is there a way to incorporate that type of check with the stringstream so that the constructed vector intVector contains a MISSING value in the appropriate place?

Note, if you read that csv line into Excel the double comma would appear as an empty cell.

I like the idea that your method is generic for comma, space or tab separated values. Most space separated data files, that I know of, use a code for missing value, like my MISSING, when that occurs. But comma and tab separated files frequently use a double comma or double tab to skip over the missing (empty cell) value. I do not see how stringstream would handle that.

Also, I notice that if the input string was something like:
string str = "5,8,4,9,124,missing,23,17";
The program you provided would not read anything beyond the value 124. It appears that once a non-numeric value is encountered, the stringstream stops converting, so the rest of the line ("missing 23 17") is never read. So, how would I read the numbers beyond the word in the string?

Thanks for your help.

"5,8,4,9,124,23,17" and "5,8,4,9,124,23,,17" would yield the exact same results, so if they need to yield different results representing empty fields, you'd need to do it a bit differently.

Here's what stringstreams do. The program I wrote will function exactly the same if you replace this

while(ss >> x)
{
    intVector.push_back(x);
}

with

while(cin >> x)
{
    intVector.push_back(x);
}

and you typed the string on the input line with spaces where the commas are. The one difference is that you would not get out of the loop using cin, because it would not know that "end of file" was reached the way a stringstream or an ifstream does.

Ditto if you replace the commas with spaces on "5,8,4,9,124,missing,23,17" to make "5 8 4 9 124 missing 23 17". If you typed this into the console, cin would read till it hit "missing", then fail because it is looking for digits and it found an 'm' since my program had it reading in integers, not strings.

So that's why my program behaved as it did. I designed it as an example not considering the double commas or the "missing" part. Now to your question as to efficiency since I obviously skimmed over your question rather than reading it. Sorry bout that.

Give this a whirl. Not sure this is any easier than what you had before. You can also use the getline function and specify the delimiter as a comma rather than a newline. Lots of ways to do this. I personally think this is a little easier to read and follow. YMMV.

#include <string>
#include <sstream>
#include <algorithm>
#include <vector>
#include <iostream>
using namespace std;

const string MISSING = "-999999";

int main(int argc, char** argv)
{
    string str = ",,,5,8,4,9,,124,,,23,17,,";

    // find double commas and replace with MISSING
    size_t pos = 0;
    do
    {
        pos = str.find(",,", pos);
        if(pos != string::npos)
        {
            str.insert(pos+1, MISSING);
        }
    }
    while(pos != string::npos);

    // now check at the front and back of the string for commas
    if(str[0] == ',')
    {
        // replace comma with space and insert MISSING
        str[0] = ' ';
        str.insert(0, MISSING);
    }
    if(str[str.length()-1] == ',')
    {
        // replace comma with space and insert MISSING
        str[str.length()-1] = ' ';
        str.insert(str.length(), MISSING);
    }

    // now replace commas with spaces
    replace(str.begin(), str.end(), ',', ' ');
    stringstream ss(str);
    vector<double> doubleVector;
    double x;
    while(ss >> x)
    {
        doubleVector.push_back(x);
    }
    int numDoubles = doubleVector.size();
    for(int i = 0; i < numDoubles; i++)
        cout << doubleVector[i] << endl;
    return 0;
}

Regarding the actual word "missing" being in there, I didn't see where the code you posted handles that, so I didn't offer an alternative.

The function I provided did not handle "missing" as a word within the string. I have another, similar, function that does that. It was just that I was intrigued by what you had shown me and so I was thinking about what else it could do.

Now you've got me wondering about efficiency. The code example that you originally provided would go through the string three times.
First to replace commas with spaces.
Second to place the string into stringstream -- don't know if that really counts as one time through.
Third to place the stringstream contents into the vector.

My function went through the string twice.
First to count the commas in the string, which is really a check to see if the index-th comma is beyond the end of the string.
Second to find the string element after the index-th comma -- really only a complete pass through if we want the last number in the string.

So, do passes through the string represent efficiency (in the same way that reading passes through a file would)?
If so, is the stringstream method still more efficient, or is it merely a more elegant version of the C++ code?

Thanks again.

Well this is annoying. I wrote a long response that somehow got lost. Won't rewrite it, but here's a version that handles words like "MISSING".

#include <string>
#include <sstream>
#include <algorithm>
#include <vector>
#include <iostream>
using namespace std;

const string MISSING = "-999999";
const string ERROR   = "-888888";
const double MISSING_VAL = -999999;
const double ERROR_VAL   = -888888;
const string MISSING_STRING = "MISSING";

int main(int argc, char** argv)
{
    string str = ",,,5,8,4,9,MISSING,junk, more_junk,124,,,23,17,,";
    string temp = "h";

    // find double commas and replace with MISSING
    size_t pos = 0;
    do
    {
        pos = str.find(",,", pos);
        if(pos != string::npos)
        {
            str.insert(pos+1, MISSING);
        }
    }
    while(pos != string::npos);

    // now check at the front and back of the string for commas
    if(str[0] == ',')
    {
        // replace comma with space and insert MISSING
        str[0] = ' ';
        str.insert(0, MISSING);
    }
    if(str[str.length()-1] == ',')
    {
        // replace comma with space and insert MISSING
        str[str.length()-1] = ' ';
        str.insert(str.length(), MISSING);
    }

    // now replace commas with spaces
    replace(str.begin(), str.end(), ',', ' ');
    stringstream ss(str);
    vector<double> doubleVector;
    double x;

    cout << str << endl;
    while(true)
    {
        ss >> x;
        if(ss.fail())
        {
            // we read something other than a double.  Clear and try to read it in
            // as a string instead.
            ss.clear();
            ss >> temp;
            if(temp == MISSING_STRING)
                x = MISSING_VAL;
            else
                x = ERROR_VAL;
            if(ss.bad())
            {
                // if it's STILL bad, give up.  Should not get here
                break;
            }
        }

        doubleVector.push_back(x);
        if(ss.eof())
            break; // reached end of line
    }
    int numDoubles = doubleVector.size();
    for(int i = 0; i < numDoubles; i++)
        cout << doubleVector[i] << endl;
    return 0;
}

This is one of the most common real-world problems you will encounter in computer programming. There are a number of approaches: simple but slow, and more complex but fast. Which one to choose depends upon your needs. If you need to parse a million of these strings in a short period, then complex but fast is better. If the volume of data is not that high, then simple but slow would be better.

Simple but slow: copy the data up to the next comma into a standard C string buffer and use strtod() to convert the value to a double, or atoi()/atol() for integers or long ints. Don't forget to null-terminate the buffer before conversion in either case.

Complex but fast is another story. Use a buffer and an array: scan the string right-to-left to the previous comma, replace the comma with a null, and put the numeric string in the array; scan to the next previous comma, replace it with a null, put the string in the array, and repeat until you get to the beginning of the string. Then you can convert each member of the array. Remember that in this case the values will be in reverse order.

If an entry is empty, you need a default value (true of either the slow or the fast approach), usually either 0 or -1; your choice as to what is appropriate.