
CSV is more than just comma-delimited fields; there are quoting and white space rules too. This short function is an example of a complete way to read CSV records one at a time.

#include <istream>
#include <string>
#include <vector>

/// <summary>loads a CSV record from the stream is</summary>
/// <remarks>
/// * leading and trailing white space is removed outside of
///   quoted sections when trimWhiteSpace is true
/// * line breaks are preserved in quoted sections
/// * quote literals consist of two adjacent quote characters
/// * quote literals must be in quoted sections
/// </remarks>
/// <param name="is">input stream for CSV records</param>
/// <param name="trimWhiteSpace">trims white space on unquoted fields</param>
/// <param name="fieldDelim">field delimiter. defaults to ',' for CSV</param>
/// <param name="recordDelim">record delimiter. defaults to '\n' for CSV</param>
/// <param name="quote">delimiter for quoted fields. defaults to '"'</param>
/// <returns>a list of fields in the record</returns>
std::vector<std::string> CsvGetLine(std::istream& is, 
                                    bool trimWhiteSpace=true,
                                    const char fieldDelim=',',
                                    const char recordDelim='\n',
                                    const char quote='"')
{
    using namespace std;

    vector<string> record; // result record list. default empty
    string field;          // temporary field construction zone
    int start = -1,        // start of a quoted section for trimming
        end = -1;          // end of a quoted section for trimming
    char ch;

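    for (;;)
    {
        if (!is.get(ch))
        {
            // end of input. if nothing is pending, return what we have.
            // otherwise treat the end of input as a recordDelim so the
            // last field of an unterminated final record is not lost
            if (record.empty() && field.empty() && start == -1) break;
            ch = recordDelim;
        }

        if (ch == fieldDelim || ch == recordDelim)
            // fieldDelim and recordDelim mark the end of a
            // field. save the field, reset for the next field,
            // and break if there are no more fields
        {
            if (trimWhiteSpace)
                // trim all external white space
                // exclude chars between start and end
            {
                const string wsList = " \t\n\f\v\r";
                string::size_type ePos, sPos;

                // order dependency: right trim before left trim
                // left trim will invalidate end's index value
                // ePos becomes the first index to erase: one past the
                // last non white space char, or 0 if there is none,
                // but never inside the quoted section
                ePos = field.find_last_not_of(wsList);
                ePos = (ePos == string::npos) ? 0 : ePos + 1;
                if (end != -1 && static_cast<string::size_type>(end) > ePos)
                    ePos = static_cast<string::size_type>(end);
                field.erase(ePos);

                // sPos becomes the number of leading chars to erase,
                // again stopping at the quoted section
                sPos = field.find_first_not_of(wsList);
                if (sPos == string::npos) sPos = field.length();
                if (start != -1 && static_cast<string::size_type>(start) < sPos)
                    sPos = static_cast<string::size_type>(start);
                field.erase(0, sPos);

                // reset the quoted section
                start = end = -1;
            }

            // save the new field and reset the temporary
            record.push_back(field);
            field.clear();

            // exit case 1: end of input with nothing pending, managed above
            // exit case 2: recordDelim, or end of input standing in
            //              for it, managed here
            if (ch == recordDelim) break;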
        }
        else if (ch == quote)
        {
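            // save the start of the first quoted section in this field
            // so a later quoted section does not expose an earlier one
            // to left trimming
            if (start == -1) start = static_cast<int>(field.length());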

            while (is.get(ch))
            {
                if (ch == quote)
                {
                    // consecutive quotes are an escaped quote literal
                    // only applies in quoted fields
                    // 'a""b""c' becomes 'abc'
                    // 'a"""b"""c' becomes 'a"b"c'
                    // '"a""b""c"' becomes 'a"b"c'
                    if (is.peek() != char_traits<char>::to_int_type(quote))
                    {
                        // save the end of the quoted section
                        end = static_cast<int>(field.length());
                        break;
                    }
                    else
                    {
                        // consume the second quote, keep one literal
                        is.get(ch);
                        field.push_back(ch);
                    }
                }
                else field.push_back(ch);
            }
        }
        else field.push_back(ch);
    }

    return record;
}

#if defined(TEST)
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    string csv = 
        "a\"\"b\"\"c\n"
        "a\"\"\"b\"\"\"c\n"
        "\"a\"\"b\"\"c\"\n"
        ",\n"
        "a\n"
        ",a\n"
        "a,\n"
        "a,\" a\"a,b\"b \"\n"
        "aa,b\"b,c\"c,   d   ,,ee,ff,  g\"g,h\"h  \n"
        "aa,  bb,cc  ,\"  dd\",\"ee  \",\"f,  g\n"
        ",  h\",\"i,\"\"j,k\"\",l\"\n";
    istringstream is(csv);

    while (true)
    {
        typedef vector<string> rec_t;

        rec_t rec = CsvGetLine(is);

        if (rec.size() == 0) break;

        for (rec_t::iterator x = rec.begin(); x != rec.end(); ++x) 
        {
            cout << '>' << *x << "<\n";
        }

        cout << string(20, '*') << '\n';
    }
}
#endif

When compiling this in Microsoft Visual Studio 2008, I got the following
errors: "error LNK2019" and "fatal error LNK1120".


You need to #define TEST to run the code as is. main() is conditionally compiled because this is a library function.
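
Those linker errors are most likely the linker reporting that it found no main(). As a minimal sketch of the same pattern (Greeting is a hypothetical stand-in for the library function, not part of the snippet), main() below is only compiled once TEST is defined, either with a #define above it or with a define on the compiler command line (-DTEST for GCC/Clang, /DTEST for MSVC):

```cpp
#include <string>

// Hypothetical stand-in for a library function such as CsvGetLine.
std::string Greeting()
{
    return "hello";
}

// Compiled only when TEST is defined. Without the define, this file
// contributes no main(), so linking it alone into a program fails.
#if defined(TEST)
#include <iostream>

int main()
{
    std::cout << Greeting() << '\n';
}
#endif
```

When the function is used from a real program, leave TEST undefined and supply your own main().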


This is quite useful code and a good example of a simple CSV parser, but I think it has one defect: if the last input line isn't terminated by the recordDelim character, the last returned record doesn't contain the last value.


- bug: the last field will NOT be read when the input ends at EOF without a record delimiter
- dangerous: as I read the specification, the return value of std::istream::get should not be relied on to detect EOF. Instead, istream::good() should be checked at the beginning and after each get operation
- no error check for a missing closing '"'
- some implementations might use unsigned int for the variables "start" and "end"
- instead of "-1", string::npos should be used. Operations should not depend on string::npos having a specific value
- field.erase((end > ePos) ? end : ePos + 1); should also check for the end != string::npos condition
- the code is not really elegant, as several variables with different states are involved. It would be better to have only ONE variable with several states in the routine
- the code is probably much slower than straight C code using fast pointer operations. It is questionable whether C++ is really an advantage here
- perhaps TCHAR should be used instead of char

All in all, I would not recommend this code for production use on several platforms.
