Hi all,

This is quite a specific question with a relatively large amount of explaining so ill try to make it really short to try save peoples time.

In a nutshell what i have is a large csv file and im trying to search a particular entry (line in the file) as fast as possible.

I origianlly just read the whole file and kept all the elements in an array of struct in memory but it has grown massively large, so that is not an option anymore. The file contains the collection of structures with each element of the structure seperated by a comma and each structure on a new line. so something like this

elementId,int,float,string,int,int,<lots more>

example
0,463,5.67,tom,marketing,1,0,....
1,365,5.78,john,sales,4,1,...

What im currently doing is something like this ( not real code so it might have typo's / syntax errors sorry )

void CSVManager::FetchEntryFromCSV( uint32_t nId, struct csvElement_st& result )
{
   /* string to store the line from the text file */
   string sFileEntry;

   /* open file handle */
   ifstream fCSVFile;
   fCSVFile.open ( m_sCsvFilePath, ifstream::in );

   /* skip the records i dont want */
   for( uint32_t n = 0; n < nId; n++ )
   {
     getline( fCSVFile, sFileEntry );
   }

   /* The previous loop skipped all the records that come before the one i want 
    * so the next line in the file is the element i want
    * so get this one from the file 
    */

    /* get the record i do want */
    getline( fCSVFile, sFileEntry );

    /* split out all the comma's and put the actual elements into the struct */
    ParseLineToStruct( sFileEntry, result );

    /* the struct passed in now has the details from the csv file so were done here :) */
}

And now finally the core of the question is,

Can i make this faster and ideally as close to constant time to look up any element?

I had considered working from the start or end of the file based on if the index was over 50% of the file down ( i know this information up front ) but i dont know how to do a reverse file handle that would do getline() from the bottom up.

Is there any quick way i can do a goto line without seeking through the whole file or is that asking too much?

Many thanks in advance and if anything is not clear just ask ill try to explain it better.

Thanks Kano

Recommended Answers

All 7 Replies

>>Is there any quick way i can do a goto line without seeking through the whole file or is that asking too much?

Depends. If it is a file that you want processed only once, then the answer would be no. But if you need to process the file multiple times either in the same or different sessions then maybe. What I have done in such cases is to create an index file that contains a key field and offset to that line in the original file. This index file is much smaller than the original cvs file, can be sorted after it is created so that your program can use binary search techniques on it (assuming the key file is a binary file). If you don't want to do that then the only alternative I have is to read the whole file serially until you get to the desired line.

is the data you are looking for in random locations horizontally?
If not, you can create a structure of just the key elements and the offset in the file. Then after you've found your key element, you can "seek" back to the location that contains your target row.

This still has sequential elemements in it, but it all depends on how you want to use the data.
Also, you could also make your own indexes (almost the same thing) in the file and store them until the file changes.

Is databasing this file an option?

Thanks AD,

I like your idea and it should work for me, The file i have is statically compiled information that does not change, its basically configuration metrics but that didnt give a very good example so my example csv data is rubbish above.

So if i understand you correctly,

I have a prepare function that basically zipps over the file on init and gets the size of the file, number of lines in the file and that kind of thing. You would then suggest that i create a small LUT in memory that contains the binary file pointer index such as used by fseek() for each record? so i would have something like

uint32_t FileLineIndexes[ kNumLines ];

And in each slot in the array i keep the file pointer to each line in my csv. With this then to access each element my function would be something like ( the file handle is always open in my csv manager so fopen can be skipped )

void CSVManager::FetchEntryFromCSV( uint32_t nId, struct csvElement_st& result )
{
   fseek( fCSVFile, FileLineIndexes[ nId ], SEEK_SET );

   /* then parse the line and assign to the structore here */
}

This seems like a great idea but leaves me with one final question,

Can i use my ifstream file object with the cstyle fseek operation or do i need to figure out some kind of a workaround for that?

Thanks again

Sorry thines,

Didnt see your post there, but i think its along the same lines as what AD had posted and looks like what i think you are both suggesting i should do.

I cant make the file a database a csv is the only option due to it is already used in other legacy parts of the system.

However had a database been an option would querying it be faster than the text file search?

Thanks for your response

However had a database been an option would querying it be faster than the text file search?

That would really depend on how often the file changes and the types of queries against it.
If it could be loaded in times of low usage and it is properly indexed, yes, it would be quicker than searching the file. If the file changes often, it possibly would not give much benefit. If the data is not queried often, the benefit would still be low.

Can i use my ifstream file object with the cstyle fseek operation or do i need to figure out some kind of a workaround for that?

seekg

Hi,

Thats totally excellent. In my case the file is not changed very often, as in once every few months and the number of queries depends on user requirements some frequently some rarely so i think this scenario would be able to make good use of the database if it had been an option.

Thanks again everyone for the answers, i think we can give this one a case closed :)

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.