Hi All,

I have to handle large text files, 10GB and more, which are exported from some application software. The required information is scattered throughout each file. I need to gather all of that information in a particular format for further analysis. For example, if the input file contains the following:

#10 // time stamp 10
#15 // time stamp 15

I have to gather the information as,

! 10 0 15 1
( 10 1
= 10 1
: 10 0

So please suggest how to store the gathered information: as a vector, an array of class objects, or something else such as a database. The information will surely occupy more than 10GB, and I have to be able to access it later for further processing.

With Regards,

Perhaps a better example of how the input relates to the output would help us make some suggestions. It could be binary for all I know.

If you intend to store the results in memory, then the condensed output needs to be at most a few hundred MB in size (down from your 10GB+ size).

The input file contains the information in the following format,

1 SN1
0 SN2
0 SN3
x SN1
1 SN2
0 SN5
0 SN1

In the format given above,
#10, #15, #50 are timestamps
SN1, SN2, SN3, SN5 are signal names
0, 1, x are logical values

From the above example, I have to gather information such as: SN1 changes at #10, #15, and #50, and its logical values are 1, x, 0. Likewise, I have to gather the same information for every signal. Later, when I look up a signal name, I should be able to retrieve all the relevant information (i.e., timestamps and logical values) for that signal. So I need to keep all of the information for future reference. Please suggest the best way to store it.

Gathered information from the above example:

SN1 10 15 50
1 x 0

SN2 10 15
0 1

SN3 15

SN5 50

So from what you're saying, the new file is also going to be of the order of 10GB in size, because it contains all the information, but re-arranged into a different order.

If this is the case, then there is no possibility of storing the whole thing in memory.

How many signals do you have to deal with? Is it, say, 10 (a few) or 1000 (lots)?

The total number of signals ranges from 1 to 1000 and more. The information associated with each signal will normally also be large (i.e., there will be lakhs and lakhs of timestamps, that is, many hundreds of thousands).

Since you're not going to store the whole data in memory at once, I would suggest the following.

Read in a number of lines (say approx 10M), until you've stored several hundred MB of data (or whatever your machine can support without excessive swap file usage).
Write all the data (in the new format) to a temp file.
Repeat these two steps until you have a collection of temp files.

Then read from all your temp files to produce a "merge" of all the intermediate results to produce the final (single) result file.

Is the data formatted as above? With each signal on a new line? It should be easy enough to parse then... How many time stamps are there?

So, if I understand this, basically you have a kind of log file that is written out by time stamp but which you need to put into signal order?

If so, it seems like you need something like a map to group the data where the keys are the signals and the data is a pointer to some kind of container (list, vector or whatever) to hold the data, namely the timestamps and values.
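Something along these lines is what I have in mind for the map, as a minimal sketch only. I am assuming a "#<n>" line opens a timestamp and every following "<value> <name>" line is a change at that time; the Change struct and group_by_signal are names I made up for illustration. This version keeps everything in memory, so on its own it only works for input that fits.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One recorded change of a signal.
struct Change {
    long long time;  // taken from the most recent "#<n>" line
    char value;      // '0', '1' or 'x'
};

using SignalMap = std::map<std::string, std::vector<Change>>;

// Groups a time-ordered stream by signal name.
// Assumed layout: "#<n>" sets the current timestamp, "<value> <name>" is a change.
SignalMap group_by_signal(std::istream& in) {
    SignalMap signals;
    long long now = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '#') {
            // stoll stops at the first non-digit, so a trailing comment is ignored;
            // it throws if no number follows the '#'.
            now = std::stoll(line.substr(1));
            continue;
        }
        std::istringstream fields(line);
        char value;
        std::string name;
        if (fields >> value >> name)
            signals[name].push_back({now, value});
    }
    return signals;
}
```

Looking up a signal afterwards is just signals.find("SN1"), which gives you that signal's timestamps and values in file order.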

The problem is that you can't hold all of this in memory, so you need to use temporary files. That means dumping a signal's container to a file whenever it reaches a certain fill point. The easiest way to do this is to have one temp file per signal, so you either need some way to keep track of how many files you have open at once (lest you exceed the system limit), or you append to and close the file each time you dump the data. Finally, assuming the data is now correctly organized in the temp files, you need to read the temp files and essentially append them together into one big output file.

Or have I completely misunderstood?


Thank you all for your kind responses and suggestions. I will try creating temporary files and let you all know the result in a week or two.

Thank you all,

T. BalaKrishnan