Hi

I am needing to store some data for a project I am working on, I am trying to figure out the fastest method to access the stored data, the data takes the form:

keyword (std::string) - list of ints

ie:

yellow - 2342 2312 8478 3827 9773 4837 2893 0983 478 2981, etc

I need to store this information in a memory based cache, size isn't really in issue the machine this will be running on has 8g of ram of which I can freely use about 2g.

I am thinking a Btree would be best and then store the ints in a vector, ie:

Btree node yellow:
vector<int>
vector<int>

More background this is a searchengine previously written in perl, which came up against some obvious limitations when it started getting >10 hits a second, it is being reimplimented in C++ to remove some of the limitations of perl.

The data was previously stored in a large hash of arrays:

$hash{yellow} = (int, int, int, int, int, int, int);

So data was retreived pretty quickly.

I am looking at about 2000 keywords and about 3000 ints which can repeat in different keywords, when a request is made I need to be able to pull all the ints in the requested keyword.

I will be able to figure out the code without too many problems I am just looking for some recommendations on the most efficient way to store this data for rapid retrival. I read that Btree is the quickest when dealing with strings or character arrays (I can convert the data to be char[] from std::string if need be), and that linked lists are better for smaller amounts of data. I could use map<std::string, vector<int>> I guess (I am not sure how efficient that would be though).

So if anyone has recommendations on the best methods please share :)

Thanks
Ben

Hello Ben,

I do not remember the Com Sci specifics in class, but if you can sort your data in memory, then a Btree should be the fastest search. You will have to design your code though so that the tree is evenly balanced. Of course, if the data is random, then good luck with you there. But if you can sort it, I would B-TREE it.

Your worst case is going to be a linked-list situation where your data is stored in the last node of the data stream. Then again, if you make a lopsided B-TREE, then you are left with the linked list.

If possible, you might be able to also build into your data bucket, a repition field, so that when you are loading your data structure, if you have a repeat of the exact data, you can increment a repeate variable, so that you know right away that your data set had X instances of the same information.

Good luck coding it. And if your compiler supports profiling (generating statistics), you might find some interesting things out.

Christian

Excellent I have a good book covering how to build well balanced B-Trees so I guess tonight in bed I will be reading up on those chapters.

I am using g++ 2.95 as the compiler on Solaris 8 sparc, and intel and linux 2.40.20 on x86 all of which support profiling I believe, so I will check that out.

Thanks for the advice :o)

Ben

you're better off using the STL if you can, their algorithms are far superior to what most of us could write.

you're better off using the STL if you can, their algorithms are far superior to what most of us could write.

By using STL, do you mean maps vectors etc and avoiding btrees?

Thanks

Ben

actually, the STL containers may in fact be implemented with btrees, but that is transparent to you the programmer. u just stick the elements in and let them worry bout the implementation, and when u pop them out it will always be in linear time. for your strings u could just use a simple vector.

actually, the STL containers may in fact be implemented with btrees, but that is transparent to you the programmer. u just stick the elements in and let them worry bout the implementation, and when u pop them out it will always be in linear time. for your strings u could just use a simple vector.

Excellent time for me to invest some more research into STL, since the most important factor in this is speed perhaps STL is the best option I think perhaps I will make out some test code this afternoon to speed check it, if that isn't up to what I need I can always go back and try and impliment my own B-trees later, you are right though STL is there why reinvent the wheel (yes you can tell I usually code perl :op )

Thanks

Ben

Those are some great links, I have been doing quite a bit recently with map and a little stuff with vector, but beyond using the named array functionality I have done much investigating, good stuff.

Thanks
Ben

This article has been dead for over six months. Start a new discussion instead.