| | |
Fast data storage
Please support our C++ advertiser: Intel Parallel Studio Home
![]() |
Hi
I am needing to store some data for a project I am working on, I am trying to figure out the fastest method to access the stored data, the data takes the form:
keyword (std::string) - list of ints
ie:
yellow - 2342 2312 8478 3827 9773 4837 2893 0983 478 2981, etc
I need to store this information in a memory based cache, size isn't really in issue the machine this will be running on has 8g of ram of which I can freely use about 2g.
I am thinking a Btree would be best and then store the ints in a vector, ie:
Btree node yellow:
vector<int>
vector<int>
More background this is a searchengine previously written in perl, which came up against some obvious limitations when it started getting >10 hits a second, it is being reimplimented in C++ to remove some of the limitations of perl.
The data was previously stored in a large hash of arrays:
$hash{yellow} = (int, int, int, int, int, int, int);
So data was retreived pretty quickly.
I am looking at about 2000 keywords and about 3000 ints which can repeat in different keywords, when a request is made I need to be able to pull all the ints in the requested keyword.
I will be able to figure out the code without too many problems I am just looking for some recommendations on the most efficient way to store this data for rapid retrival. I read that Btree is the quickest when dealing with strings or character arrays (I can convert the data to be char[] from std::string if need be), and that linked lists are better for smaller amounts of data. I could use map<std::string, vector<int>> I guess (I am not sure how efficient that would be though).
So if anyone has recommendations on the best methods please share
Thanks
Ben
I am needing to store some data for a project I am working on, I am trying to figure out the fastest method to access the stored data, the data takes the form:
keyword (std::string) - list of ints
ie:
yellow - 2342 2312 8478 3827 9773 4837 2893 0983 478 2981, etc
I need to store this information in a memory based cache, size isn't really in issue the machine this will be running on has 8g of ram of which I can freely use about 2g.
I am thinking a Btree would be best and then store the ints in a vector, ie:
Btree node yellow:
vector<int>
vector<int>
More background this is a searchengine previously written in perl, which came up against some obvious limitations when it started getting >10 hits a second, it is being reimplimented in C++ to remove some of the limitations of perl.
The data was previously stored in a large hash of arrays:
$hash{yellow} = (int, int, int, int, int, int, int);
So data was retreived pretty quickly.
I am looking at about 2000 keywords and about 3000 ints which can repeat in different keywords, when a request is made I need to be able to pull all the ints in the requested keyword.
I will be able to figure out the code without too many problems I am just looking for some recommendations on the most efficient way to store this data for rapid retrival. I read that Btree is the quickest when dealing with strings or character arrays (I can convert the data to be char[] from std::string if need be), and that linked lists are better for smaller amounts of data. I could use map<std::string, vector<int>> I guess (I am not sure how efficient that would be though).
So if anyone has recommendations on the best methods please share

Thanks
Ben
Application development, webhosting, and much more: www.webcentric-hosting.com
•
•
Join Date: Mar 2004
Posts: 1,620
Reputation:
Solved Threads: 51
Hello Ben,
I do not remember the Com Sci specifics in class, but if you can sort your data in memory, then a Btree should be the fastest search. You will have to design your code though so that the tree is evenly balanced. Of course, if the data is random, then good luck with you there. But if you can sort it, I would B-TREE it.
Your worst case is going to be a linked-list situation where your data is stored in the last node of the data stream. Then again, if you make a lopsided B-TREE, then you are left with the linked list.
If possible, you might be able to also build into your data bucket, a repition field, so that when you are loading your data structure, if you have a repeat of the exact data, you can increment a repeate variable, so that you know right away that your data set had X instances of the same information.
Good luck coding it. And if your compiler supports profiling (generating statistics), you might find some interesting things out.
Christian
I do not remember the Com Sci specifics in class, but if you can sort your data in memory, then a Btree should be the fastest search. You will have to design your code though so that the tree is evenly balanced. Of course, if the data is random, then good luck with you there. But if you can sort it, I would B-TREE it.
Your worst case is going to be a linked-list situation where your data is stored in the last node of the data stream. Then again, if you make a lopsided B-TREE, then you are left with the linked list.
If possible, you might be able to also build into your data bucket, a repition field, so that when you are loading your data structure, if you have a repeat of the exact data, you can increment a repeate variable, so that you know right away that your data set had X instances of the same information.
Good luck coding it. And if your compiler supports profiling (generating statistics), you might find some interesting things out.
Christian
Excellent I have a good book covering how to build well balanced B-Trees so I guess tonight in bed I will be reading up on those chapters.
I am using g++ 2.95 as the compiler on Solaris 8 sparc, and intel and linux 2.40.20 on x86 all of which support profiling I believe, so I will check that out.
Thanks for the advice :o)
Ben
I am using g++ 2.95 as the compiler on Solaris 8 sparc, and intel and linux 2.40.20 on x86 all of which support profiling I believe, so I will check that out.
Thanks for the advice :o)
Ben
Application development, webhosting, and much more: www.webcentric-hosting.com
•
•
•
•
Originally Posted by infamous
you're better off using the STL if you can, their algorithms are far superior to what most of us could write.
Thanks
Ben
Application development, webhosting, and much more: www.webcentric-hosting.com
•
•
•
•
Originally Posted by infamous
actually, the STL containers may in fact be implemented with btrees, but that is transparent to you the programmer. u just stick the elements in and let them worry bout the implementation, and when u pop them out it will always be in linear time. for your strings u could just use a simple vector.
Thanks
Ben
Application development, webhosting, and much more: www.webcentric-hosting.com
•
•
Join Date: Mar 2004
Posts: 77
Reputation:
Solved Threads: 2
i am a mainly C programmer(i do know C++ a bit, but not as well as C), but i decided to mess with STL a few weeks ago for a school project in AI class. i was amazed at how simple it was to pick up. check out this:
http://www.icce.rug.nl/documents/cplusplus/
http://www.camtp.uni-mb.si/books/Thi...Chapter04.html
http://www.icce.rug.nl/documents/cplusplus/
http://www.camtp.uni-mb.si/books/Thi...Chapter04.html
Those are some great links, I have been doing quite a bit recently with map and a little stuff with vector, but beyond using the named array functionality I have done much investigating, good stuff.
Thanks
Ben
Thanks
Ben
Application development, webhosting, and much more: www.webcentric-hosting.com
![]() |
Similar Threads
- News Story: Understanding Data Storage and De-Deduplicating (Storage)
- How to encrypt data during storage? (Oracle)
- Suitable DTD for the data storage design (XML, XSLT and XPATH)
Other Threads in the C++ Forum
- Previous Thread: confused with errors in c++
- Next Thread: Mode 13 graphics,Problem with io streams and new operator
| Thread Tools | Search this Thread |
api array arrays beginner binary bitmap c++ c/c++ calculator char class classes code compile compiler console conversion convert count data database delete desktop developer directshow dll download dynamic encryption error file forms fstream function functions game generator getline givemetehcodez google graph gui homeworkhelper iamthwee ifstream input int integer java lib linkedlist linker linux loop looping loops map math matrix memory multiple news node number output parameter pointer problem program programming project proxy python random read recursion recursive return string strings struct temperature template templates test text text-file tree unix url variable vector video visualstudio win32 windows winsock word wordfrequency wxwidgets






