Hello, I am developing an indexer that indexes HTML pages. My problem lies in creating a global index of all the stored pages in the repository. I need a method suitable for quick retrieval and insertion of new data. I have tried using the STL map, but I do not know how it will perform with so much data in it. I was told a B-tree can do it.

What are your suggestions for a problem like this?

Thanks

Usually std::map is implemented as a red-black tree, not a B-tree (Bayer tree).

I have tried using the STL map, but I do not know how it will perform with so much data in it.

How much data is 'so much data'? And why haven't you just given it a try yet?

How much data is 'so much data'? And why haven't you just given it a try yet?

I am looking at up to a billion entries. I don't have that much data yet, but I would like to know if someone has an idea of how it will perform, especially when loading from a file.

I am looking at up to a billion entries. I don't have that much data yet, but I would like to know if someone has an idea of how it will perform

The STL map class has performance guarantees that match a balanced search tree. The height stays small even for a billion items: at most about 2*log2(n) for a red-black tree, which is roughly 60 levels for n = 10^9, so lookup is quick. When lookup is quick, insertion and deletion are quick too.

Really all you need to worry about with performance is paging and locality of reference. If there are a lot of items in memory that sit in different pages, the computer will be swapping pages in and out, and that slows memory accesses. If there are so many items that RAM cannot hold them all, the whole machine can slow down as memory gets swapped in and out of virtual memory.

especially when trying to load from a file.

Files are slow. There is no getting around that. :) If loading the whole file is too slow, you can use windows into the file and only process part of it at any time. Just make sure that the file format makes seeking possible.

Files are slow. There is no getting around that. :) If loading the whole file is too slow, you can use windows into the file and only process part of it at any time. Just make sure that the file format makes seeking possible.

I love this. Is it possible to seek within a map while it is saved in a file? How do you suggest I make windows for the file?

The main question now is: how do I ensure good performance when loading an STL map of a billion pages into memory? I need your suggestions.

Thanks in advance

Is it possible to seek within a map while it is saved in a file? How do you suggest I make windows for the file?

You can seek in a file, and load from a file into a map. Unless you want to use a third party external B-tree library, that is probably as good as it gets. As for the window idea, it is nothing more than looking at a smaller part of the file at once, and loading another small part when searching leaves that view.

The main question now is: how do I ensure good performance when loading an STL map of a billion pages into memory?

There is no such thing as great performance when working with huge files, or huge numbers of files, that I know of. How do you know the performance is unacceptable when your code has not yet worked with that amount of data? Are you sure you are not optimizing prematurely?
