I'm trying to work out what the best structure is for these requirements.. will happily work on it, just wondering on the best direction to go in / if anyone knows from experience or reason what the best type of storage structure is..

- I 'generate' up to 20,000 integers, and I need to make sure they are unique. I have no desire to keep track of which integers I've seen, as long as I can determine whether I've seen one before, if that makes sense - I don't need to get them back again from the structure.

- I can roughly predict the distribution of the numbers, I know, for example, the possible 'next' numbers after seeing any given number, I also know that the numbers probably expand from a single point, i.e, in 1-D, if the first number is one, the second numbers will be 0 and 2, if the first number is 2, the next ones will be 1 and 3, and so on.. I'm not considering negatives. However, I only have a rough idea that will happen - and, its certainly the case that the next two numbers might be +100 and -100, or any other symetric distance from the first number ( again, negatives can't happen, so -100 would only be possible if the first number is >= 100 ). The numbers don't necessarily start from 0, infact, its very unlikely that the first number will be 0.

At the moment I'm using a set [ red/black tree ], all is well for one test - and 20,000 is a worst case situation, i.e. the algorithm this is for aborts if it reaches 20,000 iterations. 300-1000 numbers is more likely; but, there's a chance that, every half second, a worst case example occurs, and worst cases aren't infrequent. Also, I want to increase the worst case level, because 20,0000 is sometimes not enough! In profiling, the set looks like a speed bottleneck in the system at these extreme cases, and I guess it's a memory bottleneck too, since it must store a whole integer at every 'leaf'.

By the way, a matrix/vector with a flag bit for every possible input isn't gonna be good ( although there is a well defined non-infinite input range ), because I dont want every case to suffer just to make the worst cases more efficient, and in most cases, only a relatively tiny number of integers are ever seen. Anything 'fuzzy' isn't good either, i.e. imperfect hashing -- since I can't forget I've seen a number, or remember I've seen a number that I haven't -- and I can't make a perfect hash that isn't the same size as the inputs.

I'm thinking, of using a dynamic vector or link list of ordered ranges; that is - if I see 0, 1, 2, 3, 4, 5, 6 - i would have one range object ( 0 - 6 ) at the beggining of the list, so anything within that range can be ignored; equally, if I see 20, 21, 28, 29, 30, i'd have two range objects ( 20 - 21 ) and ( 28 - 30 ) somewhere at the front of the list.. Kinda like file compression/allocation in realtime. Obv. that has worst memory case of 2 x the number of possible values, but, that's unlikely... Also, that structure might end up having a long lookup or write time, depending on whether it's a linklist or a vector. Maybe I could make it hierachical, but then it's gonna end up looking alot like a set...

Anyway.. anyone know of a better way to do this, or have any ideas?

> By the way, a matrix/vector with a flag bit for every possible input isn't gonna be good
Why not? unsigned long flags[625]; // 20,000 bits. Testing to see if 'val' has been seen is just

if ( flags[ val / 32 ] & ( 1ul << ( val % 32 ) ) ) {
  // seen it

It might look long winded, but any decent compiler will optimise all those operations involving 32 into relevant bit-wise operations.
The result is an expression which evaluates in a handful of machine instructions in constant time.

All that red/black tree and lists of sub-ranges looks like complication for the sake of it.

The code is simple, the data structure is compact.

Hm, I have looked into using arrays of uints with bit indexing, but, the 20,000 limit is the max number of iterations ( the maximum number of numbers, not the maximum possible number ), the bounds for possible numbers gets much higher. The numbers represent 3D points fitted to a given resolution, in all directions. An area of size 100 x 100 x 100 ( with 1 unit per discrete point ) isn't unreasonable, and about the maximum space ever is probably bounded at 1000 x 1000 x 100. That would be 3,125,000 uint32_ts allocated for _any_ operation in that area; even though the actual number of points explored is only ever going to be 20000; that being an artificial limit solely because letting the algorithm run throughout the entire space isn't feasible, but, it has to be allowed to explore any part of the space that it chooses to.

I've looked at using small block allocated arrays of about 100 unsigned ints at a time, starting at around the first number and then potentially expanding in any of 4 directions ( because 0 actually lies right in the middle of the area ); but, the fact that so much space is wasted ( examining the memory shows huge swathes of 0's ), leaves me a little dis-satisfied with that approach.

It's called the space-time trade off.

12MB isn't much at all for the average desktop system. If you've got the memory and it gives you quick results, I'd say go for it.

How much memory are you wasting with a RB tree maintaining all the overheads of the data structures, and only using say only 15 bits of a 32-bit integer?

Have you calculated the size of your completed alternate data structures?

Well, the set/rb tree does get massive if it ever reaches 20,000 iterations. Way more than storing 20,000 discrete spaces as an array of bits.

I think, I'm gonna look into using small arrays again; so... I'm gonna convert each discrete point into a cell index, a word index ( 32 or 64 bit words ) and an offset ( the bit ). If I keep a sparse array of pointers to dynamically allocated arrays of words ( cells ), one cell for every 100 words for example, then the space can grow in blocks of KBs, rather than starting off being MBs. Since it can be sparse at the highest level, all the untouched chunks of 100 words don't ever need to exist - any 'seen' query to those cells is false. It still might turn out that the cells of words that do exist are mostly 0's, though.

I haven't started on the range structure, I'm just toying with the idea, really. If it's always the case that space fills in contiguous blocks of 1s, it might be OK, but, it needs a sizeable logic overhead compared to either a set, or array(s). Also, if space doesn't fill up in contiguous blocks, then it'd be a terrible structure, since it'll have to store 2 values for every detached point. The worst case would then be, assuming 20,000 non contiguous points have been seen, that the size of the structure was 40,000 index types ( potentially 64-bit indexes ) plus any constant memory requirement of the structure. However, if anything up to MAX_LONG points are seen, and they are all contiguous, it would only have a mem requirement of 2 index types. The lookup time would be linear with respect to the number of points seen, but the write time might be complex since the structure might have to totally reorganize, to say, insert a range in the middle of tight existing ranges.

So, I will perhaps do both, compare and let you know the results. At the moment, I'm considering it might be a premature optimization, but, I haven't really pushed this algorithm as of yet, so, it may yet become critical.

Just one question.
How many times are you going to be running this code?

On a bad day, once or twice a second... Multiple agents may request the use of this algorithm, multiple times during a simulation. A badly configured agent could ask for this algorithm repeatadly ( every 20Hz update cycle ). A well configured agent shouldn't ask for it more than 10 times a minute, but, with multiple agents, they have to stagger the request to this algorithm, or they all ask at the same cycle. Normally that's fine, since the algorithm rarely has more than a 100 x 100 x 1 search space, and secondly, the likely number of iterations is either 100-1000, or exactly 20,000 ( at the cut off ). So, when no worst cases arise, its quite speedy with a small memory footprint, even when a set is used.

It doesn't have to be threadsafe, though, so, I could feasibly re-use any memory I allocate, if it proved to be worthwhile to do so.

In case you're interested, the algorithm in question is the A* graph search in dynamic 3D space, I don't keep a grid/graph of such representing the 3D space, so this structure represents my 'closed list', which is built on the fly by assigning each discrete point in the search space ( at a given resolution ) a unique identifier, the identifiers are just relative 3D co-ordinates of each point in the space, represented as a single index. The identifer of each visited point in the space gets put in the closed list, because visiting a point represents following the shortest weighted path to that point ( weights are generated by consulting the 'real' configuration of the space ). If there's a clear path from the initial point to the goal point, the algorithm terminates very quickly and without covering many points. If the space is full of local minima, or worse, if there's no path to the goal, the algorithm does a breadth first search of the entire search space, so at 20,000 iterations, it's just the tip of the iceburg, so to speak.

As for strategies to deal with this that don't involve the structure itself, there are a multitude, first, the space can be searched at an enormous resolution, meaning that a 100 x 100 x 1 space could cover the entire environment, only considering large obstacles, and then using reactive steering to avoid smaller obstacles; second the space can be searched using a true node-arc graph rather than a graph interface which serves points and adjacencies on the fly ( but, generating a true node-arc graph based on obstacle triangulation proved more costly than using a proxy graph that thinks it has a node at every discrete point in reference cases ). Also, the results can be ( are ) cached by the agents. Most probably, I will limit this type of search to 2D, and use node-arc graphs for multiple 3D layers or wide space searches, or cut out the 'up' component in 3D environments, where its possible to do so.

But, in summary, this is an algorithm that the agents use quite frequently, so a well planned store of the vistied node indices for both 'real graphs' and graphs that just 'pretend to exist', as this one does, is quite important.