Hello,
I have a difficult memory error to find. I have narrowed it
down to a small piece of code, tiny in fact, but the symptoms
are baffling to me. I tried to find it with electric fence, but
electric fence found nothing. The problem is complicated by
my liberal use of the STL, in particular deques and maps.

I have written a large bit of calibration code that works
satisfactorily. I am now embedding it into a production system
which is to run nightly. The production system has a pointer to
the calibration class, which gets initialized to zero in the constructor,
and then is set equal to new CalibratorClass(...) after some
(substantial) calculations are done in another function.

The calculations are done successfully, and the pointer is
initialized pointing to new, healthy-looking (in the debugger)
CalibratorClass. Near the beginning of the next function call,
a member function of this class is called (it initializes some more of
the data members). The instance of the CalibratorClass looks healthy
right up until the call. However, once we step inside, the class
is corrupted. The reason is that the "this" pointer does NOT correspond
to the healthy looking object I examined the line before, but is
a completely different pointer. Somehow at the function call
the "this" pointer got changed.

Can anyone suggest an avenue to explore here? The only thing I can
think of is that the vtable got corrupted, but this member
function isn't virtual. My next step is to try Rational Purify, but
it looks VERY expensive (~$6000). Are there any other free/cheap
memory-error finders anyone can recommend? Does this symptom
sound like anything familiar?

Dave

This is really eating at me, because I remember seeing the exact same symptoms in a project. It was a very long time ago, and I'm having trouble remembering what the problem was, but I do remember it was painful to find. I can at least say a few things:

First, it is not a corruption of the vtable: you mentioned that you inspected the this pointer inside the called member function, and you would not have been able to reach that point if the vtable were corrupt.

Second, from your description, there is no reason why this shouldn't work. There is something you neglected to mention. The devil is in the details.

You must first weed out any possible API and ABI mismatch problems:

First, make sure that no old versions of the source code or header files remain in your code-base; they are liable to silently corrupt the build.

Second, make sure that you clean and purge all your binaries and that you make a fresh build of the entire code-base. It goes without saying that all the code has to be built with the same compiler and the same compiler options.

The purpose of the above two steps is to make sure that you don't have things like a function prototype in a header that doesn't correspond to the compiled code but links to it anyway (yes, it is possible). The second problem is ABI mismatch, meaning that one piece of compiled code expects a different layout for a class than another piece of compiled code. These things sometimes happen when building incrementally or when stray source files from an older version are lying around (sometimes a benign change in include-path order causes the wrong header to be used).

I'm mentioning this because one possible source of the problem is a mismatch in the calling convention used. What is the calling convention? Where did you inspect the value of the "this" pointer before and after the call: on the stack or in the ecx register?
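One way to check this without relying on the debugger's view is to log the pointer on both sides of the call. This is only a minimal sketch: the class body, initMore() and thisMatches() are hypothetical stand-ins, not your real code.

```cpp
#include <cstdio>

// Hypothetical stand-in for the real class; only the logging matters here.
class CalibratorClass {
public:
    // Non-virtual member: prints and returns the address the callee
    // actually receives as `this`.
    void* initMore() {
        void* self = static_cast<void*>(this);
        std::printf("callee: this = %p\n", self);
        return self;
    }
};

// Logs the pointer on the caller side, makes the call, and reports whether
// caller and callee agree on the object's address.
bool thisMatches(CalibratorClass* calib) {
    void* seenByCaller = static_cast<void*>(calib);
    std::printf("caller: ptr  = %p\n", seenByCaller);
    return calib->initMore() == seenByCaller;
}
```

In a healthy build the two printed addresses are identical and thisMatches() returns true. If they differ in your real code, that points at the call itself (calling convention, stale object files) rather than at the object's contents.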


Another thing I suspect is a problem related to cross-modular code. Running code across different executables and DLLs is much more subtle and troublesome than you might expect, and I ran into this kind of problem myself when writing cross-modular code.

For example, if you are allocating dynamic memory or exchanging STL components between DLLs and/or executables, then you are in for a world of trouble.


Another potential source of the problem is stack corruption. If you say that the pointer value changes, it means that it is either read from the wrong place by the callee (mismatched calling convention) or it is overwritten some time after the member function pointer look-up in the vtable, and before the function call. If the pointer is passed on the stack (as in cdecl), it might reside on the stack for a little while before the function call (as an optimization) and maybe you are overwriting it before the call. Did you inspect the stack frame before the call?


Those are all the potential causes I can think of. You definitely need to provide more details about the code, and post some relevant parts of it, if you can.

Hi and thanks for the replies, especially for Mike's thoughtful reply -- this is the
kind of advanced insight I was looking for.

Having to cull this problem out of my rather large (large for me) code base made me do something I should have done already, which is comment out bits and pieces
and see if it could be narrowed down the old-fashioned way.

I did and for those that are interested, the resolution was this (much
more boringly prosaic than Mike's interesting suggestions):

I was using a map<int,double> to store arrival probabilities of hits
of a certain size -- since the hit sizes aren't known a priori this
was a simple way to do it, just add in a new element to the map when
a hit size you haven't seen before arrives.

At the end, though, I needed to go back, add up all the hits, and divide
by the total. I did it like this

for(map<int,double>::iterator sizeFreq = _sizeFreqHitMap.begin();
    sizeFreq != _sizeFreqHitMap.end(); sizeFreq++)
    sizeFreq->second /= totalProb;

Apparently, you're not allowed to do this. I looked at it and began to wonder
if the iterator can write back to the map, and I wasn't sure, so I rewrote it as


for(map<int,double>::iterator sizeFreq = _sizeFreqHitMap.begin();
    sizeFreq != _sizeFreqHitMap.end(); sizeFreq++)
    _sizeFreqHitMap[sizeFreq->first] = sizeFreq->second / totalProb;

And the problem went away. I still don't really understand
why this is wrong usage, but if so, perhaps the compiler should
catch this. Any insights you guys have would be greatly appreciated.

Dave

>>Apparently, you're not allowed to do this.

I doubt that very much. What is your source for this information?
As far as I know, and I'm pretty sure of this, there is nothing wrong with that loop. You are certainly allowed to modify the value element (second) through a map iterator (of course, the "first" part cannot be modified; it is const). Besides, the alternative (the second loop you showed) is far less efficient: every operator[] call is an O(log N) look-up, so it runs in O(N log N) instead of O(N) for the original loop.
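To make that concrete, here is a small sketch of both variants side by side. The function names are made up for illustration, and the assert on the total is an extra guard that is not in your original code.

```cpp
#include <cassert>
#include <map>

// Normalize in place through the iterator: one pass over the map, O(N).
// Writing to it->second is allowed; only it->first (the key) is const.
void normalizeInPlace(std::map<int, double>& freq, double total) {
    assert(total != 0.0);  // added guard, not part of the original loop
    for (std::map<int, double>::iterator it = freq.begin(); it != freq.end(); ++it)
        it->second /= total;
}

// Same result through operator[]: every assignment repeats an O(log N)
// key look-up, so the whole loop is O(N log N).
void normalizeByLookup(std::map<int, double>& freq, double total) {
    assert(total != 0.0);  // added guard, not part of the original loop
    for (std::map<int, double>::iterator it = freq.begin(); it != freq.end(); ++it)
        freq[it->first] = it->second / total;
}
```

Both functions leave the map with the same contents. Assigning through operator[] to a key that already exists does not insert or erase anything, so neither version invalidates the iterator.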

The problem that you are having is memory corruption, and one of the very annoying things about memory corruption is that a small change in the code shifts the memory layout a little bit and makes the problem "invisible", even though it is still there. That is what happened here. You took a perfectly valid piece of code, replaced it with another perfectly valid piece of code, and, all of a sudden, the problem disappeared. It didn't really disappear; the memory corruption is simply no longer hitting anything critical. But this is a time-bomb. There is no telling what is being corrupted and what effect that might have (at best, it does nothing; at worst, it corrupts the results of your calculation). And, more importantly, you could make another trivial modification to your code and the memory corruption problem would come back. This is not over.

You must review the code very carefully, and try to spot any code that might write to memory that it should not write to (like an uninitialized pointer, going out-of-bounds of an array or vector, etc.).
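For vectors and deques, one cheap aid during that review is to temporarily replace operator[] with at(), which checks the index. A minimal sketch (checkedStore is a hypothetical helper, just to show the behaviour):

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: writes through at(), which checks the index and
// throws std::out_of_range, whereas an out-of-bounds operator[] is
// undefined behaviour and may silently overwrite unrelated memory.
bool checkedStore(std::vector<double>& v, std::size_t index, double value) {
    try {
        v.at(index) = value;
        return true;
    } catch (const std::out_of_range&) {
        return false;  // the write was rejected; the vector is untouched
    }
}
```

This only catches out-of-range indices on containers that provide at(). It does nothing for raw arrays or for writes through uninitialized or dangling pointers.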

Ok, Mike, thanks, I'll take your warning. Looks like a job for Purify,
unless you can suggest a better/cheaper tool?
