alright, so I recently obtained a reference implementation of MD5. I benchmarked it, and it ran about 30% faster than OpenSSL's MD5. The function call is as such:

fast_MD5(unsigned char *pData, int len, unsigned char *pDigest);

it worked great for a simple MD5 hash cracker, but I had trouble tweaking it to work for the MD5-HMAC for my Kerberos 5 cracker. The MD5-HMAC (using OpenSSL) involves the following:

MD5_Init(&ctx);
MD5_Update(&ctx,k_ipad,64);
MD5_Update(&ctx,(uint8_t*)&T,4);
MD5_Final(K1,&ctx);

MD5_Init(&ctx);
MD5_Update(&ctx,k_opad,64);
MD5_Update(&ctx,K1,16);
MD5_Final(K1,&ctx);

in addition to other steps. Note that the MD5_Update is called twice, in order to concatenate two buffers and decrypt the concatenated result. But the fast_MD5 function does not give me the flexibility to do it this way, so I tried this:

unsigned char feedbuffer[100];

memcpy(feedbuffer,k_ipad,64);
memcpy(feedbuffer+64,(uint8_t*)&T,4);
fast_MD5(feedbuffer,68,K1);

memcpy(feedbuffer,k_opad,64);
memcpy(feedbuffer+64,K1,16);
fast_MD5(feedbuffer,80,K1);

and it worked (take buffer1 and buffer2 and concatenate them into buffer3 using memcpy and then decrypt the resultant buffer3), but it ran slower than before when I was using OpenSSL!

I thought this might have something to do with RAM access speed versus L2 cache access speed, so I tried this, hoping it would fix the problem:

unsigned char feedbuffer[100];

for(int i=0;i<64;i++)
	feedbuffer[i]=k_ipad[i];
memcpy(feedbuffer+64,(uint8_t*)&T,4);//I didn't know how to convert this...
fast_MD5(feedbuffer,68,K1);

for(int i=0;i<64;i++)
	feedbuffer[i]=k_opad[i];
for(int i=0;i<16;i++)
	feedbuffer[i+64]=K1[i];
fast_MD5(feedbuffer,80,K1);

but again, it ran slower than before!

Is there some trick to accomplish what I am trying to accomplish faster? I don't understand, because I looked at the source code for the MD5 implementation in rfc1321 and the MD5_Update function used memcpy to process the input. So why should it run slower if I am performing the same operation outside the function, rather than inside?

Please help! Thanks!

Memory alignment?

See if your data is 16 byte aligned. If the internal algorithm is using SIMD it will run faster when aligned. If misaligned data has to be double copied, or accessed slower.
memcpy is also faster if properly aligned!

How do I make sure the data is aligned? Still sort of new to C++. I don't think the reference implementation uses SIMD, but I'll check. Is that even possible? In order to do SIMD MD5, wouldn't I have to pass in a vector of candidate passwords, rather than a single candidate password?

If it is possible to use an SIMD reference implementation and call the MD5 function is the same manner as normal (with a single plaintext rather than a vector of plaintexts) that would be awesome! Where could I get one of these? It would probably have to be specifically tailored to my processor, wouldn't it? Would it have to be in assembly language?

You mentioned a MD5 hash. If you analyze the algorithm you'll find that a degree of it can be done in parallel for a single hash.

AND of course if you have multiple hashes, they could be done simultaneously in parallel!

For fun, print the address of the hash buffer. If it ends in 0 0x43454230 then you're 16 byte aligned, but if non-zero, then you aren't!

Even if you aren't SIMD, if you aren't four byte aligned, you will definitely run slow. (Meaning hex address ending in {0, 4, 8, C})

Some compilers allow you to force local scope alignment at compile time. But for all compilers, merely over allocate!

You need a 64 byte buffer?

byte raw[ 64 + 15 ];
byte *pRaw = (byte *)(  ((int)(raw + 15))  & ~15);

(something like that, will force pRaw to be aligned on a 16 byte boundary!)

I checked and my reference implementation does not use SIMD, I'm pretty sure.

I printed the addresses of some of the buffers in the following manner:

cout<<"address of k_ipad is "<<&k_ipad<<"\n";
cout<<"address of k_opad is "<<&k_opad<<"\n";

I'm not sure if that is the correct way to do it. The address varied from run to run. The first time I got this:

002DF41C
002DF3CC

and the second time I got this:

0037F484
0037F434

Is this sufficient to tell if I am properly aligned?

I found this line of code in the reference implementation I am using:

// For the hash working space
//__attribute__((aligned(16))) UINT4 data[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
//__declspec(align(16)) UINT4 data[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
UINT4 data[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

__attribute__ doesn't work with my compiler, but __declspec__ does. I'm assuming this makes sure that data[64] is 16-byte aligned?

could I use a similar method to ensure that my buffers are properly aligned? Ex.

__declspec__((aligned(16)))unsigned char k_ipad[80];

(As a side note, I did a little benchmarking of my reference implementation with the __declspec__ enabled and disabled and this didn't seem to make any difference at all. Does this mean my compiler already ensures proper alignment? I am using Microsoft Visual C++ 2008 Express Edition).

Without them you're on four byte alignment!
Last hex digit of address indicates alignment off...
16 byte aligned {0}
8 byte aligned {0, 8}
4 byte aligned {0, 4, 8, c} ***your default***

Is assembly code in the fast_MD5? It could be well ordered assembly to minimize stalls, etc.


declspec should have aligned you to 16 byte, but if no difference then not your problem!

I figured out that the primary problem is that my implementation is optimized for lengths <32 chars, and I am using larger ones. For larger lengths, the performance benefit is not as defined.

I aligned all my unsigned char buffers using __declspec, but didn't see any performance increase :( I checked and indeed, I consistently had a hex address ending in zero. But no speedup. Any way I'm doing something wrong?

This question has already been answered. Start a new discussion instead.