I currently am using the optimized scaler 32-bit md4 and md5 implementations from here: http://freerainbowtables.com/phpBB3/viewtopic.php?p=8454#p8454. I am get ~8.1million hashes/second for md4, ~6.6million for md5.

I am looking for a sse2-accelerated implementation for core 2, primarily md5 - which is easy to implement and at the very least twice as fast as my current speed. This will probably compute multiple hashes in parallel - here is the format I am thinking:

each plaintext would be fed in in an array as follows:

char candidate[9];

the array would then be padded and the length appended, or whatever is necessary to prepare it for the md5 compress function. The candidate plaintexts should all be 32 bytes after padding (correct me if I am wrong). Each 32-byte plaintext would then be fed into a 2D array of the following structure:

unsigned char vect2enc[4][32];

this buffer would hold 4 32-byte padded plaintexts, for example. vect2enc would then be fed into a sse2-accelerated md5 compress function, and the resultant 32-byte md5 hashes would be stored back in their respective elements in vect2enc where I could refer to them one-by-one by array element.

The goal is to generate plaintexts one-by-one, pad them one-by-one, store them one-by-one in 2D vect2enc array, encrypt the entire array simultaneously in SIMD parallel, and then test each element one at a time with memcmp(). I need the padding code (which can be optimized for length as I will know the length of every plaintext before I encrypt it) and the compress function (which needn't encrypt only 4 hashes simultaneously - I have seen people get higher benchmarks encrypting 8 hashes simultaneously).

Anybody think they are up to it? Price estimate? Benchmark estimate? I am not committing to anything yet, just want to check around to see how much this would cost. And if I am misunderstanding this completely and for some reason it is impossible to do this in the method I have described, please educate me.

Post padding isn't the optimal solution. Pre-padding is!

Do a 32 bit write each write no loop!

unsigned char vect2enc[4][32];
uint32 *pAry = (uint32 *) vect2enc;

*(pAry+0) = 0;
*(pAry+1) = 0;
*(pAry+31) = 0;

Of course this would be much faster in assembly code.

mov eax,0
  mov [ebx+0],eax
  mov [ebx+4],eax
  mov [ebx+8],eax
   mov [ebx+124],eax

As to your SIMD. I've used SIMD in data encoding before but I thiink you're under a misconception.
SIMD is 128-bit so you can grab 16 characters at a time, but from one character string as they're sequential.
You would have to do four grabs, with a 16 byte offset, spend time swizzling the data.

Do a google search on AoS (Array of Structures) vs SoA (Structure of Arrays).

You can try to do this yourself. Take the original function and write it in simple C code but vectorize it. That is orient the data for your 16 character handling! Then orient the code to process day in that fashion! You'll need a working C code to bench mark the assembly code against so keep the original and vectorized C code around.

I think I found a solution that works for me. Thank you!