I am currently using the optimized scalar 32-bit md4 and md5 implementations from here: http://freerainbowtables.com/phpBB3/viewtopic.php?p=8454#p8454. I am getting ~8.1 million hashes/second for md4 and ~6.6 million for md5.
I am looking for an SSE2-accelerated implementation for Core 2, primarily of md5, that is easy to integrate and at the very least twice as fast as my current code. It will probably compute multiple hashes in parallel. Here is the format I have in mind:
Each plaintext would be fed in as an array, as follows:
char candidate[] = "plaintext";
The array would then be padded and the length appended, or whatever is necessary to prepare it for the md5 compress function. The candidate plaintexts should all be 32 bytes after padding (correct me if I am wrong). Each 32-byte plaintext would then be stored in a 2D array with the following structure:
unsigned char vect2enc[4][32];
This buffer would hold 4 32-byte padded plaintexts, for example. vect2enc would then be fed into an SSE2-accelerated md5 compress function, and the resultant 32-byte md5 hashes would be stored back in their respective elements of vect2enc, where I could refer to them one by one by array element.
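To make the padding step concrete, here is a minimal sketch in C. It assumes the standard MD5 padding from RFC 1321, under which the compress function consumes 64-byte blocks (not 32) and the digest is 16 bytes, and it assumes every candidate is at most 55 bytes so that everything fits in one block. The function name md5_pad_single_block is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pad one short message into a single 64-byte MD5 block.
   Assumes len <= 55 so the 0x80 marker and the 8-byte length field both fit. */
void md5_pad_single_block(const unsigned char *msg, size_t len,
                          unsigned char block[64])
{
    assert(len <= 55);

    memset(block, 0, 64);        /* zero fill covers all the padding bytes */
    memcpy(block, msg, len);     /* the plaintext itself */
    block[len] = 0x80;           /* a single 1 bit right after the message */

    /* Message length in bits, stored little-endian in the last 8 bytes. */
    uint64_t bits = (uint64_t)len * 8;
    for (int i = 0; i < 8; i++)
        block[56 + i] = (unsigned char)(bits >> (8 * i));
}
```

Since the length of each plaintext is known in advance, the memset and the length bytes could be written once per length and only the first len + 1 bytes refreshed per candidate.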
The goal is to generate plaintexts one by one, pad them one by one, store them one by one in the 2D vect2enc array, encrypt the entire array simultaneously in SIMD parallel, and then test each element one at a time with memcmp(). I need the padding code (which can be optimized for length, as I will know the length of every plaintext before I encrypt it) and the compress function (which needn't be limited to 4 hashes at a time; I have seen people get higher benchmarks encrypting 8 hashes simultaneously).
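As a rough illustration of the parallel part, the sketch below shows how four padded blocks might be interleaved for SSE2 and how the first MD5 step could run on all four at once. It is not a full compress function. It assumes an x86 target with SSE2 and 64-byte blocks as in RFC 1321, and the names rotl32x4, md5_f_x4, interleave_x4 and md5_step0_x4 are invented for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Rotate each 32-bit lane left by n bits; SSE2 has no rotate instruction. */
static inline __m128i rotl32x4(__m128i x, int n)
{
    assert(n > 0 && n < 32);
    return _mm_or_si128(_mm_slli_epi32(x, n), _mm_srli_epi32(x, 32 - n));
}

/* MD5 round-1 function F(b,c,d) = (b & c) | (~b & d), on four lanes. */
static inline __m128i md5_f_x4(__m128i b, __m128i c, __m128i d)
{
    return _mm_or_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

/* Gather message word w of each block into one vector, so that lane k
   of words[w] holds word w of candidate k. */
void interleave_x4(unsigned char blocks[4][64], __m128i words[16])
{
    for (int w = 0; w < 16; w++) {
        uint32_t lane[4];
        for (int k = 0; k < 4; k++)
            memcpy(&lane[k], blocks[k] + 4 * w, 4);  /* little-endian on x86 */
        words[w] = _mm_set_epi32((int)lane[3], (int)lane[2],
                                 (int)lane[1], (int)lane[0]);
    }
}

/* First MD5 step for four candidates at once:
   result = b + rotl(a + F(b,c,d) + M[0] + 0xd76aa478, 7). */
__m128i md5_step0_x4(__m128i a, __m128i b, __m128i c, __m128i d, __m128i m0)
{
    __m128i t = _mm_add_epi32(a, md5_f_x4(b, c, d));
    t = _mm_add_epi32(t, m0);
    t = _mm_add_epi32(t, _mm_set1_epi32((int)0xd76aa478u));
    return _mm_add_epi32(b, rotl32x4(t, 7));
}
```

Note that this layout differs from one row per candidate: the words are transposed so each vector carries the same word position from all four candidates. A complete compress function would repeat the step pattern for all 64 steps with the other round functions and constants, then de-interleave the four state words per candidate before the memcmp() check.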
Anybody think they are up to it? Price estimate? Benchmark estimate? I am not committing to anything yet; I just want to check around to see how much this would cost. And if I am misunderstanding this completely and for some reason it is impossible to do this the way I have described, please educate me.