Like Veedoo, I'm not sure I understand the question, but I started playing around with the HanDeDict dictionary (1) and came up with a script to search it "rapidly". I don't know if this will be helpful for you.
I'm using memcached to load the data into memory and igbinary as the serializer to save space. First I downloaded the same dictionary you are using; in the first script I convert it to an array and split the rows into files of 5000 words each, so in the end I get about 30 files of roughly 500 KB each.
Memcached has a limit of 1 MB per item, which is why I decided to split them. Together with this array I created another one containing only the keys, used to perform the searches. This array (split like the previous one) is hashed: I used sha1() on each key (i.e. each Chinese word).
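The splitting and hashing described above could be sketched like this. It is only an illustration: `split_with_keys()` is a made-up helper name, the sample entries hold placeholder values, and the block size of 2 is chosen just to keep the example small (the scripts use 5000).

```php
<?php
# Sketch only: split_with_keys() is a hypothetical helper, not part of the scripts.
function split_with_keys(array $entries, $size)
{
    $blocks = array();
    foreach (array_chunk($entries, $size) as $chunk) {
        $keys = array();
        foreach ($chunk as $entry) {
            $words = array_keys($entry);  # each entry has exactly one key: the word
            $keys[] = sha1($words[0]);    # same position as the entry inside $chunk
        }
        $blocks[] = array('data' => $chunk, 'keys' => $keys);
    }
    return $blocks;
}

# Placeholder entries (made-up values, same shape as the entries built by the first script)
$entries = array(
    array('word-1' => array('py' => 'pinyin-1', 'de' => 'definition-1')),
    array('word-2' => array('py' => 'pinyin-2', 'de' => 'definition-2')),
    array('word-3' => array('py' => 'pinyin-3', 'de' => 'definition-3')),
);
$blocks = split_with_keys($entries, 2);
echo count($blocks), " blocks\n";
```

Because the key array and the data array of a block share the same positions, finding a hash at position N in the keys tells you that the entry is at position N in the data.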
In the second script I load all the key files into memory. The data files (dictionary arrays) are loaded on demand in the third script, but if you have enough memory you can load everything. In my opinion it is better to load on request, so you don't occupy too much memory.
Now: the first time a term is searched, the key arrays are read from memory and scanned until there is a match. When that happens, the script saves the result in memory as a single item, so the next time the same word is searched it will be output much faster.
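To make that flow easier to follow, here is a small sketch that runs without Memcached. The assumptions are mine: `find_term()` is a hypothetical helper, `$cache` is a plain array standing in for Memcached, and the data blocks hold placeholder strings instead of dictionary entries.

```php
<?php
# Sketch only: find_term() is a hypothetical helper and $cache is a plain array
# standing in for Memcached.
function find_term($term, array $keyBlocks, $loadBlock, array &$cache)
{
    $hash = sha1($term);
    if (isset($cache[$hash])) {
        return $cache[$hash];                        # saved result from an earlier search
    }
    foreach ($keyBlocks as $n => $keys) {
        $pos = array_search($hash, $keys, true);
        if ($pos !== false) {
            $block = call_user_func($loadBlock, $n); # data block loaded on demand
            $cache[$hash] = $block[$pos];            # save the single result
            return $cache[$hash];
        }
    }
    return null;                                     # no match in any block
}

# Placeholder blocks: two data blocks and their hashed key arrays
$data = array(
    array('entry for word-1', 'entry for word-2'),
    array('entry for word-3'),
);
$keyBlocks = array(
    array(sha1('word-1'), sha1('word-2')),
    array(sha1('word-3')),
);
$loads = 0; # counts how often a data block is actually loaded
$loadBlock = function ($n) use ($data, &$loads) {
    $loads++;
    return $data[$n];
};
$cache = array();
$first = find_term('word-3', $keyBlocks, $loadBlock, $cache);
$second = find_term('word-3', $keyBlocks, $loadBlock, $cache);
```

The second call returns the saved result, so the data block is loaded only once.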
Here are the scripts. First script (convert and split the dictionary):
<?php
$file = "dct.u8"; # handedict_nb.u8
$line = file($file);
$c = count($line);
$a = array(); # all data
$b = array(); # only keys
$size = 5000; # lines per block
$s = 1; # number of the next block file

for($i = 0; $i < $c; $i++)
{
    $py = '';
    $de = '';
    $word = '';

    # pinyin: text between the first pair of square brackets
    if(preg_match('/\[(.*?)\]/s',$line[$i],$r))
    {
        $py = trim($r[1]);
    }
    # definitions: everything between the first and the last slash
    if(preg_match('/\/(.*)\//s',$line[$i],$r))
    {
        $de = trim($r[1]);
    }
    # word: everything before the first opening bracket
    if(preg_match('/^(.*?)\[/s',$line[$i],$r))
    {
        $word = trim($r[1]);
    }

    if($word !== '' && $py !== '' && $de !== '')
    {
        $a[] = array($word => array('py' => $py,'de' => $de));
        $b[] = sha1($word); # hashing
    }

    # write a block every $size lines, and once more at the last line
    if(($i + 1) % $size == 0 || $i == $c - 1)
    {
        if(count($a) > 0)
        {
            $nm = sprintf('%02d', $s); # zero-padded: 01, 02, ... 10, 11
            file_put_contents('igdict_'. $nm .'.txt', igbinary_serialize($a), LOCK_EX);
            file_put_contents('keys_dict_'. $nm .'.txt', igbinary_serialize($b), LOCK_EX);
            $s++;
        }
        $a = array(); # start the next block with empty arrays
        $b = array();
    }
}
echo 'done';
?>
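As a side note, the three regular expressions could probably be folded into one. This is only a sketch: `parse_line()` is a hypothetical helper, it assumes every entry line has the `word [pinyin] /definition/` layout that the script above expects, and the sample line is made up rather than copied from the dictionary file.

```php
<?php
# Sketch only: parse_line() is a hypothetical helper, not part of the first script.
function parse_line($line)
{
    # word, then [pinyin], then /definitions/ up to the last slash
    if (!preg_match('/^([^\[]+)\[([^\]]*)\]\s*\/(.*)\/\s*$/u', $line, $r)) {
        return null; # comment lines and malformed lines do not match
    }
    return array(trim($r[1]) => array('py' => trim($r[2]), 'de' => trim($r[3])));
}

# Made-up sample line in the "word [pinyin] /definition/" layout
$entry = parse_line("世塵 世尘 [pinyin here] /definition one/definition two/\n");
print_r($entry);
```

A line that does not match returns null, so it can be skipped in the same way the script skips lines with an empty word, pinyin or definition.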
Second script (load in memory):
<?php
$m = new Memcached();
$m->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$m->addServer('127.0.0.1',11211);
$m->flush(); # clear memory
$files = glob('igdict_*.txt');
$key_files = glob('keys_dict_*.txt');
$c = count($files);
for($i = 0; $i < $c; $i++)
{
    # passing 0 as the third parameter of $m->add() means the item never expires;
    # it is removed only when the memory limit is reached
    # uncomment below to load all data in memory
    /*
    $m->add('dictionary'.$i, file_get_contents($files[$i]), 0);
    */
    # loading keys (the files are already igbinary-serialized, so they are stored as they are)
    $m->add('keys'.$i, file_get_contents($key_files[$i]), 0);
}
$m->add('count_files',$c,0); # number of blocks, read by the search script
echo 'done';
?>
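If you change the block size, it may be worth checking that a serialized block still fits into one item before loading it. A rough sketch, with my own assumptions: `fits_in_item()` is a hypothetical helper, and 1048576 bytes is the 1 MB limit mentioned earlier (a server can be configured differently).

```php
<?php
# Sketch only: fits_in_item() is a hypothetical helper. 1048576 bytes is the
# 1 MB limit mentioned earlier; the usable size is slightly smaller because
# Memcached keeps the key and some metadata in the same item.
function fits_in_item($payload, $limit = 1048576)
{
    return strlen($payload) < $limit;
}

$small = str_repeat('x', 500 * 1024);      # about the size of one block file
$large = str_repeat('x', 2 * 1024 * 1024); # too big for a single item
var_dump(fits_in_item($small));
var_dump(fits_in_item($large));
```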
Third script (search data):
<?php
$m = new Memcached();
$m->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$m->addServer('127.0.0.1',11211);
$string = '世塵 世尘'; # term to search
$hash = sha1($string);
$cached = $m->get($hash); # false if this term has not been saved yet
if($cached !== false)
{
    print_r(igbinary_unserialize($cached)); # saved result
}
else
{
    $files = glob('igdict_*.txt'); # same order as in the loader script
    $c = $m->get('count_files');
    for($i = 0; $i < $c; $i++)
    {
        # uncomment below to read all data
        # $a = igbinary_unserialize($m->get('dictionary'.$i));
        $b = igbinary_unserialize($m->get('keys'.$i));
        if(!is_array($b))
        {
            continue; # this key block is no longer in memory
        }
        $i2 = array_search($hash, $b, true); # position of the hash in this block, or false
        if($i2 !== false)
        {
            # comment below if in loader script you're loading all files
            $a = igbinary_unserialize(file_get_contents($files[$i])); # load on request
            $r = $a[$i2];
            print_r($r); # print result
            $m->add($hash, igbinary_serialize($r),300); # save to memory for 300 seconds
            break; # stop searching
        }
    }
}
?>
If you can't use igbinary, then use json_encode/json_decode or the normal serialize/unserialize shipped with PHP. Igbinary is faster than the others and saves more space.
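One way to keep the scripts independent of the serializer is to hide the choice behind two small wrappers. This is just a sketch: `pack_data()` and `unpack_data()` are names I made up, and the entry holds placeholder values.

```php
<?php
# Sketch only: pack_data() and unpack_data() are hypothetical wrapper names.
function pack_data($data)
{
    # use igbinary when the extension is loaded, PHP's own serializer otherwise
    return function_exists('igbinary_serialize')
        ? igbinary_serialize($data)
        : serialize($data);
}

function unpack_data($blob)
{
    return function_exists('igbinary_unserialize')
        ? igbinary_unserialize($blob)
        : unserialize($blob);
}

# Placeholder entry (made-up values)
$entry = array('word-1' => array('py' => 'pinyin-1', 'de' => 'definition-1'));
$copy = unpack_data(pack_data($entry));
var_dump($copy === $entry);
```

Files written with one serializer can't be read with the other, so the first script has to be run again if the extension is added or removed later.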
If you can't use Memcached, you can use MySQL: load the data into a table with the MEMORY engine and enable the Query Cache. This can speed things up, but the database will still be doing the work. With Memcached you can do some of the work without touching the database.
Note: it would be nice to store each hash key directly in memory, but Memcached has limits on memory usage, and when you reach those limits, older items are removed to make room for new ones.
Hope this is useful, bye :)
(1) http://www.handedict.de/chinesisch_deutsch.php?mode=dl