Hi,

I am currently working on a website project for learning Chinese. To help readers understand the texts better, I want to annotate the Chinese characters with popup translations from HanDeDict (a German-Chinese dictionary).

I have already downloaded the dictionary, but the file seems to be too big to work with properly.

To get an impression of what I mean, you can visit: http://www.dersinologe.de/hdd1py-working3 (just mouse over the Chinese text at the very end of the page and some tooltip dialogs will show).

Now I need your help: how can I annotate the whole text? Below you can see my PHP code so far. Is there a better way of doing it? The biggest problem seems to be working with the dictionary file. As the file is about 12 MB, I want to pre-annotate the text and then upload the annotated text, so the site won't take ages to load.

<?php
// $dict is assumed to be loaded elsewhere: $dict[word] = array('py' => ..., 'de' => ...)
$pos = 0;
$len = 6; // longest word length to try
$text='成人業余高中中国中成人玩具的人二千七百成人業余高中成人業余高中零一就像小炸弹,不二流子小心就会爆炸。 虽然他们的样子看起来很平和,但阿巴多是心里藏二流子着很大的攻击性。几十年二千七百零一前毛成人玩具泽东就利用了这一点去建立他心目中的中国,但也是因为这一点很多人在1949年至1977年间失去了生命。攻击性其实是一种很可怕的武器,没人控制得了。一旦失控就不容易停止,要等到人们的情绪慢慢平静下来 。';
echo '$text = <br>'.$text.'<br /><br />';
$str = '';
$textlen = mb_strlen($text, 'UTF-8');

while ($pos < $textlen) {
    $textpart = mb_substr($text, $pos, $len, 'UTF-8');

    if ($len >= 1 && isset($dict[$textpart]['de'])) {
        // longest match found: wrap it in the tooltip markup
        $textpart = '<span class="tttword">'.$textpart.'<span class="ttt">'.$textpart.' - '.$dict[$textpart]['py'].'<br>'.$dict[$textpart]['de'].'</span></span>';
        $str .= $textpart;
        $pos += $len;
        $len = 6;
    }
    elseif ($len >= 1) {
        // no match at this length, try a shorter substring
        $len--;
    }
    else {
        // no dictionary entry starts here: copy one character and move on
        $str .= mb_substr($text, $pos, 1, 'UTF-8');
        $pos++;
        $len = 6;
    }
}

echo $str;

?>
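Since you mention pre-annotating: a minimal sketch of that idea (assuming the loop above has already produced `$str`; the file name is just an example) would be to run the script once on your own machine, save the result, and let the live site serve only the finished HTML:

```php
<?php
// one-off, run locally: $str holds the annotated HTML produced by the loop above
file_put_contents('annotated_text.html', $str, LOCK_EX);

// on the live site, no dictionary lookup is needed at all:
readfile('annotated_text.html');
?>
```

This way the 12 MB dictionary never has to be loaded by a visitor's request.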

I hope you understand my problem and can point me in the right direction, maybe even give some useful code/links/people that could help me ;-)

Thanks,
Malaoshi

Just a suggestion; I'm not even sure I understand your question correctly.

Make a language directory, and in that directory put the language files for your pages. You can break them into as many files as you want.

Then take this

$text='成人業余高中中国中成人玩具的人二千七百成人業余高中成人業余高中零一就像小炸弹,不二流子小心就会爆炸。 虽然他们的样子看起来很平和,但阿巴多是心里藏二流子着很大的攻击性。几十年二千七百零一前毛成人玩具泽东就利用了这一点去建立他心目中的中国,但也是因为这一点很多人在1949年至1977年间失去了生命。攻击性其实是一种很可怕的武器,没人控制得了。一旦失控就不容易停止,要等到人们的情绪慢慢平静下来 。';

and assign it to a key that describes its use:

$lang['this_text'] = '成人業余高中中国中成人玩具的人二千七百成人業余高中成人業余高中零一就像小炸弹,不二流子小心就会爆炸。 虽然他们的样子看起来很平和,但阿巴多是心里藏二流子着很大的攻击性。几十年二千七百零一前毛成人玩具泽东就利用了这一点去建立他心目中的中国,但也是因为这一点很多人在1949年至1977年间失去了生命。攻击性其实是一种很可怕的武器,没人控制得了。一旦失控就不容易停止,要等到人们的情绪慢慢平静下来 。';

Save the above in your language file, then include the language file in any PHP document that needs it, and call it like this:

$text = $lang['this_text'];


By doing this, the page doesn't have to wait for the annotations.
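A minimal sketch of that setup (the file and key names are just examples):

```php
<?php
// lang/texts.php -- one language file per page or section
$lang = array();
$lang['this_text'] = '...'; # the full Chinese text goes here
```

```php
<?php
// any page that needs the text
include 'lang/texts.php';
$text = $lang['this_text'];
echo $text;
```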



Like veedeoo, I'm not sure of the question, but I started playing around with the HanDeDict dictionary (1) and came up with a script to search it quickly. I don't know if this can be helpful for you.

I'm using Memcached to load the data into memory, and igbinary as the serializer to save space. First I downloaded the same dictionary as you, and in the first script I converted it to an array and split all the rows into files of 5000 words each, so in the end I get about 30 files of roughly 500 KB each.

Memcached has a limit of 1 MB per item, which is why I decided to split them. Together with this array I created another one containing only the keys, used to perform the searches; this array (split like the previous one) is hashed: I used sha1() on each key (i.e. each Chinese word).

In the second script I load all the key files into memory. The data files (the dictionary arrays) are loaded on demand in the third script, but if you have enough memory you can load everything. In my opinion it's better to load on request, so you don't occupy too much memory.

Now: the first time a term is searched, the key arrays are queried from memory until there's a match. When that happens, the script saves that single result in memory, so the next time the same word is searched it will be returned much faster than before.

Here are the scripts:

<?php
$file = "dct.u8"; # handedict_nb.u8
$lines = file($file);
$c = count($lines);
$a = array(); # all data
$b = array(); # only keys (hashed)

$size  = 5000; # entries per chunk
$chunk = 1;    # chunk counter, used for the file names

for($i = 0; $i < $c; $i++)
{
	$py = $de = $word = '';

	if(preg_match('/\[(.*)\]/s', $lines[$i], $r))
	{
		$py = trim($r[1]);
	}

	if(preg_match('/\/(.*)\//s', $lines[$i], $r))
	{
		$de = trim($r[1]);
	}

	if(preg_match('/^(.*)\[/s', $lines[$i], $r))
	{
		$word = trim($r[1]);
	}

	if($word != '' && $py != '' && $de != '')
	{
		$a[] = array($word => array('py' => $py, 'de' => $de));
		$b[] = sha1($word); # hashed key, same index as in $a
	}

	# flush a chunk when it's full, or at the end of the file
	if(count($a) == $size || ($i == $c - 1 && count($a) > 0))
	{
		$nm = sprintf('%02d', $chunk); # 01, 02, 03 ...
		file_put_contents('igdict_'. $nm .'.txt', igbinary_serialize($a), LOCK_EX);
		file_put_contents('keys_dict_'. $nm .'.txt', igbinary_serialize($b), LOCK_EX);
		$a = array();
		$b = array();
		$chunk++;
	}
}

echo 'done';

?>

Second script (load in memory):

<?php
$m = new Memcached();
$m->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$m->addServer('127.0.0.1', 11211);

$m->flush(); # clear memory

$files = glob('igdict_*.txt');
$key_files = glob('keys_dict_*.txt');
$c = count($files);

for($i = 0; $i < $c; $i++)
{
	# setting 0 as the third parameter of $m->add means the item never times out;
	# it will be deleted only if the memory limit is reached

	# the files already contain igbinary-serialized data, so they can be
	# stored as-is; uncomment below to load all data in memory
	/*
	$m->add('dictionary'.$i, file_get_contents($files[$i]), 0);
	*/

	# loading keys
	$m->add('keys'.$i, file_get_contents($key_files[$i]), 0);
}

$m->add('count_files', $c, 0); # only needs to be stored once

echo 'done';
?>

Third script (search data):

<?php
$m = new Memcached();
$m->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$m->addServer('127.0.0.1', 11211);

$string = '世塵 世尘'; # term to search
$hash = sha1($string);

$cached = $m->get($hash); # Memcached::get returns false on a miss
if($cached !== false)
{
	print_r(igbinary_unserialize($cached)); # saved result
}
else
{
	$c = $m->get('count_files');
	for($i = 0; $i < $c; $i++)
	{
		# uncomment below to read all data from memory
		# $a = igbinary_unserialize($m->get('dictionary'.$i));
		$b = igbinary_unserialize($m->get('keys'.$i));
		$b1 = count($b);

		for($i2 = 0; $i2 < $b1; $i2++)
		{
			if($hash == $b[$i2])
			{
				# comment below if the loader script loads all files into memory;
				# note: the data files are numbered from 01, hence $i + 1
				$a = igbinary_unserialize(file_get_contents(sprintf('igdict_%02d.txt', $i + 1))); # load on request
				$r = $a[$i2];
				print_r($r); # print result
				$m->add($hash, igbinary_serialize($r), 300); # cache this single result for 300 seconds
				break 2; # stop both loops
			}
		}
	}
}

?>

If you can't use igbinary, then use json_encode()/json_decode() or the normal serialize()/unserialize() shipped with PHP. igbinary is faster than the others and saves more space.
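For example, a pair of wrapper functions (just a sketch; the function names are made up) would let you swap serializers without touching the rest of the scripts:

```php
<?php
// use igbinary when the extension is available, otherwise fall back to
// PHP's built-in serializer; the rest of the scripts call these wrappers
function dict_serialize($data)
{
	return function_exists('igbinary_serialize')
		? igbinary_serialize($data)
		: serialize($data);
}

function dict_unserialize($blob)
{
	return function_exists('igbinary_unserialize')
		? igbinary_unserialize($blob)
		: unserialize($blob);
}
```

One caveat: files written with one serializer must be read back with the same one, so don't mix them after the chunk files have been generated.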

If you can't use Memcached, you can use MySQL: load the data into a table with the MEMORY engine and enable the query cache. This can speed things up, but the database still has to do the work on every lookup; with Memcached you avoid touching the database at all.
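A rough sketch of that alternative (the table and column names, and the connection details, are just examples; note the MEMORY engine doesn't support TEXT columns, so VARCHAR is used):

```php
<?php
// assumes a table created and filled beforehand, e.g.:
//   CREATE TABLE handedict (
//     word VARCHAR(50) PRIMARY KEY,
//     py   VARCHAR(255),
//     de   VARCHAR(500)
//   ) ENGINE=MEMORY;
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8', 'user', 'password');

$stmt = $pdo->prepare('SELECT py, de FROM handedict WHERE word = ?');
$stmt->execute(array('世塵 世尘'));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if($row !== false)
{
	echo $row['py'] . ' - ' . $row['de'];
}
```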

Note: it would be nicer to store every hashed key directly in memory as its own item, but Memcached has limits on memory usage; when you reach those limits, older items are overwritten by new ones.

Hope this is useful, bye :)


(1) http://www.handedict.de/chinesisch_deutsch.php?mode=dl

Hi veedeoo and cereal,

thanks for the help. I will look into your solutions and see if I can make them work in my project. For now I've started programming the same thing in C#, but again the file size seems to be the problem.

I will try your way, cereal, but as I am not as advanced in PHP or programming in general, I guess this will take me some time ;-)

Once again, thanks for your fast answers :-)

OK, if you want to try my solution, just install Memcached (be aware: Memcache is a different product) and igbinary:

- http://memcached.org/
- https://github.com/phadej/igbinary

Then configure the Memcached daemon and run those scripts. The first is needed only to create the files used by the second script. The second script can be run once a day just to make sure the keys are still in memory, and the third script is what you use to find data. Anyway, mine was just a test; good luck! :)
