Hi,

I've never used PHP before so I don't even know if I'm posting in the right place, but I'm wondering if it's possible to scrape information from a dictionary-like Web page. I want to take a list of words and retrieve a piece of information about each word, but the problem is that it's very tedious to type in every word, press Enter, wait for the page to load, and then copy the information I need from the next page into another file. Can this process be automated somehow? Would it be difficult to do?

Thanks for your help.

-tm

[Edit: If it wasn't clear, the big problem is that the Web page only allows me to enter one term at a time, of course.]

Without access to the website's database I'm thinking this would be very difficult to do. What you are essentially talking about may have legal ramifications also.

If you are talking about retrieving data from a dictionary web page where you need to enter one word at a time into a box, and storing each word and its meaning in a database, then that should be really simple. If the keyword is shown in the URL bar after you search, you can use the file_get_contents() function; if the form uses method=post, you will need to use cURL instead. Just post the web page you would like this to be done on and I shall write you the script.
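To illustrate the difference between the two cases, here is a minimal sketch. The URL, the field name `word`, and the helper names `build_search_url()` and `fetch_post()` are placeholders made up for this example, not taken from any real site.

```php
<?php
// Hypothetical helpers; the URL and field names below are placeholders.

// method=get: the search term travels in the query string, so the page
// can be fetched with file_get_contents() once the URL is built.
function build_search_url($base, array $fields) {
    return $base.'?'.http_build_query($fields);
}

// method=post: the fields travel in the request body, so cURL is used.
function fetch_post($url, array $fields) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    return curl_exec($ch); // page HTML, or false on failure
}

$url = build_search_url('http://example.com/search', array('word' => 'saat'));
// $html = file_get_contents($url);                                          // method=get
// $html = fetch_post('http://example.com/search', array('word' => 'saat')); // method=post
```

Either way you end up with the page's HTML in a string, and the rest of the work is picking the wanted pieces out of it.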

Awesome, thanks cwarn, that's exactly what I want to do. I'm using the site http://langtolang.com/, trying to go from English into Turkish. I'm hoping to make flash cards to study words. Oftentimes a term will give multiple results, in which case I'd like to be able to put two or three of the results on one side of the flash card. For example, searching for "saat" returns "time," "hour," "watch," "o'clock," etc., so I'd want to store up to the first three results of every search.

I have just completed the first version of the script. The main script is as follows:

<?php
// Looks up every word in wordlist.txt on langtolang.com and stores the
// translations in the `translations` table.
set_time_limit(0); // each word takes minutes, so lift the execution time limit
include('db.php');
$db = mysqli_connect($dbhost,$accountname,$password,$database)
    or die('Could not connect to MySQL server: '.mysqli_connect_error());

// One word per line; drop line endings (including Windows "\r") and blank lines.
$lines = file('wordlist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($lines === false) {
    die('Could not read wordlist.txt');
}
$words = array_filter(array_map('trim', $lines), 'strlen');

// Target languages. Each name is used both as the selectTo value on the
// site and as a column name in the `translations` table.
$languages = array(
    'albanian','arabic','breton','catalan','chinese_simplified','chinese_traditional',
    'corsican','czech','danish','dutch','esperanto','estonian','finnish','french',
    'gaelic','georgian','german','greek','hebrew','hungarian','icelandic','indonesian',
    'italian','japanese','korean','kurdish','latvian','lithuanian','malagasy','norwegian',
    'polish','portuguese_brazil','portuguese_portugal','romanian','russian','serbo_croat',
    'slovak','slovenian','spanish','swahili','swedish','turkish','vietnamese','yiddish',
    'walloon','welsh'
);

// Downloads the result page for $dword translated into $langto and appends
// the translations found there to $result.
function generate($langto,$dword,$result) {
    $data=file_get_contents('http://langtolang.com/?selectFrom=english&selectTo='.$langto.'&txtLang='.urlencode($dword).'&submitButton=Search');
    if ($data===false) {
        return $result; // download failed; leave this language empty
    }
    // Keep only the part of the page before the "title" row, then collect the table cells.
    $data=preg_replace('/(.*)\<tr class=(\"|\'|)title(\"|\'|)\>(.*)\<td([^\>]+)?\>[^\<]+\<\/td\>(.*)\<td([^\>]+)?\>[^\<]+\<\/td\>(.*)/is',"$1",$data);
    preg_match_all('/\<td width\=\"40%\"\>[^\<]+\<\/td\>/is',$data,$matches);
    $dtranslations=preg_replace('/\<td width\=\"40%\"\>([^\<]+)\<\/td\>/i',"$1",$matches[0]);
    // Cells come in pairs: even index = source word, odd index = translation.
    for ($a=0, $b=1; isset($dtranslations[$b]); $a+=2, $b+=2) {
        $c=$a/2;
        if (count($result['english'])<=$c) {
            $result['english'][]=$dword;
        }
        $result[$langto][]=$dtranslations[$b];
    }
    return $result;
}

// The padding pushes the first output past the browser's buffer so progress shows up immediately.
echo '<head><title>Word Indexer</title></head><body>'.str_repeat(" ", 256).'<br>'; flush();
foreach ($words as $word) {
    // Skip words that are already in the database.
    $check = mysqli_query($db,'SELECT * FROM `translations` WHERE `english`="'.mysqli_real_escape_string($db,$word).'"') or die(mysqli_error($db));
    if (mysqli_num_rows($check)==0) {
        $time_start = microtime(true);
        $translation = array('english'=>array());
        foreach ($languages as $language) {
            $translation = generate($language,$word,$translation);
        }
        // One row per result position; languages with fewer results get an empty string.
        for ($i=0;isset($translation['english'][$i]);$i++) {
            $fields = array('`english`="'.mysqli_real_escape_string($db,$translation['english'][$i]).'"');
            foreach ($languages as $language) {
                $value = isset($translation[$language][$i]) ? $translation[$language][$i] : '';
                $fields[] = '`'.$language.'`="'.mysqli_real_escape_string($db,$value).'"';
            }
            mysqli_query($db,'INSERT INTO `translations` SET '.implode(', ',$fields)) or die(mysqli_error($db));
        }
        $time = microtime(true) - $time_start;
        echo 'The word \''.htmlspecialchars($word).'\' took '.round($time).' seconds to append to the database.<br>';
        flush();
    }
}
?>

Then in db.php place the following code and set the variables to match your MySQL database:

<?php
//configure below variables to your mysql database.
$dbhost='localhost';
$accountname='root';
$password='';
$database='database_name';
?>
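The main script also expects a `translations` table with an `english` column plus one column per language. The post does not show how that table was created, so the following is only a sketch under assumptions: the helper name `create_table_sql()`, the VARCHAR(255) column type, and the utf8mb4 character set are my guesses, not part of the original setup.

```php
<?php
// Hypothetical helper: builds a CREATE TABLE statement with one text column
// per language. Column types and charset are assumptions, not from the post.
function create_table_sql(array $languages) {
    $columns = array('`english` VARCHAR(255) NOT NULL');
    foreach ($languages as $language) {
        $columns[] = '`'.$language.'` VARCHAR(255) NOT NULL DEFAULT \'\'';
    }
    return 'CREATE TABLE IF NOT EXISTS `translations` ('
        .implode(', ', $columns).') DEFAULT CHARSET=utf8mb4';
}

// Shortened list for illustration; pass every language name the main
// script uses (albanian through welsh) to get the full table.
echo create_table_sql(array('albanian', 'arabic', 'turkish'));
```

Run the printed statement once in your MySQL client (or phpMyAdmin) before starting the main script.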

In the same directory/folder as the main script and db.php, place a file named wordlist.txt containing all the words to scan and index, one word per line. If you like, you can download all of this as a zip file attached to this post.
Note: Each word needs one page download per language, 46 pages in total, so it takes 2 to 3 minutes to translate a word into all the languages. That works out to an average of roughly 3 to 4 seconds per downloaded page.
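For the flash cards, the script stores each result for a word as its own row, so the rows for one word can be read back and cut down to the first three. A minimal sketch of that last step, assuming the Turkish (or English) values for one word have already been fetched into an array; `card_side()` is a name invented for this example:

```php
<?php
// Hypothetical helper: joins at most $limit distinct, non-empty translations
// into the text for one side of a flash card.
function card_side(array $translations, $limit = 3) {
    $clean = array_filter(array_map('trim', $translations), 'strlen');
    $unique = array_values(array_unique($clean));
    return implode(', ', array_slice($unique, 0, $limit));
}

// The "saat" example from earlier in the thread:
echo card_side(array('time', 'hour', 'watch', "o'clock")); // prints: time, hour, watch
```

Raising or lowering the second argument changes how many results end up on the card.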

Comments
very thorough solution to my problem

wow, thanks for all of your help. i'm going to read up on php so i can figure out how to implement this stuff.
