Hi all

I'm making a program (for my own use only) that contains all the available information about atoms. I want to collect the information from Wikipedia, but copying and pasting it by hand would take forever!

For example, here's the Wikipedia page for hydrogen (sorry, it's in Finnish, because that's the page I want to collect the data from). The long table on the right side holds the chemical information for that atom. I want to turn that data into key-value pairs. For example, the first row contains "Nimi: Vety" (Name: Hydrogen) and the second one "Tunnus: H" (Symbol: H). I want to add them to a PHP array like this:

<?php
$info = array(array("Nimi", "Vety"), array("Tunnus", "H"));

Did you understand?

Well, that's what I'm doing. I hope you understand what I'm trying to say, because my English isn't so good - yet.



Did you understand?

............English isn't so good - yet.

Sorry, I still have no idea what you want to do!

I want to parse the table from the Wikipedia page and put it in a PHP array.

For example, the table contains "Nimi: Vety" and "Tunnus: H". I want to put them in an array like this:

<?php
$info = array(array("Nimi", "Vety"), array("Tunnus", "H"));

It will probably be easier if you paste the info into a spreadsheet, save it as a CSV file then read it into your PHP program and create your array. This may still take quite a bit of manual work if all of the info isn't in the same order / format. I don't think there is any solution for that unless you find a single source that has all of the data you need in a consistent format.
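To illustrate the CSV route, here is a minimal sketch. It assumes the exported file has one "key,value" pair per line and no quoted fields that span several lines; `csvToPairs` is a hypothetical helper name, not something from this thread.

```php
<?php
// Hypothetical helper: turns CSV text with one "key,value" pair per line
// into the nested array shape used earlier in this thread.
function csvToPairs(string $csv): array
{
    $pairs = [];
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        if ($line === '') {
            continue; // skip blank lines
        }
        // Delimiter, enclosure and escape are passed explicitly.
        $fields = str_getcsv($line, ',', '"', '\\');
        // A row without a value gets an empty string instead of failing.
        $pairs[] = [trim($fields[0] ?? ''), trim($fields[1] ?? '')];
    }
    return $pairs;
}

$info = csvToPairs("Nimi,Vety\nTunnus,H\n");
print_r($info);
```

For a real file you would pass `file_get_contents('elements.csv')` (file name assumed) instead of the inline string.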


Agree with chris, but as a chemist myself, I think I understand what you're trying to do - did the same myself once.

You're scraping data from Wikipedia to build a datatable?

If you drill down into the view source, you'll see this:

<td colspan="2" align="center"><b>Yleistä</b></td>
</tr>
<tr>
<td>Nimi</td>

If this markup appears on every element page, and only once per page, you can use it as a marker for where to start scraping.

I'm assuming you're using curl or file_get_contents() to grab the page.

You can use the preg_* functions to extract each value with a search pattern, or the XML/DOM functions. I haven't used it myself, but XPath should work too (I think).
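A rough sketch of the XPath route, in case it helps. It assumes the dom extension is enabled and that every data row has exactly two `<td>` cells, as in the markup quoted above; `tableToPairs` is a made-up name, and section header rows (one cell with colspan) are simply skipped.

```php
<?php
// Sketch only: collects [label, value] pairs from every two-cell table row.
function tableToPairs(string $html): array
{
    libxml_use_internal_errors(true); // collect markup warnings instead of printing them
    $doc = new DOMDocument();
    // The prefix is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pairs = [];
    foreach ($xpath->query('//tr[count(td) = 2]') as $row) {
        $cells = $xpath->query('td', $row);
        // textContent drops nested tags such as <a> and <sup>.
        $pairs[] = [
            trim($cells->item(0)->textContent),
            trim($cells->item(1)->textContent),
        ];
    }
    return $pairs;
}

$html = '<table><tr><td colspan="2" align="center"><b>Yleistä</b></td></tr>'
      . '<tr><td>Nimi</td><td>Vety</td></tr>'
      . '<tr><td>Tunnus</td><td>H</td></tr></table>';

print_r(tableToPairs($html));
```

Note that this query matches two-cell rows in any table on the page, so on a full Wikipedia page you would first narrow the input down to the one table you care about.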

Yep, thanks ardav. I've already fetched the page with file_get_contents() and parsed it reasonably well with str_replace() and preg_replace(). I actually have all the data in an array of keys and values, and now I need to build a MySQL query to insert the data into a database.

I think I'll need multiple if-else conditions to get it to work. If a page doesn't contain all the information and I just put the values from the array into a query, the data ends up in the wrong columns.

Should I initialize the array before I put data into it? The query would look something like this:

INSERT INTO ... (nimi, tunnus, järjestysluku) VALUES($array[0], $array[1], $array[2])

But if there were no "tunnus" row on the page, $array[1] would contain the "järjestysluku" value, and I'd be inserting "järjestysluku" into the "tunnus" column. That isn't what I want.


Make sure the array always has the right number of items in the right order. Multiple ifs that supply a default value when the data is absent from the page may be the way to go.
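One way to get the same effect without a chain of ifs is to key the values by label and fill in defaults in one pass. This is only a sketch: `alignToColumns` and the `$labelToColumn` map are hypothetical, the column names are taken from the query above, and NULL is assumed to be an acceptable default.

```php
<?php
// Page label => database column, in the order the INSERT expects.
$labelToColumn = [
    'Nimi'          => 'nimi',
    'Tunnus'        => 'tunnus',
    'Järjestysluku' => 'järjestysluku',
];

// Hypothetical helper: maps scraped [label, value] pairs onto the fixed
// column list, using $default wherever the page had no such row.
function alignToColumns(array $pairs, array $labelToColumn, $default = null): array
{
    // Start with every column set to the default...
    $row = array_fill_keys(array_values($labelToColumn), $default);
    // ...then overwrite only the columns the page actually provided.
    foreach ($pairs as [$label, $value]) {
        if (isset($labelToColumn[$label])) {
            $row[$labelToColumn[$label]] = $value;
        }
    }
    return $row;
}

// A page where the "Tunnus" row is missing:
$pairs = [['Nimi', 'Vety'], ['Järjestysluku', '1']];
$row = alignToColumns($pairs, $labelToColumn);

// One placeholder per column, so the values can be bound to a prepared
// statement instead of being pasted into the SQL string.
$params = array_values($row);
$valuesClause = 'VALUES (' . implode(', ', array_fill(0, count($row), '?')) . ')';

var_dump($params, $valuesClause);
```

Because every column always gets a slot, a missing "tunnus" stays NULL instead of shifting "järjestysluku" into its place.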

Personally, I don't like scraping; it's too fragile.

Have you looked at the MediaWiki API, which gives live access to the wiki databases?

http://www.mediawiki.org/wiki/API:Main_page

I don't know if it will meet your requirements, but it's worth a look.
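As a starting point, a request URL for the API's parse action could be built like this. It is a sketch under assumptions: the endpoint is taken to be the Finnish Wikipedia's `api.php`, `buildParseUrl` is a hypothetical helper, and nothing is actually sent here.

```php
<?php
// Hypothetical helper: builds a MediaWiki "parse" request URL for one page.
// Assumes the Finnish Wikipedia endpoint; no request is made.
function buildParseUrl(string $title): string
{
    $query = http_build_query([
        'action' => 'parse', // render a page
        'page'   => $title,  // page title, e.g. "Vety"
        'prop'   => 'text',  // ask for the rendered HTML
        'format' => 'json',
    ], '', '&');
    return 'https://fi.wikipedia.org/w/api.php?' . $query;
}

echo buildParseUrl('Vety'), "\n";
```

The result can be fetched with file_get_contents() or curl and decoded with json_decode(); check the API documentation for the exact shape of the response before relying on it.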

Thanks for the replies.

Now I have a parser that almost works. Next I need to extract the right part from the full source. The Wikipedia page for hydrogen contains a section like this:

<tr bgcolor="#E7E7E7">
<td colspan="2" align="center"><b>Yleistä</b></td>
</tr>
<tr>
<td>Nimi</td>
<td>Vety</td>
</tr>
<tr>
<td>Tunnus</td>
<td>H</td>
</tr>
<tr>
<td><a href="/wiki/J%C3%A4rjestysluku_(kemia)" title="Järjestysluku (kemia)">Järjestysluku</a></td>
<td>1</td>
</tr>
<tr>
<td><a href="/wiki/Luokka_(jaksollinen_j%C3%A4rjestelm%C3%A4)" title="Luokka (jaksollinen järjestelmä)" class="mw-redirect">Luokka</a></td>
<td><a href="/wiki/Ep%C3%A4metalli" title="Epämetalli">Epämetalli</a></td>
</tr>
<tr>
<td><a href="/wiki/Lohko_(jaksollinen_j%C3%A4rjestelm%C3%A4)" title="Lohko (jaksollinen järjestelmä)">Lohko</a></td>
<td>s-lohko</td>
</tr>
<tr>
<td><a href="/wiki/Ryhm%C3%A4_(jaksollinen_j%C3%A4rjestelm%C3%A4)" title="Ryhmä (jaksollinen järjestelmä)">Ryhmä</a></td>
<td>Luokitellaan yleensä kuuluvaksi <a href="/wiki/Alkalimetalli" title="Alkalimetalli">alkalimetallien</a> kanssa samaan ryhmään</td>
</tr>
<tr>
<td><a href="/wiki/Jakso_(jaksollinen_j%C3%A4rjestelm%C3%A4)" title="Jakso (jaksollinen järjestelmä)">Jakso</a></td>
<td>1</td>
</tr>
<tr>
<td><a href="/wiki/Tiheys" title="Tiheys">Tiheys</a></td>
<td>0,0000899×10<sup>3</sup> kg/m<sup>3</sup></td>
</tr>
<tr>
<td><a href="/wiki/V%C3%A4ri" title="Väri">Väri</a></td>
<td>Väritön</td>
</tr>
<tr>
<td>Löytövuosi, löytäjä</td>
<td><a href="/wiki/1766" title="1766">1766</a>, <a href="/wiki/Henry_Cavendish" title="Henry Cavendish">Henry Cavendish</a></td>
</tr>
<tr bgcolor="#E7E7E7">
<td colspan="2" align="center"><b>Atomiominaisuudet</b></td>
</tr>
<tr>
<td><a href="/wiki/Atomipaino" title="Atomipaino" class="mw-redirect">Atomipaino</a></td>
<td>1,00794 <a href="/wiki/Amu" title="Amu" class="mw-redirect">amu</a></td>
</tr>
<tr>
<td><a href="/wiki/Atomis%C3%A4de" title="Atomisäde">Atomisäde</a>, mitattu (laskennallinen)</td>
<td>25 (53) <a href="/wiki/Pikometri" title="Pikometri" class="mw-redirect">pm</a></td>
</tr>
<tr>
<td><a href="/wiki/Kovalenttis%C3%A4de" title="Kovalenttisäde" class="mw-redirect">Kovalenttisäde</a></td>
<td>37 pm</td>
</tr>
<tr>
<td><a href="/wiki/Van_der_Waalsin_s%C3%A4de" title="Van der Waalsin säde">Van der Waalsin säde</a></td>
<td>120 pm</td>
</tr>
<tr>
<td><a href="/wiki/Atomiorbitaali" title="Atomiorbitaali">Orbitaalirakenne</a></td>
<td>1s<sup>1</sup></td>
</tr>
<tr>
<td><a href="/wiki/Elektroni" title="Elektroni">Elektroneja</a> <a href="/wiki/Elektronikuori" title="Elektronikuori">elektronikuorilla</a></td>
<td>1</td>
</tr>
<tr>
<td><a href="/wiki/Hapetusluku" title="Hapetusluku">Hapetusluvut</a></td>
<td><b>+1</b>, -1</td>
</tr>
<tr>
<td><a href="/wiki/Kiderakenne" title="Kiderakenne">Kiderakenne</a></td>
<td>Heksagonaalinen</td>
</tr>
<tr bgcolor="#E7E7E7">
<td colspan="2" align="center"><b>Fysikaaliset ominaisuudet</b></td>
</tr>
<tr>
<td><a href="/wiki/Olomuoto" title="Olomuoto">Olomuoto</a></td>
<td>Kaasu</td>
</tr>
<tr>
<td><a href="/wiki/Sulamispiste" title="Sulamispiste">Sulamispiste</a></td>
<td>14,01&#160;<a href="/wiki/Kelvin" title="Kelvin">K</a> (-259,14&#160;<a href="/wiki/Celsiusaste" title="Celsiusaste">°C</a>)</td>
</tr>
<tr>
<td><a href="/wiki/Kiehumispiste" title="Kiehumispiste">Kiehumispiste</a></td>
<td>20,28&#160;K (-252,87&#160;°C)</td>
</tr>
<tr>
<td><a href="/wiki/Moolitilavuus" title="Moolitilavuus">Moolitilavuus</a></td>
<td>11,42×10<sup>−6</sup> m<sup>3</sup>/mol</td>
</tr>
<tr>
<td><a href="/wiki/H%C3%B6yrystymisl%C3%A4mp%C3%B6" title="Höyrystymislämpö">Höyrystymislämpö</a></td>
<td>0,904 kJ/mol</td>
</tr>
<tr>
<td><a href="/wiki/Sulamisl%C3%A4mp%C3%B6" title="Sulamislämpö">Sulamislämpö</a></td>
<td>0,117 kJ/mol</td>
</tr>
<tr>
<td><a href="/wiki/H%C3%B6yrynpaine" title="Höyrynpaine">Höyrynpaine</a></td>
<td>209 <a href="/wiki/Pascal_(yksikk%C3%B6)" title="Pascal (yksikkö)">Pa</a> 23 K:ssa</td>
</tr>
<tr>
<td><a href="/wiki/%C3%84%C3%A4nen_nopeus" title="Äänen nopeus">Äänen nopeus</a></td>
<td>1310 <a href="/wiki/Metri_sekunnissa" title="Metri sekunnissa">m/s</a> 300 K:ssa</td>
</tr>
<tr bgcolor="#E7E7E7">
<td colspan="2" align="center"><b>Muuta</b></td>
</tr>
<tr>
<td><a href="/wiki/Elektronegatiivisuus" title="Elektronegatiivisuus">Elektronegatiivisuus</a></td>
<td>2,1 (<a href="/wiki/Paulingin_asteikko" title="Paulingin asteikko" class="mw-redirect">Paulingin asteikko</a>)</td>
</tr>
<tr>
<td><a href="/wiki/Ominaisl%C3%A4mp%C3%B6kapasiteetti" title="Ominaislämpökapasiteetti">Ominaislämpökapasiteetti</a></td>
<td>14,304 (H<sub>2</sub>) kJ/kg&#160;K</td>
</tr>
<tr>
<td><a href="/wiki/L%C3%A4mm%C3%B6njohtavuus" title="Lämmönjohtavuus">Lämmönjohtavuus</a></td>
<td>0,1805 <a href="/wiki/Watti_kelvinmetri%C3%A4_kohti" title="Watti kelvinmetriä kohti" class="mw-redirect">W/(m×K)</a></td>
</tr>
<tr>
<td><a href="/wiki/CAS-numero" title="CAS-numero">CAS-numero</a></td>
<td>1333-74-0</td>
</tr>
<tr bgcolor="#E7E7E7">
<td colspan="2" align="center"><small>Tiedot <a href="/wiki/NTP" title="NTP">normaalipaineessa</a></small></td>
</tr>
</table>

This is the part I want to extract from the source. The rest of the page is markup I don't need; if you look at the source of the Wikipedia page, you'll find this block in it.

How can I pull it out? file_get_contents() gives me the full source, and now I need some kind of regular expression that returns only that table.
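For what it's worth, this step may not need a regular expression at all. Here is a sketch that uses plain string functions instead: it looks for a marker that only occurs inside the wanted table (the `<b>Yleistä</b>` header cell) and cuts from the nearest `<table` before it to the first `</table>` after it. `extractTableAround` is a made-up name, and the approach assumes the table contains no nested tables.

```php
<?php
// Hypothetical helper: returns the <table>...</table> block that contains
// $marker, or null if it cannot be found. Assumes no nested tables.
function extractTableAround(string $source, string $marker): ?string
{
    $markerPos = strpos($source, $marker);
    if ($markerPos === false) {
        return null;
    }
    // Last "<table" before the marker: search only the text preceding it.
    $start = strripos(substr($source, 0, $markerPos), '<table');
    // First "</table>" after the marker.
    $end = stripos($source, '</table>', $markerPos);
    if ($start === false || $end === false) {
        return null;
    }
    return substr($source, $start, $end + strlen('</table>') - $start);
}

$page = '<p>intro</p><table class="x"><tr><td><b>Yleistä</b></td></tr>'
      . '<tr><td>Nimi</td><td>Vety</td></tr></table><p>rest</p>';

echo extractTableAround($page, '<b>Yleistä</b>'), "\n";
```

With the real page you would pass the string returned by file_get_contents() as `$source`, then hand the extracted table to your existing parser.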
