I have a table called volume_issue that looks like this:

 +----+-------------------------------------------------------+-----------------+-----------+---------------------+
| id | url                                                   | journal_title   | issn      | volume_issue        |
+----+-------------------------------------------------------+-----------------+-----------+---------------------+
|  1 | https://www.ajol.info/index.php/asan/issue/view/17048 | Africa Sanguine | 1560-8646 | Vol 19, No 2 (2018) |
|  2 | https://www.ajol.info/index.php/asan/issue/view/16693 | Africa Sanguine | 1560-8646 | Vol 19, No 1 (2017) |
+----+-------------------------------------------------------+-----------------+-----------+---------------------+

I am trying to extract all records associated with the urls listed in the variable url and insert them in a table known as citations. The code i am using to do this lookw like this:

<?php
$query1 = "SELECT * FROM volume_issue where id between 1 and 2";
$sql = $con->prepare($query1);
$sql->execute();
$sql->SetFetchMode(PDO::FETCH_ASSOC);
while ($row = $sql->fetch()) {
    $volume_issue_id = $row['id'];
    $url = $row['url'];
    $volume_issue = $row['volume_issue'];
    $journal_title = $row['journal_title'];
    $html2 = file_get_html($url);
    $html2 = file_get_html($url);

    foreach ($html2->find('table[class="tocArticle"]') as $div2) {
        $DOM = new DOMDocument();
        $DOM->loadHTML($div2);
        $Detail = $DOM->getElementsByTagName('td');

        $i = 0;
        $j = 0;
        foreach ($Detail as $sNodeDetail) {
            $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
            $j = $i % count($aDataTableDetailHTML[$j]) == 0 ? $j + 1 : $j;
        }
    }
}

$result = [];
foreach ($aDataTableDetailHTML as $key => $value) {
    foreach ($value as $key1 => $value1) {
        echo '<pre>';
        $result[$key1][] = $value1;
    }
}
array_walk($result, function(&$item) {
    $item = implode('|', $item);
});

foreach ($result as $ref) {
    $newref[] = explode('|', $ref);
}

print_r($newref);

foreach ($newref as $key => $values) {
    $title = $values[0];

    $authors = str_replace(',            ',', ',$values[2]);

    $pages = $values[3];
    $doi = $values[4];
    $query2 = "INSERT INTO citations (title,authors,journal_title,volume_issue,pages,doi) VALUES (:title,:authors,:journal_title,:volume_issue,:pages,:doi)";
    $sql2 = $con->prepare($query2);
    $sql2->execute(array(
        ':title' => $title,
        ':authors' => $authors,
        ':journal_title' => $journal_title,
        ':volume_issue' => $volume_issue,
        ':pages' => $pages,
        ':doi' => $doi,
    ));
}

My problem is that this code is mot inserting journal title and volume_issue correctly as desired. How can i make this code insert the right volume issue for the right citation? I am on my wits end.

Recommended Answers

All 10 Replies

How i wish Daniweb had an editing tool to allow editing of questions.

What if i turned line 7 , 9 and 10 into arrays and combined them to form a multidimensional arary? e.g.,

    $volume_issue_id[] = $row['id'];
    $volume_issue[] = $row['volume_issue'];
    $journal_title[] = $row['journal_title'];

    $citations = array($volume_issue_id,$volume_issue,journal_title);

This creates a multidimensional array named citations that looks like this:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 2
        )

    [1] => Array`Inline Code Example Here`
        (
            [0] => Vol 19, No 2 (2018)
            [1] => Vol 19, No 1 (2017)
        )

    [2] => Array
        (
            [0] => Africa Sanguine
            [1] => Africa Sanguine
        )

    [3] => Array
        (
            [0] => https://www.ajol.info/index.php/asan/issue/view/17048
            [1] => https://www.ajol.info/index.php/asan/issue/view/16693
        )

)

How do i combine this array with the array $newrefs so that i get citations for each volume issue?

On the other hand,does the problem lie with the mysql query above, because if i run the code manually for id 1 and later for id 2the citations are inserted as desired with corresponding volume_issuesunlike when i run the mysql query for both rows..

Any comments and/or ideas are highly appreciated

I can't help with the question, but can let you know there's an edit window of 30 minutes where you can change your post. After that, if it's something material to the question that can't be dealt with in a comment then you can always PM one of the admins or mods and we'll edit for you if deemed necessary. Hope that helps a bit for future endeavours.

Still stuck here. Desperate for suggestions not necessarily answers

Probably not the most efficient way, but if it is working for individual id values then enclose your code in a loop that iterates over the range of values and change your SQL statement from using BETWEEN to = , like so:

$start = 1;
$end = 2;
for ($tmpID = $start; $tmpID <= $end; $tmpID++) {

$query1 = "SELECT * FROM volume_issue where id = {$tmpID}";
$sql = $con->prepare($query1);
$sql->execute();
$sql->SetFetchMode(PDO::FETCH_ASSOC);
while ($row = $sql->fetch()) {
    $volume_issue_id = $row['id'];
    $url = $row['url'];
    $volume_issue = $row['volume_issue'];
    $journal_title = $row['journal_title'];
    $html2 = file_get_html($url);
    $html2 = file_get_html($url);
    foreach ($html2->find('table[class="tocArticle"]') as $div2) {
        $DOM = new DOMDocument();
        $DOM->loadHTML($div2);
        $Detail = $DOM->getElementsByTagName('td');
        $i = 0;
        $j = 0;
        foreach ($Detail as $sNodeDetail) {
            $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
            $j = $i % count($aDataTableDetailHTML[$j]) == 0 ? $j + 1 : $j;
        }
    }
}
$result = [];
foreach ($aDataTableDetailHTML as $key => $value) {
    foreach ($value as $key1 => $value1) {
        echo '<pre>';
        $result[$key1][] = $value1;
    }
}
array_walk($result, function(&$item) {
    $item = implode('|', $item);
});
foreach ($result as $ref) {
    $newref[] = explode('|', $ref);
}
print_r($newref);
foreach ($newref as $key => $values) {
    $title = $values[0];
    $authors = str_replace(',            ',', ',$values[2]);
    $pages = $values[3];
    $doi = $values[4];
    $query2 = "INSERT INTO citations (title,authors,journal_title,volume_issue,pages,doi) VALUES (:title,:authors,:journal_title,:volume_issue,:pages,:doi)";
    $sql2 = $con->prepare($query2);
    $sql2->execute(array(
        ':title' => $title,
        ':authors' => $authors,
        ':journal_title' => $journal_title,
        ':volume_issue' => $volume_issue,
        ':pages' => $pages,
        ':doi' => $doi,
    ));
}

}

Ok, here's my take. I don't want to "give a solution", but a few ideas...

<?php
/*
 * To use simple_html_dom on remote sites...
 * Remember to remove offset parameter on line 76 (or alternatively set its default value to 0):
 * FROM
 * $contents = file_get_contents($url, $use_include_path, $context, offset);
 * TO
 * $contents = file_get_contents($url, $use_include_path, $context);
 *
 */
require_once 'simple_html_dom.php';

function getHTML( $id, $url )
{
    $scrape = file_get_html( $url );
    $data = [];
    foreach ($scrape->find('table[class="tocArticle"] tbody') as $html) {
        $data[] = $id;
        //title
        $data[]  = $html->children(0)->children(0)->plaintext;
        //assuming you want download url of document of interest (DOI??)
        $data[] = $html->children(0)->children(1)->children(1)->href;
        //authors
        $data[] = preg_replace('# {2,}#', ' ', $html->children(1)->children(0)->plaintext );
        //pages
        $data[] = $html->chidren(1)->children(1)->plaintext;
    }
    $html->clear();
    unset($html);
    return $data;
}

function getRecords( $con, $first, $last )
{
    $query1 = "SELECT id, url FROM volume_issue where id between :first and :last";
    $sql = $con->prepare($query1);
    $sql->execute([':first' => $first, ':last' => $last]);
    $insert = [];
    while ($row = $sql->fetch(PDO::FETCH_ASSOC)) {
        //Careful this function may cause script to time out if many iterations
        $insert[] = getHTML($row['id'], $row['url']);
    }
    return $insert;
}

//Inserts citations to Citations table with structure...
// id | vi_id (FK on volume_issue table) | title | authors | doi | pages
// Notice no repeated data from first table - this is not required in a relational setting
function addCitations( $con, $valueArray )
{
    $records = count( $valueArray )/5;
    $values = rtrim( str_repeat('(?,?,?,?,?),', $records ), ',' );
    $query2 = "INSERT INTO citations (vi_id, title, authors, doi, pages) VALUES $values";
    $sql = $con->prepare( $query2 );
    return $sql->execute( $valueArray );
}

//Basic run code...
//Do basic PDO creation for $con
$insertArray = getRecords( $con, 1, 2 );
if(!empty($insertArray)){
    if(addCitations( $con, $insertArray )){
        echo "Success";
    }else{
        echo "Failed";
    }
}

You'll notice the different structure to the citations table - not sure if this is what you need.
Using simple_dom solely, not using DOMDocument.
I've chunked up the code into a few clunky functions for clarity - OOP may be even better, depending on the rest of your code.
There is no error checking or type checking - you really need this.
IMPT - This is not tested in its entirety

//EDIT
Getting html from remote sites can be very time consuming, so this script could time out really easily.

Thanks @rbjolly and @alan.davies. I am studying your suggestions and will get back here soon

I have just tested both solutions and here are the results.

  1. The first suggestion by @rbjolly does not produce desired results rather it loops twice and insert records twice in the table.

2.The second suggestion by @alan.davies returns a fatal error

Fatal error: Uncaught Error: Call to undefined method simple_html_dom_node::chidren() in /var/www/parser16.php:26 Stack trace: #0 /var/www/parser16.php(41): getHTML('1', 'https://www.ajo...') #1 /var/www/parser16.php(60): getRecords(Object(PDO), 1, 2) #2 {main} thrown in /var/www/parser16.php on line 26

I have also noticed that when i do a print_r on line 18 and line 24 i get the value of the first record id and the first record author respectively. A print_r on line 20 and line 26 returns a blank.

Hi gurjit. As I said it was not tested in its entirety. I will have another look at it soon.

OK, a few errors - hopefully sorted.
There was a need to flatten the data from the db so there's a new "flatten" function.
Also a typo chidren to children
And the DOI link should have the children index = 2 NOT 1!
Also need to swap the fields around in the loop, so that they the same as the INSERT query.

Here goes...

<?php
/*
 * To use simple_html_dom on remote sites...
 * Remember to remove offset parameter on line 76 (or alternatively set its default value to 0):
 * FROM
 * $contents = file_get_contents($url, $use_include_path, $context, offset);
 * TO
 * $contents = file_get_contents($url, $use_include_path, $context);
 *
 */
require_once 'simple_html_dom.php';

function getHTML( $id, $url )
{
    $scrape = file_get_html( $url );
    $data = [];
    foreach ($scrape->find('table[class="tocArticle"] tbody') as $html) {
        $data[] = $id;
        //title
        $data[]  = $html->children(0)->children(0)->plaintext;
        //authors
        $data[] = preg_replace('# {2,}#', ' ', $html->children(1)->children(0)->plaintext );
        //assuming you want download url of document of interest (DOI??)
        $data[] = $html->children(0)->children(1)->children(2)->href;
        //pages
        $data[] = $html->children(1)->children(1)->plaintext;
    }
    $html->clear();
    unset($html);
    return $data;
}

function flatten( array $array ) {
    $return = array();
    array_walk_recursive($array, function($a) use (&$return) { $return[] = $a; } );
    return $return;
}

function getRecords( $con, $first, $last )
{
    $query1 = "SELECT id, url FROM volume_issue where id between :first and :last";
    $sql = $con->prepare($query1);
    $sql->execute([':first' => $first, ':last' => $last]);
    $insert = [];
    while ($row = $sql->fetch(PDO::FETCH_ASSOC)) {
        //Careful this function may cause script to time out if many iterations
        $insert[] = getHTML($row['id'], $row['url']);
    }
    return flatten( $insert );
}

//Inserts citations to Citations table with structure...
// id | vi_id (FK on volume_issue table) | title | authors | doi | pages
// Notice no repeated data from first table - this is not required in a relational setting
function addCitations( $con, $valueArray )
{
    $records = count( $valueArray )/5;
    $values = rtrim( str_repeat('(?,?,?,?,?),', $records ), ',' );
    $query2 = "INSERT INTO citations (vi_id, title, authors, doi, pages) VALUES $values";
    $sql = $con->prepare( $query2 );
    print_r($valueArray);
    return $sql->execute( $valueArray );
}
//Basic run code...
//Do basic PDO creation for $con
$dsn = 'mysql:dbname=dw;host=localhost';
$user = 'xxxxx';
$password = 'xxxxx';

try {
    $con = new PDO($dsn, $user, $password);
} catch (PDOException $e) {
    echo 'Connection failed: ' . $e->getMessage();
}

$insertArray = getRecords( $con, 1, 2 );

if(!empty($insertArray)){
    if(addCitations( $con, $insertArray )){
        echo "Success";
    }else{
        echo "Failed";
    }
}

This worked for me...

snip0.PNG

snip1.PNG

commented: I have no clue what's going on here, but glad have some PHP expertise back here on the forum, Diafol :) +7

Your solution works like magic @alan.davies. Quite complex and advanced thinking/Php skills you have got. Thank you very much.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.