cURL won't scrape this page. Why?

Question

Ryujin 27 Newbie Poster

10 Years Ago

Greetings. Trying to scrape data from search results in a library catalog, but cannot return anything at all. The same script below works fine pulling from another catalog, but not with this one. (It's a Voyager catalog by ExLibris, in case that helps.)

Below for simplicity is a boiled-down version of the script, with all scraping functions removed. The script runs on this page.

As you might already know, lots of library catalogs generate session URLs. But that is not the issue in this case. The script won't even scrape the URL of the catalog's 'home page,' the first link above.

Is there a way to diagnose what the catalog server is sending that prevents returning its HTML? And then to properly set a CURLOPT to overcome that?

Thank you for your thoughts!

<?php    
    function curl($url) {
         $options = Array(
            CURLOPT_RETURNTRANSFER  => TRUE,   
            CURLOPT_FOLLOWLOCATION  => TRUE,   
            CURLOPT_AUTOREFERER     => TRUE,  
            CURLOPT_CONNECTTIMEOUT  => 90,    
            CURLOPT_TIMEOUT         => 90,   
            CURLOPT_MAXREDIRS       => 10,  
            CURLOPT_URL             => $url,  
            CURLOPT_HEADER         => false,         
            CURLOPT_ENCODING       => "",            
            CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13",    
            CURLOPT_POST           => 1,           
            CURLOPT_POSTFIELDS     => $curl_data,     
            CURLOPT_SSL_VERIFYHOST => 0,             
            CURLOPT_SSL_VERIFYPEER => false,     
            CURLOPT_VERBOSE        => 1             
        );

        $ch = curl_init();   
        curl_setopt_array($ch, $options);    
        $data = curl_exec($ch);  
        curl_close($ch);    
        return $data;    
    }
        //SETS UP A (STABLE) URL OF A SEARCH RESULTS PAGE:
        $DDCnumber = 873;
        $url = "http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi?DB=local&CNT=90&Search_Arg=" . $DDCnumber . "&Search_Code=CALL%2B&submit.x=23&submit.y=23"; 
          echo "The URL we'd like to scrape is " . $url . "<br />";      
        $results_page = curl($url);

      if ($results_page != "")    {echo "Something was retrieved"; }

    ?>

php


diafol

10 Years Ago

You could also try file_get_contents() using a stream_context_create(). I did notice that the search on one of your links was exceptionally slow.

//EDIT:

$data = array(
    'DB'            => 'local',
    'CNT'           => 90,
    'Search_Arg'    => 873,
    'Search_Code'   => 'CALL',
    'submit.x'      => 23,
    'submit.y'      => 23
    );

$postdata = http_build_query($data);

$options = array('http' =>
    array(
        'method'  => 'POST',
        'header'  => 'Content-type: application/x-www-form-urlencoded',
        'content' => $postdata
    )
);

$context  = stream_context_create($options);

$result = file_get_contents('http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First', false, $context);

echo $result;

Seemed to work for me.

Edited 10 Years Ago by diafol

cereal 1,524 Nearly a Senior Poster

10 Years Ago

@Ryujin

In addition to diafol's suggestion :)

There is no specific rule; you have to try the forms to understand how they work. In the vufind case the first problem is the $url: you're targeting the form page, but the action of the form points to another page:

<form method="get" action="/vufind/Search/Results" id="advSearchForm" name="searchForm" class="search">

so to get results $url must be:

https://vf-kutz.klnpa.org/vufind/Search/Results

The method must be GET, because they may also verify the request method. Since this is an advanced search form, the receiving script is going to check for more variables. For example, when searching for math & euler, the minimum query string to send is this:

$data = array(

    'join'              => 'AND',
    'bool0'             => array(
        'AND'
        ),

    'lookfor0'          => array(
        'math',
        'euler',
        ''
        ),

    'type0'             => array(
        'AllFields',
        'AllFields',
        'AllFields'
        ),

    'sort'              => 'relevance',
    'submit'            => 'Find',
    'illustration'      => -1,
    'daterange'         => array(
        'publishDate'
        ),

    'publishDatefrom'   => '',
    'publishDateto'     => ''

);

$body = http_build_query($data);
$url  = "https://vf-kutz.klnpa.org/vufind/Search/Results?".$body;
$results_page = curl($url);

Note that lookfor0[], bool0[], type0[] and daterange[] are all arrays. For example, lookfor0[] can be declared multiple times, but on one condition: each entry has to be matched by a corresponding type0[] entry. The above can be rewritten like this:

$data = array(

    'join'              => 'AND',
    'bool0[]'           => 'AND',
    'lookfor0[0]'       => 'math',
    'lookfor0[1]'       => 'euler',
    'lookfor0[2]'       => '',
    'type0[0]'          => 'AllFields',
    'type0[1]'          => 'AllFields',
    'type0[2]'          => 'AllFields',
    'sort'              => 'relevance',
    'submit'            => 'Find',
    'illustration'      => -1,
    'daterange[]'       => 'publishDate',
    'publishDatefrom'   => '',
    'publishDateto'     => ''

);

If you're using this last syntax with http_build_query(), it's important to add the numbers (i.e. lookfor0[2]). Otherwise the duplicate keys overwrite each other in the PHP array and only the last value reaches the query string.
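A quick way to check that the two forms are equivalent is to build both query strings and compare them. This is only a sketch with a reduced field list; the variable names ($nested, $flat, $collided) are made up for the example.

```php
<?php
// Nested form: http_build_query() numbers the elements itself.
$nested = array(
    'lookfor0' => array('math', 'euler', ''),
    'type0'    => array('AllFields', 'AllFields', 'AllFields'),
);

// Flat form: the index is written into each key by hand.
$flat = array(
    'lookfor0[0]' => 'math',
    'lookfor0[1]' => 'euler',
    'lookfor0[2]' => '',
    'type0[0]'    => 'AllFields',
    'type0[1]'    => 'AllFields',
    'type0[2]'    => 'AllFields',
);

$nested_query = http_build_query($nested);
$flat_query   = http_build_query($flat);

// Both forms should yield the same string (brackets come out percent-encoded).
var_dump($nested_query === $flat_query);

// Without the numbers, the keys collide inside the PHP array itself:
// the second 'lookfor0[]' replaces the first before any query is built.
$collided = array(
    'lookfor0[]' => 'math',
    'lookfor0[]' => 'euler',
);
var_dump(count($collided));
echo http_build_query($collided), "\n";
```

The collision happens when the array is created, so no option passed to http_build_query() can bring the lost value back.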

In practice, in the vufind case you have to look at the results page and check how the links are formed to understand how to pull out results; it's on this page that the queries are made.

Hope it helps.

Edited 10 Years Ago by cereal

diafol commented: sorry c, heh heh, didn't mean to wander off in another direction +15

diafol

10 Years Ago

Has the site blocked your IP?

cereal 1,524 Nearly a Senior Poster

10 Years Ago

Ok, here you can see the code and play the example:

http://runnable.com/VG_QpOHSMpE82e3u/search-with-curl

Just click on run and then submit the form.

diafol commented: Thanks for the url - new site on me - runnable. Brilliant +15

diafol

10 Years Ago

My thoughts were that you've probably been making loads of curl calls and the site may have picked up on this in their logs from a massive jump in traffic. Possibly they thought - "aha malicious b#stard" and blocked you.


cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 1 · 2014-11-15T09:33:22+00:00

The problem here is that you are defining the array of options for curl without including the POST data:

CURLOPT_POSTFIELDS     => $curl_data,

$curl_data is not defined anywhere, and you're including the input in the URL, so this becomes a POST request with an empty body and the variables in the query string (the GET array on the receiving side). What you can do is to prepare the body like this:

$DDCnumber = 873;
$url = "http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi";

$data = array(
    'DB'            => 'local',
    'CNT'           => 90,
    'Search_Arg'    => $DDCnumber,
    'Search_Code'   => 'CALL',
    'submit.x'      => 23,
    'submit.y'      => 23
    );

$body = http_build_query($data);
$results_page = curl($url, $body);

And add an argument to the function:

function curl($url, $curl_data) {

then it should work fine.
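Put together, the change might look like the sketch below. The `$curl_data = null` default and the `curl_options()` helper are additions for illustration and are not part of the original script. The idea is that the POST options are only set when a body is actually passed, so the same function can still do a plain GET.

```php
<?php
// Hypothetical helper (not in the original script): builds the option array
// so that the POST options are present only when a body is supplied.
function curl_options($url, $curl_data = null) {
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_CONNECTTIMEOUT => 90,
        CURLOPT_TIMEOUT        => 90,
    );

    if ($curl_data !== null) {
        $options[CURLOPT_POST]       = true;
        // Expects the urlencoded string produced by http_build_query().
        $options[CURLOPT_POSTFIELDS] = $curl_data;
    }

    return $options;
}

function curl($url, $curl_data = null) {
    $ch = curl_init();
    curl_setopt_array($ch, curl_options($url, $curl_data));
    // curl_exec() returns false on failure; the handle is released
    // when $ch goes out of scope at the end of the function.
    return curl_exec($ch);
}

// Usage, with $url and $body as prepared above:
// $results_page = curl($url, $body);
```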

Ryujin 27 Newbie Poster · Answer 2 · 2014-11-17T22:03:21+00:00

Thank you for that, but there is something else going on. Note that the same script works with one catalog, but not with another. Here is a demo; it makes use of the less accurate catalog.

Boiled down to the bones, with no search strings: The scripts below are identical, but the 1st brings back nothing while the 2nd retrieves the target page.

The code below is at this page:

<?php    

    function curl($url) {

        $options = Array(
            CURLOPT_RETURNTRANSFER => TRUE,   
            CURLOPT_FOLLOWLOCATION => TRUE,   
            CURLOPT_AUTOREFERER => TRUE,  
            CURLOPT_CONNECTTIMEOUT => 120,    
            CURLOPT_TIMEOUT => 120,   
            CURLOPT_MAXREDIRS => 10,  
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
            CURLOPT_URL => $url,  
        );

        $ch = curl_init();   
        curl_setopt_array($ch, $options);   
        $data = curl_exec($ch);  
        curl_close($ch);     
        return $data;    
    }

       $url = "http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First";   
     $results_page = curl($url); 

     echo $results_page;     

    ?>

The code below is at this page:

<?php    

function curl($url) {

        $options = Array(
            CURLOPT_RETURNTRANSFER => TRUE, 
            CURLOPT_FOLLOWLOCATION => TRUE, 
            CURLOPT_AUTOREFERER => TRUE, 
            CURLOPT_CONNECTTIMEOUT => 120,  
            CURLOPT_TIMEOUT => 120,  
            CURLOPT_MAXREDIRS => 10,  
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  
            CURLOPT_URL => $url, 
        );

        $ch = curl_init();  
        curl_setopt_array($ch, $options);  
        $data = curl_exec($ch); 
        curl_close($ch);   
        return $data;   
    }

     $url = "https://vf-kutz.klnpa.org/vufind/Search/Advanced";  
    $results_page = curl($url);  

     echo $results_page;         

    ?>

cereal, I do appreciate your pointing out my error with $curl_data, which was very instructive for me. I implemented the fix you showed me but to no avail. The questions remain: What is it that makes the one target so different from the other, and can it be overcome?

Ryujin 27 Newbie Poster · Answer 3 · 2014-11-19T22:31:30+00:00

Thanks again folks. Diafol, now I'm feeling extra-clueless, because the script you posted with the note "Seemed to work for me" doesn't work for me unless I switch out the pilot.passhe.edu URL for something else. (Then, it works.) Is there a test page where you can show me that in action with Pilot?

Ryujin 27 Newbie Poster · Answer 4 · 2014-11-21T15:51:03+00:00

Hi, no it hasn't.
In briefest of terms: what I'm asking is, what is there about http://pilot.passhe.edu:8042/ ... that makes it different from other sites that cURL readily scrapes? (Or are you guys saying that you were actually able to return something from http://pilot.passhe.edu:8042/ ?)

I'm so sorry for confusing the issue by starting from search results: anything there, even the bare initial search page itself, returns zero, as far as i can see! (Nonetheless, I have learned from your explicative code examples.)
Thanks ~

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 5 · 2014-11-21T18:17:16+00:00

(Or are you guys saying that you were actually able to return something from http://pilot.passhe.edu:8042/ ?)

(yes), it works fine for me, I'm attaching my full examples. If preferred I can paste all the code here.

what I'm asking is, what is there about http://pilot.passhe.edu:8042/ ... that makes it different from other sites that cURL readily scrapes?

Since it works, for me, there is no reason why this link is different from others.

Ryujin 27 Newbie Poster · Answer 6 · 2014-11-21T20:16:19+00:00

Thanks, @cereal -- Can you link to a tiny working example, that pulls from //pilot...? Because, for the life of me, I'm just not seeing it...

Ryujin 27 Newbie Poster · Answer 7 · 2014-11-22T16:09:10+00:00

That is absolutely awesome, cereal. Beautifully done. Thank you!
I'm eager to get home & start working with it.

Ryujin 27 Newbie Poster · Answer 8 · 2014-11-24T11:24:42+00:00

Brilliant of you to put it on Runnable, thank you so much.
When it still failed on my Bluehost server i did phpinfo() on both.
Runnable: PHP Version - 5.4.9-4ubuntu2.3; cURL Information - 7.29.0
Bluehost: PHP Version - 5.2.17; cURL Information - libcurl/7.24.0

So i'm guessing this explains the difference?

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 9 · 2014-11-24T12:06:12+00:00

I think your setup should support this script.

Check the PHP error log on Bluehost and, if this doesn't help, check the access and error logs of Apache (or of whichever web server is in use). From there you can tell whether the problem is PHP or, as suggested, whether access is failing from your server.

A simple test you can do:

<?php

    $url = "http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi";
    print_r(get_headers($url));

Should return something like this:

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 24 Nov 2014 12:01:11 GMT
    [2] => Server: Apache
    [3] => Connection: close
    [4] => Content-Type: text/html
)

That way you at least confirm it's accessible from your server.
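If the request goes through cURL instead, the handle can report why a transfer failed, which says more than an empty return value. A minimal sketch, assuming the curl extension is loaded; the function name `curl_diagnose()` and the keys of the returned array are invented for this example:

```php
<?php
// Hypothetical helper: fetch a URL and report why it failed, if it did.
function curl_diagnose($url, $timeout = 10) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ));

    $body = curl_exec($ch); // false when the transfer failed

    return array(
        'ok'        => ($body !== false),
        'errno'     => curl_errno($ch),   // 0 means no cURL-level error
        'error'     => curl_error($ch),   // human-readable reason, '' if none
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'bytes'     => ($body === false) ? 0 : strlen($body),
    );
}

// Usage:
// print_r(curl_diagnose("http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi"));
```

A non-zero errno with an http_code of 0 suggests the request never reached the web server at all, so the cause is more likely on the network side than in the script's options.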

Ryujin 27 Newbie Poster · Answer 10 · 2014-11-25T17:17:39+00:00

Ah, so as diafol suspected, then, there's something that prevents my server from accessing the Pilot one. (Right?)

Nothing relevant in the PHP error log. Here's the output of the test you suggested, cereal:

Warning: get_headers(http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi) [function.get-headers]: failed to open stream: Connection timed out in /kutzutil/DDCweed/testHeaders.php on line 9

Also when i turn on PHP error reporting & try to run the scripts on Pilot, i get only "couldn't connect to host" errors.

Thank you guys for patiently & rationally diagnosing this. What i still don't understand is why there'd be such an 'incompatibility' between those servers but not others.

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 11 · 2014-11-25T19:48:29+00:00

Ah, so as diafol suspected, then, there's something that prevents my server from accessing the Pilot one. (Right?)

Yes, the error of get_headers() helps a bit: failed to open stream: Connection timed out.

I suspect the problem lies with Bluehost's DNS servers, which are:

ns1.bluehost.com
ns2.bluehost.com

By querying passhe.edu through the dig command we can see the domain is not resolved by their DNS servers, so it may not be reachable from your server:

dig @ns1.bluehost.com passhe.edu

; <<>> DiG 9.9.5-3-Ubuntu <<>> @ns1.bluehost.com passhe.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56885
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2800
;; QUESTION SECTION:
;passhe.edu.                    IN      A

;; Query time: 242 msec
;; SERVER: 74.220.195.31#53(74.220.195.31)
;; WHEN: Tue Nov 25 20:40:23 CET 2014
;; MSG SIZE  rcvd: 39

Whereas by testing through Google DNS we can see it:

dig @8.8.8.8 passhe.edu

; <<>> DiG 9.9.5-3-Ubuntu <<>> @8.8.8.8 passhe.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20522
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;passhe.edu.                    IN      A

;; ANSWER SECTION:
passhe.edu.             899     IN      A       204.235.147.180

;; Query time: 247 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Nov 25 20:43:44 CET 2014
;; MSG SIZE  rcvd: 55

To be sure, you can try running this command from a terminal session on BlueHost:

dig passhe.edu ANY

If this is the problem, then open a ticket with BlueHost and ask them to fix it.

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 12 · 2014-11-25T20:36:55+00:00

Forgot to add

klnpa.org is reachable from BlueHost DNS, so this may be the problem. Here's the output:

dig @ns1.bluehost.com klnpa.org

; <<>> DiG 9.9.5-3-Ubuntu <<>> @ns1.bluehost.com klnpa.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32087
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2800
;; QUESTION SECTION:
;klnpa.org.                     IN      A

;; ANSWER SECTION:
klnpa.org.              60      IN      A       74.220.199.6

;; Query time: 210 msec
;; SERVER: 74.220.195.31#53(74.220.195.31)
;; WHEN: Tue Nov 25 21:26:39 CET 2014
;; MSG SIZE  rcvd: 54

As you can see there is an Answer Section with the IP address of klnpa.org. Depending on the configuration of your hosting plan, maybe you can add a forwarder DNS to resolve all the domains correctly, but as suggested in my previous post you may want to ask BlueHost support.

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 13 · 2014-11-26T00:25:31+00:00

Sorry for my last post, I don't think it helps much. I'm seeing some weird results from the BlueHost DNS server: whatever you search through dig is redirected to the same IP address, 74.220.199.6, which is parking.bluehost.com. It doesn't match passhe.edu because they do not redirect any .edu domain. Some information here:

http://comments.gmane.org/gmane.network.dns.operations/3768

So I'm no longer sure this is related to your issue. For now, please disregard it.

Going back to your script: one last check you can do is to query their IP directly:

<?php

    $url = "http://204.235.148.32:8042/cgi-bin/Pwebrecon.cgi";
    print_r(get_headers($url));

With the IP it should work fine and temporarily fix your script. But if the problem is with your hosting DNS, it doesn't solve your issue when you try to perform a request to another edu domain, for example:

<?php

    $url = "https://search.library.cornell.edu/";
    print_r(get_headers($url));

Anyway, I would still try the dig commands from a terminal in your remote server to evaluate the output and ask if it's possible to enable a forwarding DNS. If the correct IP is displayed then it's not BlueHost but something else like diafol's suggestion.
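The same split between name resolution and connection can also be checked from PHP when no terminal is available. The sketch below is only an illustration: `check_outbound()` and its return keys are hypothetical names, the 5-second timeout is an arbitrary choice, and gethostbyname() only looks up IPv4 addresses.

```php
<?php
// Hypothetical helper: separates "name does not resolve" from
// "name resolves but the TCP connection cannot be opened".
function check_outbound($host, $port, $timeout = 5) {
    $is_ip = (filter_var($host, FILTER_VALIDATE_IP) !== false);

    // gethostbyname() returns its input unchanged when the lookup fails.
    $ip = $is_ip ? $host : gethostbyname($host);
    if (!$is_ip && $ip === $host) {
        return array('dns' => false, 'ip' => null, 'connected' => false,
                     'error' => "could not resolve $host");
    }

    $errno  = 0;
    $errstr = '';
    // @ silences the PHP warning; the reason ends up in $errno / $errstr.
    $sock = @fsockopen($ip, $port, $errno, $errstr, $timeout);
    if ($sock === false) {
        return array('dns' => true, 'ip' => $ip, 'connected' => false,
                     'error' => "$errstr ($errno)");
    }

    fclose($sock);
    return array('dns' => true, 'ip' => $ip, 'connected' => true, 'error' => '');
}

// Usage, comparing the catalog's port with another port on the same host:
// print_r(check_outbound('pilot.passhe.edu', 8042));
// print_r(check_outbound('pilot.passhe.edu', 80));
```

If dns comes back true but connected is false on 8042 only, DNS is probably not the culprit, and something between the two servers (for example a firewall rule or a closed outbound port) becomes the more likely explanation.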

Ryujin 27 Newbie Poster · Answer 14 · 2014-11-28T00:23:36+00:00

@diafol, that's a thought but no, since the few calls i've tried all failed from the beginning.

Have opened a ticket with Bluehost. @cereal, it's interesting because calls to other .edu sites do work; don't know if odd results here explain anything (page takes a long time to render as it waits for last two calls to time out).

i'll leave this discussion open till i can report what Bluehost says. But cereal, i'm in awe of the work you did to sort this out. When i hear from Bluehost (and hopefully what i'll hear is that they fixed it!) i will mark the question 'solved.'

Ryujin 27 Newbie Poster · Answer 15 · 2014-11-29T02:28:24+00:00

It turns out that the necessary port is closed. Response from Bluehost:

You cannot retrieve information from http://pilot.passhe.edu:8042/ Port 8042 is closed on your server. Running curl http://pilot.passhe.edu:8042/ on a server with it opened retrieves information. You have to log into cpanel and purchase a dedicated IP and contact us back to enable port 8042.

They have a knowledge base article explaining why they block ports on shared IPs and how to get a dedicated one, basically a surcharge of about $3/month.

Thank you again folks!

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 16 · 2014-11-29T10:00:56+00:00

Too bad I didn't read their docs! Thanks for sharing the solution. Bye ;D