Retrieve html/DOM using Selenium/PHPUnit and PHP/cURL

Question

PsychicTide 39 Junior Poster

10 Years Ago

Hey guys, this seems to be a thorn in my side. I've been working on scraping a website which uses aspx and has eventvalidation/viewstate inputs. Every other scraping experiment I've made was not this difficult. Maybe one of you geniuses here at Daniweb has an idea of how to solve this?

I've managed to get Selenium/PHPUnit to automate opening the browser, typing in the URL, filling out the required fields (today's date), and landing on the page I need to scrape. I can get the __VIEWSTATE and __EVENTVALIDATION values from any of those pages.

I've spent hours researching ways to scrape the resulting page and have come up with several (useless?) ideas: finding a PHP function that can scrape the currently active (fully loaded) page, using XML/JavaScript to make an XMLHttpRequest, and a few other random ones. None of them has worked correctly so far, obviously.

I'm now trying to figure out how cURL works so I can 'emulate' a live user, but I have no idea how to structure the request for my specific case. I can use Firefox/Firebug and look at the Network tab, which shows the request headers and request body. I can even right-click the request and choose 'Copy as cURL', but I have no idea what to do with any of these values. It appears the values I need are __VIEWSTATE and __EVENTVALIDATION (which I have put into variables), the current date twice (txtStartDate and txtThruDate), and btnSearch=search. As far as I know, that's it. Once I get the HTML in DOM form, I already have the code to scrape it using Simple HTML DOM Parser. I've looked at this link, which seems pretty close to what I need, but I'm not sure how to format it: http://stackoverflow.com/questions/15337197/trying-to-connect-to-aspx-site-using-curl (specifically, the first answer seems like it might be the right way of looking at it).

If anyone has any idea what I'm talking about or wants me to clarify anything, please let me know! I've lost a lot of hair over this one.
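
The form fields listed above can be assembled into a URL-encoded POST body with PHP's built-in http_build_query(). This is only a minimal sketch: the helper name buildSearchBody() and the token values are made up for illustration, and the btnSearch value is taken from the browser request quoted later in the thread. The point it shows is that the base64 characters in the tokens (+, / and =) and the slashes in the dates get percent-encoded in one step.

```php
<?php
// Hypothetical helper (not from the thread): builds the form body for the
// search POST. Field names are the ones listed in the question.
function buildSearchBody(string $viewState, string $validation, string $date): string
{
    // http_build_query() percent-encodes every value, so raw (unencoded)
    // values must be passed in, e.g. "04/20/2015" and not "04%2F20%2F2015".
    return http_build_query([
        '__VIEWSTATE'       => $viewState,
        '__EVENTVALIDATION' => $validation,
        'txtStartDate'      => $date,
        'txtThruDate'       => $date,
        'btnSearch'         => 'Search',
    ]);
}

// Placeholder tokens; real ones are much longer and come from the page itself.
$body = buildSearchBody('/wEPDwUK+abc=', '/wEWAgL+xyz=', '04/20/2015');
echo $body, PHP_EOL;
```

The resulting string can be passed as-is to CURLOPT_POSTFIELDS.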

asp.net html-css php

Edited 10 Years Ago by PsychicTide



PsychicTide 39 Junior Poster · Answer 1 · 2015-04-20T05:41:19+00:00

I've tried several things based off of what I'm working on now (add or remove things, try different functions in replacement)... Anyone know if I'm anywhere close?

$url = "http://XXX.XXX.XXX.XX/search.aspx";
$dateCheck = $dateParam = date('m').'%2F'.date('d').'%2F'.date('Y');
$ch = curl_init($url);
$ckfile = tempnam("/tmp", "CURLCOOKIE");
$useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0';
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_USERAGENT, $useragent);

// Return the response as a string instead of outputting it to the screen
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
//$result = curl_exec($ch);
// Check for errors (note: nothing has been executed on this handle yet)
if (curl_errno($ch))
{
    $result = 'cURL ERROR -> ' . curl_errno($ch) . ': ' . curl_error($ch);
    die(curl_error($ch));
}
else
{
    $returnCode = (int)curl_getinfo($ch, CURLINFO_HTTP_CODE);
    switch($returnCode)
    {
        case 200:
            break;
        default:
            $result = 'HTTP ERROR -> ' . $returnCode;
            break;
    }
}
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return the web page as a string
    CURLOPT_HEADER => true, // include the response headers in the output
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_ENCODING => "", // handle all encodings
    CURLOPT_USERAGENT => $useragent, // who am i
    CURLOPT_AUTOREFERER => true, // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
    CURLOPT_TIMEOUT => 120, // timeout on response
    CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => '__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE='.$viewState.'&__EVENTVALIDATION='.$validation.'&txtStartDate='.$dateCheck.'&txtThruDate='.$dateCheck.'&btnSearch=Search'
);
$ch = curl_init($url); // new handle; the cookie jar set on the first handle is not carried over
curl_setopt_array($ch, $options);
$result = curl_exec($ch);

This seems to return a 500 server runtime error, with no custom error details or hints (I checked by writing the result to a text file).
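
A few things in the snippet above may be worth ruling out. The error check runs before anything has been executed. The second curl_init() starts a fresh handle without the cookie jar. The token values are also concatenated into the body without URL-encoding, and base64 characters such as + change meaning when sent raw, which could be enough for the server to reject the view state. Whether any of these is the cause of the 500 here is a guess. The sketch below uses one handle and checks for errors after curl_exec(). The function name postSearch() is hypothetical, and $body is assumed to be already URL-encoded, for example by http_build_query().

```php
<?php
// Hypothetical helper (not from the thread): POST an already URL-encoded
// body on a single handle and check for errors after the request has run.
function postSearch(string $url, string $body, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body as a string
        CURLOPT_HEADER         => false,       // keep response headers out of the HTML
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_ENCODING       => '',          // accept any encoding cURL supports
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies saved by earlier requests
        CURLOPT_REFERER        => $url,        // the browser request used the page itself
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        // Transport-level failure (DNS, timeout, connection refused, ...).
        throw new RuntimeException('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }

    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status !== 200) {
        throw new RuntimeException('HTTP error ' . $status);
    }

    return $html;
}
```

A call might look like postSearch($url, $body, tempnam(sys_get_temp_dir(), 'CURLCOOKIE')), with the returned HTML handed to the DOM parser.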

PsychicTide 39 Junior Poster · Answer 2 · 2015-04-20T06:14:36+00:00

This is what I get when I right-click the POST under the Network tab and choose 'Copy as cURL' with dev tools open:

curl "http://XXX.XXX.XXX.XX/search.aspx"
  -H "Host: XXX.XXX.XXX.XX"
  -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0"
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  -H "Accept-Language: en-US,en;q=0.5"
  --compressed
  -H "Referer: http://XXX.XXX.XXX.XX/search.aspx"
  -H "Connection: keep-alive"
  --data "__VIEWSTATE=string&__EVENTVALIDATION=string&txtStartDate=04"%"2F20"%"2F2015&txtThruDate=04"%"2F20"%"2F2015&btnSearch=Search"
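
The "%" sequences in the copied --data string appear to be how the browser escapes percent signs for the Windows command prompt; they are not part of the request itself. A small sketch, assuming that reading is right, strips the quoting and decodes the body to show the five fields the browser actually sent. The token values stay as the "string" placeholders from the post.

```php
<?php
// Body as copied on Windows, with the long token values replaced by "string"
// exactly as in the post.
$copied = '__VIEWSTATE=string&__EVENTVALIDATION=string'
        . '&txtStartDate=04"%"2F20"%"2F2015&txtThruDate=04"%"2F20"%"2F2015'
        . '&btnSearch=Search';

// Undo the shell quoting around each percent sign.
$raw = str_replace('"%"', '%', $copied);

// Decode the URL-encoded name=value pairs into an associative array.
parse_str($raw, $fields);

print_r($fields);
```

Seen this way, the browser sends the dates as 04/20/2015 in URL-encoded form, and __EVENTTARGET and __EVENTARGUMENT are not part of this particular request.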

How I get the viewstate and eventvalidation values (these seem to be correct when echoed):

protected function setUp()
  {
    $this->setBrowser("chrome");
    $this->setBrowserUrl("http://XXX.XXX.XXX.XX/");
  }

...

public function testMyTestCase()
{
    $dateParam = date('m').'/'.date('d').'/'.date('Y');
    $this->open("/search.aspx");
    $this->type("id=txtStartDate", $dateParam);
    $this->type("id=txtThruDate", $dateParam);
    $this->click("id=btnSearch");
    $this->waitForPageToLoad("30000");
    $viewState = $this->getAttribute("css=input[name='__VIEWSTATE']@value");
    $validation = $this->getAttribute("css=input[name='__EVENTVALIDATION']@value");
    $url = "http://XXX.XXX.XXX.XX/search.aspx";
    $dateCheck = $dateParam;
    $ch = curl_init($url);
    ...
    Leads into the first code segment I posted...

Note: I also fixed the $dateCheck variable so that it matches $dateParam, as it should have from the start.
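
One thing that might matter with the approach above is that the tokens are read in a Selenium browser session, while cURL starts its own session with its own cookies. A possible alternative is to fetch search.aspx with cURL first and read the hidden inputs from that response, so both requests share one cookie jar. The sketch below covers only the extraction step, using PHP's built-in DOMDocument rather than Simple HTML DOM Parser. The function name extractHiddenFields() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): collect every hidden input on a
// page as name => value pairs.
function extractHiddenFields(string $html): array
{
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Made-up sample markup standing in for the real search.aspx response.
$sample = '<form>'
        . '<input type="hidden" name="__VIEWSTATE" value="/wEPDwUK+abc=" />'
        . '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL+xyz=" />'
        . '<input type="text" name="txtStartDate" />'
        . '</form>';

print_r(extractHiddenFields($sample));
```

The returned array holds the raw token values, which still need URL-encoding before they go into a POST body.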