Is there a way to let a page load for a minute and then fetch its content?

I am using

new DOMDocument

method to fetch content from a page, which works fine. The problem is that I want to let the page load for a minute, because the page I am fetching data from only shows its results after 30 seconds, and PHP's DOMDocument fetches the data before the whole page has loaded.
So is there a way?

If the page loads data via AJAX this may prove impossible.

But I checked the source code; the data is accessible to me and I can fetch it. All I need is to let the page load for a minute and then fetch it. Is there a way?

But I checked the source code; the data is accessible to me and I can fetch it

If you already have the source code and the data, what do you need changed? I'm confused.

I mean the source code as seen in the browser, not the real source code. But anyway, tell me how I can run file_get_contents() for a minute. Just tell me this and forget everything else.

My original reply still stands. If that data is being retrieved via JavaScript, you can't get at it, even if you wait for a minute.
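To illustrate why waiting does not help, here is a minimal sketch. It assumes a local temporary file standing in for the remote page; the file prefix, the markup, and the one-second pause are illustrative and not taken from the thread. PHP only receives the markup as served, so pausing between reads changes nothing.

```php
<?php
// Hypothetical stand-in for the remote page: static markup plus a script
// that a browser would run, but PHP never will.
$page = '<p>Original content</p>'
      . '<script>document.querySelector("p").textContent = "JS content";</script>';

$file = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($file, $page);

$first = file_get_contents($file);   // read immediately
sleep(1);                            // "wait for the page to load"
$second = file_get_contents($file);  // read again after waiting

unlink($file);

// Both reads return the same unexecuted source.
var_dump($first === $second);
var_dump(strpos($second, '<p>Original content</p>') !== false);
```

The same holds for a remote URL: a longer wait only changes when PHP reads the response, not what the server sends.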

diafol

OK, here's a demo. Content replaced by vanilla JS appears to work if the fetched page is echoed, but as pritaeas states, ajaxed content does not. However, DOMDocument will not pick up on the JavaScript replacement.

demos.diafol.org/scrape/... files

jsreplace.php

<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Original content</p>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script>$('p').html('JS content');</script>
</body>
</html>

ajaxreplace.php

<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Original content</p>

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script>
    var ret = $.get('ajax.php',function(data)
    {
        $('p').html(data);  
    });

</script>
</body>
</html>

ajax.php

Ajaxed content

So you can try to scrape thus...

<?php echo file_get_contents('http://demos.diafol.org/scrape/jsreplace.php');?>

You should see 'JS content'. You may even see the replacement taking place, with a quick flash of the 'Original content' text.

<?php echo file_get_contents('http://demos.diafol.org/scrape/ajaxreplace.php');?>

You should see 'Original content'.

But the apparent 'success' from the JS replacement is just the JS running after the fact.
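One way to check this for yourself is to escape the fetched markup before echoing it, so the browser displays the source instead of interpreting it. The sketch below is only an illustration: `$fetched` is a hypothetical stand-in for the string that file_get_contents() would return, written inline so that no network access is needed.

```php
<?php
// Hypothetical stand-in for the string file_get_contents() returns.
$fetched = '<p>Original content</p><script>$("p").html("JS content");</script>';

// Escaping the markup stops the browser from interpreting it, so the page
// displays exactly what PHP received: the original paragraph and an
// unexecuted script tag.
$visible = htmlspecialchars($fetched);

echo '<pre>', $visible, '</pre>';
```

With the escaping in place, the browser has no script to run, so the 'Original content' text stays on screen.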

DOMDocument cannot foresee this replacement.

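<?php
// DOMDocument parses the HTML exactly as served; it never runs the page's scripts.
$baseUrl = 'http://demos.diafol.org/scrape/jsreplace.php';
$domDoc = new DOMDocument();
$domDoc->strictErrorChecking = false;
$domDoc->recover = true;
@$domDoc->loadHTMLFile($baseUrl); // @ hides warnings about malformed markup

// Print the text of every <p> element found in the served markup.
$ps = $domDoc->getElementsByTagName('p');
foreach ($ps as $p) {
    echo $p->nodeValue;
}
?>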

You should see 'Original content'
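The same result can be reproduced without touching the network. This is a minimal sketch that assumes an inline copy of the jsreplace.php markup (with the jQuery include left out, since nothing executes it anyway). It uses libxml_use_internal_errors() in place of the @ operator, which is a swapped-in choice and not part of the demo above.

```php
<?php
// Inline copy of the jsreplace.php markup, so no network access is needed.
$html = <<<'HTML'
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Original content</p>
<script>$('p').html('JS content');</script>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$paragraph = $doc->getElementsByTagName('p')->item(0)->nodeValue;
$script    = $doc->getElementsByTagName('script')->item(0)->nodeValue;

echo $paragraph, PHP_EOL; // the paragraph text as served
echo $script, PHP_EOL;    // the script is plain text in the tree, never executed
```

The script element is present in the parsed tree, but only as text. Nothing in DOMDocument evaluates it, so the paragraph keeps its served value.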

Hope that clears it up.

In my experience, it's more effective to scrape content (especially when it's complex) using a local program rather than PHP. This gives you the same vantage point as a user sitting in front of a browser, so you see everything regardless of what technology was used to put it there. I have used AutoIt for this because it has a good (COM) interface to the IE services built into Windows. When I need to use that data online, I have the local program upload the data to a custom PHP program that stores it in a DB. If this needs to be done on a schedule, you can set it up in the Windows scheduler, so as long as your machine is on it can be done hands-off and the data will be available for use online. It's not a great solution if it needs to be done on demand from multiple endpoints.

But local software saves the data once, or we have to run it again and again. I have a site which fetches data live from another site. I just want to let a page load for a couple of seconds and then fetch its data. file_get_contents() doesn't wait; it just fetches the data from the page, but I want to make it wait and fetch data after ajax is also loaded

and fetch data after ajax is also loaded

You can't (with PHP).

So what's the solution?

There isn't one.

You can only do this by building a desktop application that uses a web browser control.