What should the regular expression be for this code?
<span style="text-align:justify;">Stepping out of the Shadows:M6 Once touched... Marisa Somerville has changed. Now a confi dent, groomed, successful businesswoman, in some ways she's nothing like the scared wife of an abusive husband that Rafe Peveril survived a plane crash with six years ago.<A HREF="">

I need to capture the string between <span ...> and <A HREF>.

Please help!


What do you need to capture?

I need to retrieve those strings from a web page that contains about 10 of them and store them in a database.
That is why I need a regular expression for it.

preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF="">)/i', $subject, $result, PREG_PATTERN_ORDER);

Was about to offer this:

'/(<span.*>)(.*)(<A.*>)/i'

But Prit beat me to it. He's far better than me at regex - so his version is probably safest.


I've never seen that function. Does regex() exist?

Does regex() exist?

Not to my knowledge. Read up on preg_match_all.

Oh! That was my fault!

<?php

    function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    }

    $url = "http://books.rediff.com/categories/fiction-genres/2180204";
    $content = get_url_contents($url);
    preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF="">)/i', $content, $result, PREG_PATTERN_ORDER);
    echo $result;
?>

I have this code.
How do I print all the matches found on the web page?

See example #2

Which example are you talking about ?

I edited that part,
but I get the following errors:

Notice: Undefined offset: 0 in E:\wamp\www\crawl.php on line 19

Notice: Undefined offset: 1 in E:\wamp\www\crawl.php on line 20

(The same two notices are repeated three more times.)

What should be done ?

I added this to the code ...

.
.
.
foreach ($result as $val) {
        echo $val[0];
        echo $val[1];
    }
?>
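
A likely cause of the "Undefined offset" notices: with PREG_PATTERN_ORDER, $result holds one array per capturing group ($result[0] for the full matches, $result[1] for the first group, and so on). When the pattern matches nothing, every one of those arrays is empty, so $val[0] and $val[1] do not exist, and the notices appear once for each of the four entries in $result. The sketch below is only an illustration under assumptions: $sample is a hypothetical stand-in for the fetched page, and the pattern is simplified to a single capturing group.

```php
<?php
// Hypothetical sample standing in for the fetched page.
$sample = '<span style="text-align:justify;">First blurb<A HREF="">'
        . '<span style="text-align:justify;">Second blurb<A HREF="">';

// One capturing group: the text between the opening span and the link.
$pattern = '/<span style="text-align:justify;">(.*?)<A HREF=/is';

// preg_match_all() returns the number of full matches (0 if there are none).
$count = preg_match_all($pattern, $sample, $result, PREG_PATTERN_ORDER);

$descriptions = [];
if ($count) {
    // $result[1] lists the text captured by the first group, one entry per match.
    foreach ($result[1] as $text) {
        $descriptions[] = trim($text);
    }
}

foreach ($descriptions as $text) {
    echo $text, PHP_EOL;
}
```

Checking the return value before indexing into $result avoids the notices when nothing matches.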

Hi,

What are you trying to accomplish? Parsing the books with their descriptions and thumbnails?

Exactly.
I mainly want to extract their titles and descriptions.


Just do a print_r($result); for now to see whether anything is returned. You may need to modify your regex.

I changed the regex a bit ....

<?php

    function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    }

    $url = "http://books.rediff.com/categories/fiction-genres/2180204";
    $content = get_url_contents($url);
    preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF=)/i', $content, $result, PREG_PATTERN_ORDER);
    //echo $result;
    /*foreach ($result as $val) {
        echo $val[0];
        echo $val[1];
    }*/
    print_r($result);
?>

But I got this result!
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF=)/is', $content, $result, PREG_PATTERN_ORDER);

Note the s modifier (dot includes newlines).
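
To illustrate why the s modifier matters here: by default "." does not match a newline, so a description that is split over several lines in the page source never matches. A minimal sketch, assuming a hypothetical two-line sample in $html rather than the real page:

```php
<?php
// Hypothetical markup where the description spans two lines.
$html = "<span style=\"text-align:justify;\">Line one\nline two<A HREF=\"\">";

// Same pattern twice; only the s (dot-all) modifier differs.
$withoutS = preg_match_all('/<span style="text-align:justify;">(.*?)<A HREF=/i', $html, $m1);
$withS    = preg_match_all('/<span style="text-align:justify;">(.*?)<A HREF=/is', $html, $m2);

var_dump($withoutS); // no match: "." stops at the newline
var_dump($withS);    // one match: "." now also matches "\n"
```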

I guess it works now!
But why is there so much
Array ( [0] => Array ( [0] =>
in the result?

Because you specified three capturing groups most likely.

preg_match_all('/<span style="text-align:justify;">(.*?)<A HREF=/is', $txt, $result, PREG_PATTERN_ORDER);

should just give you two results:

$result[0][0] will be the whole thing
$result[1][0] will be the captured group - the stuff within the brackets
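
As a rough illustration of the result layout with a single capturing group, the sketch below runs the same pattern with both ordering flags. The sample string in $txt is hypothetical. PREG_SET_ORDER is an alternative flag that groups the output per match instead of per capturing group, which can be easier to loop over.

```php
<?php
// Hypothetical sample with two descriptions.
$txt = '<span style="text-align:justify;">Blurb A<A HREF="">'
     . '<span style="text-align:justify;">Blurb B<A HREF="">';

$pattern = '/<span style="text-align:justify;">(.*?)<A HREF=/is';

// Grouped by capturing group: $byGroup[0] = full matches, $byGroup[1] = captured text.
preg_match_all($pattern, $txt, $byGroup, PREG_PATTERN_ORDER);

// Grouped by match: $byMatch[0] = [full match, captured text] for the first match.
preg_match_all($pattern, $txt, $byMatch, PREG_SET_ORDER);

foreach ($byMatch as $match) {
    echo $match[1], PHP_EOL; // only the captured description
}
```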

Hi,

I don't know if this will help, but I will post it here anyway, for whatever it's worth. I could have written this much better, but I am too busy right now to experiment with a better approach.

Step One: Save this file as dom.php

Step Two: Copy the code below, save it, and upload it to your server

<?php
include 'dom.php';

## your function as derived from your posted question on daniweb
function get_url_contents($url){
$crl = curl_init();
$timeout = 5;
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
## end of your function




$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
$html= str_get_html($outhtml); 

## find('table', 9) returns a single element (the tenth table), or null if there is none
$item = $html->find('table', 9);

if ($item) {
    ## this is general and pretty broad output
    echo $item;
}


//print_r($outhtml);

?>

The above script should give a pretty broad result, meaning that shopping cart buttons and other stuff on the page may be present as artifacts. It might even have "Array Array" artifacts, but those can easily be fixed later on.

Step Three: Actually, this is not a step at all, but rather a modified version of the approach above. Since the script above is pretty general in how it parses the remote page, this one targets the 'a' tags of the page. It should be able to parse the title, thumbnail, and alt text. I am really hoping it will... :)

<?php
## include the dom.php
include 'dom.php';
## define your target url
$url = 'http://books.rediff.com/categories/fiction-genres/2180204';

## your function as derived from your posted question on daniweb
function get_url_contents($url){
$crl = curl_init();
$timeout = 5;
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
## end of your function




    $outhtml = get_url_contents($url);
    $html= str_get_html($outhtml); 


foreach($html->find('a')as $link){

    //echo $link;
    foreach($link->find('img')as $value){
     ## filter extension.. we know books thumbnails are more likely to be jpg

     echo '<img src="'.$value->src.'"/><br/>';
        echo $value->title."<br/>";

    }
}



//print_r($outhtml);

?>

Once again, you can try the second script I have provided above and see if it helps with what you are trying to achieve.

My only worry about the second script is that $link->find('img') is pretty general and will pick up anything with <img> tags.

If you encounter this problem, I strongly suggest taking advantage of what Ardav and Pritaeas have already contributed to this thread to isolate only the jpg images.

I tried this piece of code
for extracting strings of this type:
<span style="font:normal 18px arial;color:#2E8BC3;">A Bride For The Island Prince</span>

<?php

    function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    }

    $url = "http://books.rediff.com/categories/fiction-genres/2180204";
    $content = get_url_contents($url);
    preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF=)/is', $content, $result, PREG_PATTERN_ORDER);
    preg_match_all('/(<span style="font:normal 18px arial;color:#2E8BC3;">)(.*?)(</span>)/is', $content, $result1, PREG_PATTERN_ORDER);
    foreach ($result1 as $val) {
        echo $val[0];
    }
    print_r($result);
?>

But I get the following error .....
Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'p' in E:\wamp\www\Crawl\crawl.php on line 18
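
A probable explanation for this warning: the pattern uses "/" as its delimiter, so the unescaped "/" inside </span> ends the pattern early. PHP then reads the rest ("span>)/is") as modifiers; "s" is a valid modifier, "p" is not, hence "Unknown modifier 'p'". Two common fixes are escaping the slash (<\/span>) or choosing another delimiter. The sketch below uses "~", since "#" also occurs in the pattern; $content is a hypothetical sample, not the fetched page.

```php
<?php
// Hypothetical sample of the title markup quoted above.
$content = '<span style="font:normal 18px arial;color:#2E8BC3;">'
         . 'A Bride For The Island Prince</span>';

// "~" as delimiter: neither "/" nor "#" inside the pattern needs escaping.
$pattern = '~<span style="font:normal 18px arial;color:#2E8BC3;">(.*?)</span>~is';

preg_match_all($pattern, $content, $result1, PREG_PATTERN_ORDER);

// $result1[1] holds the captured titles, one per match.
foreach ($result1[1] as $title) {
    echo $title, PHP_EOL;
}
```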

@veedeoo
Thank you very much for your help !!!
I tried your code, but I get:
Fatal error: Call to undefined function str_get_html() in E:\wamp\www\Crawl\crawl1.php on line 17
This shows up in both scripts!

Hi,

change this

 $html= str_get_html($outhtml); 

With this...

 $html = file_get_html($url);

The output should give you something. You just need to use the regex that Pritaeas has given you to separate the jpg images from the gif ones, which eliminates the shopping cart and search images.

Let me know... what is the outcome though...

<?php
include 'dom.php';

function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
$html= file_get_html($url); 

foreach($html->find('table',9)as $item) {
    echo $item;
}
//print_r($outhtml);

?>

This code still gives a similar fatal error!
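
One possible reason for an "undefined function" error is that dom.php was not found, or does not actually contain the parser library (str_get_html() and file_get_html() look like the functions provided by PHP Simple HTML DOM Parser, which is presumably what dom.php was meant to be). The following is only a diagnostic sketch under that assumption; check_dom_library() is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical diagnostic: report whether dom.php really provides the parser functions.
function check_dom_library(string $path): string
{
    if (!is_file($path)) {
        return "missing: $path was not found";
    }

    // require_once stops with a fatal error if the file cannot be loaded,
    // whereas include only raises a warning and carries on.
    require_once $path;

    foreach (['str_get_html', 'file_get_html'] as $fn) {
        if (!function_exists($fn)) {
            return "incomplete: $path does not define $fn()";
        }
    }
    return 'ok';
}

echo check_dom_library(__DIR__ . '/dom.php'), PHP_EOL;
```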

Hi,

I just uploaded the above scripts on my server and it is working fine.

Script TWO: I modified it a little so that the script no longer picks up the gif images.
Script TWO

Script ONE: this is the generalized parsed version, where you will see lots of artifacts. I prefer version two of the script.
Script One

The inability of your localhost to execute cURL might have been due to your settings. Make sure the server can access the internet to get the $url.
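
To find out whether cURL is the problem on a local install, it can help to make the fetch function report why it failed. This is a sketch only: get_url_contents_checked() is a hypothetical variant of the get_url_contents() function from this thread, and the error messages are illustrative.

```php
<?php
// Hypothetical variant of get_url_contents() that reports why a fetch failed.
// Returns the page body on success, or null with a reason in $error.
function get_url_contents_checked(string $url, ?string &$error = null): ?string
{
    $error = null;

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        $error = 'not a valid URL';
        return null;
    }

    // On WAMP/XAMPP the cURL extension has to be enabled in php.ini.
    if (!extension_loaded('curl')) {
        $error = 'the cURL extension is not enabled';
        return null;
    }

    $crl = curl_init();
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($crl, CURLOPT_TIMEOUT, 15);

    $ret = curl_exec($crl);
    if ($ret === false) {
        // curl_error() describes the failure, e.g. a DNS or connection problem.
        $error = curl_error($crl);
        return null;
    }
    return $ret;
}
```

Calling get_url_contents_checked($url, $error) and printing $error whenever the result is null should show whether the extension is missing or the server simply cannot reach the target site.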

Could you please tell me what changes you made to the script?

Which settings are you talking about? How can I change them?

Here is the modified version of script TWO. It will only work on a live server or on a XAMPP portable or full install. Again, it all depends on your Internet connection speed. Live server connections are extremely fast, which is why the script is able to pull the entire HTML page of the target URL.

<?php
## include the dom.php
include 'dom.php';
## define your target url
$url = 'http://books.rediff.com/categories/fiction-genres/2180204';

## your function as derived from your posted question on daniweb
function get_url_contents($url){
$crl = curl_init();
$timeout = 5;
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
## end of your function




    $outhtml = get_url_contents($url);
    $html= str_get_html($outhtml); 


foreach($html->find('a')as $link){

    //echo $link;
    foreach($link->find('img')as $value){
     ## filter extension.. we know books thumbnails are more likely to be jpg
     $ext = substr(strrchr($value->src, '.'), 1);

     ## we don't want gif images, we only want the book thumbnails.
     if($ext != 'gif'){
     echo '<img src="'.$value->src.'"/><br/>';
        echo html_entity_decode($value->title)."<br/>";

        }
        //echo $ext."<br/>";

    }
}



//print_r($outhtml);

?>

If I get extra time later on, or my laziness about getting back to my own project does not subside, I will modify this script so that it can send a search query to rediff.com. That is much easier than the project I am currently working on.
