Problem in Regular expressions ...?

Question

apanimesh061 0 Junior Poster

13 Years Ago

What should the regular expression be for this code?
<span style="text-align:justify;">Stepping out of the Shadows:M6 Once touched... Marisa Somerville has changed. Now a confi dent, groomed, successful businesswoman, in some ways she's nothing like the scared wife of an abusive husband that Rafe Peveril survived a plane crash with six years ago.<A HREF="">

Between <span> and <A HREF> there is a string that I want to extract!

Please help!

php regex

pritaeas 2,211 ¯\_(ツ)_/¯

13 Years Ago

preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF="">)/i', $subject, $result, PREG_PATTERN_ORDER);

pritaeas 2,211 ¯\_(ツ)_/¯

13 Years Ago

Does regex() exist?

Not to my knowledge. Read up on preg_match_all.

pritaeas 2,211 ¯\_(ツ)_/¯

13 Years Ago

See example #2

diafol

13 Years Ago

The 2nd example here: http://nl.php.net/manual/en/function.preg-match-all.php

veedeoo 474 Junior Poster

13 Years Ago

Hi,

change this

 $html= str_get_html($outhtml);

With this...

 $html = file_get_html($url);

The output should give you something. You just need to use the regex that Pritaeas has given you to sort out the jpg from gif, to eliminate the shopping cart images and the search.

Let me know what the outcome is.

diafol · Answer 1 · 2012-04-17T22:59:30+00:00

What do you need to capture?

apanimesh061 0 Junior Poster · Answer 2 · 2012-04-18T07:20:53+00:00

I need to retrieve those strings from a web page that contains about 10 of them, and store them in a database.
That is why I need a regular expression for this!

diafol · Answer 3 · 2012-04-18T07:47:49+00:00

Was about to offer this:

'/(<span.*>)(.*)(<A.*>)/i'

But Prit beat me to it. He's far better than me at regex - so his version is probably safest.

apanimesh061 0 Junior Poster · Answer 4 · 2012-04-18T08:08:12+00:00

Is regex better, or preg_match_all?

$url = "http://books.rediff.com/categories/fiction-genres/2180204";
$content = getPageContent($url);
$synopsis = regex("/(<span style="text-align:justify;">)(.*?)(<A HREF)/i");

How should I use preg_match_all() here?

diafol · Answer 5 · 2012-04-18T08:11:03+00:00

I've never seen that function. Does regex() exist?

apanimesh061 0 Junior Poster · Answer 6 · 2012-04-18T13:49:21+00:00

Oh! That was my fault!

<?php

    function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    }

    $url = "http://books.rediff.com/categories/fiction-genres/2180204";
    $content = get_url_contents($url);
    preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF="">)/i', $content, $result, PREG_PATTERN_ORDER);
    echo $result;
?>

I have this code.
How do I print all the matches found on the web page?

apanimesh061 0 Junior Poster · Answer 7 · 2012-04-18T14:11:09+00:00

Which example are you talking about?

apanimesh061 0 Junior Poster · Answer 8 · 2012-04-18T14:52:21+00:00

I edited that part, but I get the following errors:

Notice: Undefined offset: 0 in E:\wamp\www\crawl.php on line 19

Notice: Undefined offset: 1 in E:\wamp\www\crawl.php on line 20

Notice: Undefined offset: 0 in E:\wamp\www\crawl.php on line 19

Notice: Undefined offset: 1 in E:\wamp\www\crawl.php on line 20

Notice: Undefined offset: 0 in E:\wamp\www\crawl.php on line 19

Notice: Undefined offset: 1 in E:\wamp\www\crawl.php on line 20

Notice: Undefined offset: 0 in E:\wamp\www\crawl.php on line 19

Notice: Undefined offset: 1 in E:\wamp\www\crawl.php on line 20

What should be done?

apanimesh061 0 Junior Poster · Answer 9 · 2012-04-18T14:53:29+00:00

I added this to the code ...

.
.
.
foreach ($result as $val) {
        echo $val[0];
        echo $val[1];
    }
?>
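
The four pairs of notices above are consistent with preg_match_all() having matched nothing: with PREG_PATTERN_ORDER, $result then holds one empty array per group (the whole match plus three captures), so $val[0] and $val[1] do not exist in any of them. A minimal sketch, assuming a made-up $content string in place of the fetched page, that checks the match count first and uses PREG_SET_ORDER so that each loop item is one complete match:

```php
<?php
// Hypothetical stand-in for the fetched page: two descriptions, one per line.
$content = '<span style="text-align:justify;">First synopsis.<A HREF="a.html">more</A>' . "\n"
         . '<span style="text-align:justify;">Second synopsis.<A HREF="b.html">more</A>';

$pattern = '/(<span style="text-align:justify;">)(.*?)(<A HREF=)/i';

// PREG_SET_ORDER groups the result per match: $match[0] is the whole match,
// $match[2] is the text captured between the two tags.
$count = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$found = [];
if ($count === false || $count === 0) {
    echo "No matches; check the pattern against the page source.\n";
} else {
    foreach ($matches as $match) {
        $found[] = $match[2];
        echo $match[2], "\n";
    }
}
```

Checking the return value before looping avoids the undefined-offset notices when the pattern does not match the page.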

veedeoo 474 Junior Poster Featured Poster · Answer 10 · 2012-04-18T16:19:18+00:00

Hi,

What are you trying to accomplish? Parse the books with their descriptions and thumbnails?

apanimesh061 0 Junior Poster · Answer 11 · 2012-04-18T16:59:09+00:00

Exactly.
I mainly want to extract their titles and descriptions.

diafol · Answer 12 · 2012-04-18T17:41:54+00:00

Just do a print_r($result); for now to see if anything is returned. You may need to modify your regex.

apanimesh061 0 Junior Poster · Answer 13 · 2012-04-18T18:53:07+00:00

I changed the regex a bit:

<?php

    function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    }

    $url = "http://books.rediff.com/categories/fiction-genres/2180204";
    $content = get_url_contents($url);
    preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF=)/i', $content, $result, PREG_PATTERN_ORDER);
    //echo $result;
    /*foreach ($result as $val) {
        echo $val[0];
        echo $val[1];
    }*/
    print_r($result);
?>

But I got this result!
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

pritaeas 2,211 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 14 · 2012-04-18T19:54:12+00:00

preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF=)/is', $content, $result, PREG_PATTERN_ORDER);

Note the s modifier (dot includes newlines).
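
A minimal sketch of what the s modifier changes, assuming a made-up two-line $content (not taken from the target page). The same pattern is run with and without s:

```php
<?php
// Hypothetical snippet: the description spans two lines, as it may on a real page.
$content = "<span style=\"text-align:justify;\">Line one\nline two<A HREF=\"x.html\">";

$base = '(<span style="text-align:justify;">)(.*?)(<A HREF=)';

// Without s, "." stops at the newline, so the closing part is never reached.
$without = preg_match_all('/' . $base . '/i', $content, $r1, PREG_PATTERN_ORDER);

// With s (PCRE_DOTALL), "." also matches newlines.
$with = preg_match_all('/' . $base . '/is', $content, $r2, PREG_PATTERN_ORDER);

echo "without s: $without, with s: $with\n";
```

This would explain the empty arrays in the earlier print_r() output if the descriptions on the page contain line breaks.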

apanimesh061 0 Junior Poster · Answer 15 · 2012-04-18T20:13:12+00:00

I guess it works now!
But why is there so much
Array ( [0] => Array ( [0] =>
in between the results?

pritaeas 2,211 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 16 · 2012-04-18T21:02:55+00:00

Because you specified three capturing groups most likely.

diafol · Answer 17 · 2012-04-18T21:12:16+00:00

preg_match_all('/<span style="text-align:justify;">(.*?)<A HREF=/is', $txt, $result, PREG_PATTERN_ORDER);

should just give you two results:

$result[0][0] will be the whole thing
$result[1][0] will be the captured group - the stuff within the brackets
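
A minimal sketch of that result layout, assuming a made-up $txt holding two descriptions. The quantifier is written lazily as (.*?) here so that each description is matched on its own rather than as one long run:

```php
<?php
// Hypothetical page fragment with two descriptions.
$txt = '<span style="text-align:justify;">Alpha<A HREF="1">'
     . '<span style="text-align:justify;">Beta<A HREF="2">';

// Only the wanted text is captured; the tags are matched but not grouped.
preg_match_all('/<span style="text-align:justify;">(.*?)<A HREF=/is', $txt, $result, PREG_PATTERN_ORDER);

// $result[0] lists every whole match, $result[1] lists every captured group.
$descriptions = array_map('trim', $result[1]);
foreach ($descriptions as $i => $text) {
    echo ($i + 1), ': ', $text, "\n";
}
```

With a single capturing group, looping over $result[1] is enough to get each description without the surrounding tags.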

veedeoo 474 Junior Poster Featured Poster · Answer 18 · 2012-04-18T21:38:45+00:00

Hi,

I don't know if this will help, but I will post it here anyway for whatever it's worth. I could have written this much better, but I am too busy right now to experiment with a better approach.

Step One: Save this file as dom.php

Step Two: Copy and save the code below, then upload it to your server

<?php
include 'dom.php';

## your function as derived from your posted question on daniweb
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
## end of your function

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
$html = str_get_html($outhtml);

foreach ($html->find('table', 9) as $item) {
    ## this is general and pretty broad output
    echo $item;
}

//print_r($outhtml);

?>

The above script should give a pretty broad result, meaning shopping cart buttons and other stuff on the page may be present as artifacts. It might even have "Array Array" artifacts, but those can easily be fixed later on.

Step Three: Actually, this is no step at all, but rather a modified version of the above. Since the above is pretty general in how it parses the remote page, this one targets the 'a' tags of the page. It should be able to parse the title, thumbnail, and alt. I am really hoping it will... :)

<?php
## include the dom.php
include 'dom.php';
## define your target url
$url = 'http://books.rediff.com/categories/fiction-genres/2180204';

## your function as derived from your posted question on daniweb
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
## end of your function

$outhtml = get_url_contents($url);
$html = str_get_html($outhtml);

foreach ($html->find('a') as $link) {
    //echo $link;
    foreach ($link->find('img') as $value) {
        ## filter extension.. we know books thumbnails are more likely to be jpg
        echo '<img src="'.$value->src.'"/><br/>';
        echo $value->title."<br/>";
    }
}

//print_r($outhtml);

?>

Once again, you can try the second script I have provided above and see if it helps with what you are trying to achieve.

My only worry about the second script is that $link->find('img') is pretty general and will pick up anything with <img> tags.

If you ever encounter this problem, I strongly suggest taking advantage of what Ardav and Pritaeas already contributed to this thread to isolate only the jpg images.

apanimesh061 0 Junior Poster · Answer 19 · 2012-04-19T18:09:18+00:00

I tried this piece of code for extracting strings of this type:
<span style="font:normal 18px arial;color:#2E8BC3;">A Bride For The Island Prince</span>

<?php

    function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    }

    $url = "http://books.rediff.com/categories/fiction-genres/2180204";
    $content = get_url_contents($url);
    preg_match_all('/(<span style="text-align:justify;">)(.*?)(<A HREF=)/is', $content, $result, PREG_PATTERN_ORDER);
    preg_match_all('/(<span style="font:normal 18px arial;color:#2E8BC3;">)(.*?)(</span>)/is', $content, $result1, PREG_PATTERN_ORDER);
    foreach ($result1 as $val) {
        echo $val[0];
    }
    print_r($result);
?>

But I get the following error:
Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'p' in E:\wamp\www\Crawl\crawl.php on line 18
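
The warning is consistent with how PCRE delimiters work in PHP: the pattern is delimited by /, so the / inside </span> ends it early, and the remaining text "span>)/is" is read as modifiers; s is valid, but p is not. A minimal sketch, assuming a made-up $content, that avoids this either by escaping the slash or by choosing a delimiter that does not occur in the pattern:

```php
<?php
// Hypothetical fragment in the shape quoted in the post.
$content = '<span style="font:normal 18px arial;color:#2E8BC3;">A Bride For The Island Prince</span>';

// Option 1: keep "/" as the delimiter and escape the slash in the closing tag.
$escaped = '/<span style="font:normal 18px arial;color:#2E8BC3;">(.*?)<\/span>/is';

// Option 2: use a delimiter that does not occur in the pattern ("~" here;
// "#" would not work because it appears in the colour value).
$tilde = '~<span style="font:normal 18px arial;color:#2E8BC3;">(.*?)</span>~is';

preg_match_all($escaped, $content, $byEscaping, PREG_PATTERN_ORDER);
preg_match_all($tilde, $content, $byDelimiter, PREG_PATTERN_ORDER);

// Index 1 holds the captured titles.
foreach ($byDelimiter[1] as $title) {
    echo $title, "\n";
}
```

Both variants capture the same text; preg_quote() with the delimiter as its second argument is another way to escape literal parts of a pattern.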

apanimesh061 0 Junior Poster · Answer 20 · 2012-04-19T18:22:09+00:00

@veedeoo
Thank you very much for your help!
I tried your code, but I get:
Fatal error: Call to undefined function str_get_html() in E:\wamp\www\Crawl\crawl1.php on line 17
This shows up in both the scripts !!!
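
That fatal error means no function named str_get_html() was defined when the script ran, which would happen if dom.php was not found or does not contain the parser library; include only raises a warning for a missing file and then carries on. A minimal sketch of a guard, where check_parser() is a hypothetical helper and the location of dom.php next to the script is an assumption:

```php
<?php
// Hypothetical helper: reports why a library function is unavailable.
function check_parser($file, $function)
{
    if (!is_file($file)) {
        return "missing file: $file";
    }
    require_once $file;
    if (!function_exists($function)) {
        return "$file was loaded but does not define $function()";
    }
    return 'ok';
}

// Assumed location: dom.php sits next to this script.
$status = check_parser(__DIR__ . '/dom.php', 'str_get_html');
if ($status !== 'ok') {
    echo "Parser not available: $status\n";
}
```

Running this before the scraping code shows whether the problem is the file path or the file's contents.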

apanimesh061 0 Junior Poster · Answer 21 · 2012-04-19T19:37:02+00:00

<?php
include 'dom.php';

function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
$html= file_get_html($url); 

foreach($html->find('table',9)as $item) {
    echo $item;
}
//print_r($outhtml);

?>

This code still gives a similar fatal error!

veedeoo 474 Junior Poster Featured Poster · Answer 22 · 2012-04-19T20:09:41+00:00

Hi,

I just uploaded the above scripts on my server and it is working fine.

Script TWO, I modified it a little to eliminate the gif image from being picked up by the script.
Script TWO

Script ONE, this is the generalized parsed version, where you see lots of artifacts in it. I prefer version two of the script.
Script One

The inability of your localhost to execute cURL might have been due to your settings. Make sure the server can access the internet to get the $url.
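
One way to narrow down which setting is at fault is to have the fetch function report failures instead of silently returning false. A sketch under assumptions: fetch_page() is a hypothetical variant of get_url_contents() from this thread, and treating HTTP status 400 and above as an error is a choice made here, not something the thread specifies.

```php
<?php
// Hypothetical variant of get_url_contents() that reports why a fetch failed.
function fetch_page($url, $timeout = 5)
{
    if (!function_exists('curl_init')) {
        return ['body' => null, 'error' => 'cURL extension is not loaded'];
    }
    $crl = curl_init();
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($crl, CURLOPT_TIMEOUT, $timeout * 2);
    $body   = curl_exec($crl);
    $error  = curl_error($crl);
    $status = (int) curl_getinfo($crl, CURLINFO_HTTP_CODE);

    if ($body === false) {
        // Transport-level failure: DNS, connection refused, timeout, and so on.
        return ['body' => null, 'error' => $error !== '' ? $error : 'unknown cURL error'];
    }
    if ($status >= 400) {
        return ['body' => null, 'error' => "HTTP status $status"];
    }
    return ['body' => $body, 'error' => null];
}

// The reserved ".invalid" domain should never resolve, so this shows the error path.
$page = fetch_page('http://example.invalid/');
echo $page['error'] === null ? "fetched\n" : "fetch failed: {$page['error']}\n";
```

If the message says the extension is not loaded, the usual place to look is the extension=curl line in php.ini; other messages point to the network rather than the PHP configuration.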

apanimesh061 0 Junior Poster · Answer 23 · 2012-04-19T20:14:22+00:00

Could you please tell me what changes you made to the script?

What settings are you talking about? How can I change them?

veedeoo 474 Junior Poster Featured Poster · Answer 24 · 2012-04-19T20:31:05+00:00

Here is the modified version of script TWO. This will only work on a live server or on XAMPP (portable or full install). Again, it all depends on your Internet connection speed. Live server connections are extremely fast, which is why they are able to pull the entire HTML page of the target URL.

<?php
## include the dom.php
include 'dom.php';
## define your target url
$url = 'http://books.rediff.com/categories/fiction-genres/2180204';

## your function as derived from your posted question on daniweb
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
## end of your function

$outhtml = get_url_contents($url);
$html = str_get_html($outhtml);

foreach ($html->find('a') as $link) {
    //echo $link;
    foreach ($link->find('img') as $value) {
        ## filter extension.. we know books thumbnails are more likely to be jpg
        $ext = substr(strrchr($value->src, '.'), 1);

        ## we don't want gif images, we only want the book thumbnails.
        if ($ext != 'gif') {
            echo '<img src="'.$value->src.'"/><br/>';
            echo html_entity_decode($value->title)."<br/>";
        }
        //echo $ext."<br/>";
    }
}

//print_r($outhtml);

?>
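
The substr(strrchr(...)) filter above compares the raw text after the last dot, so an upper-case "GIF" or a URL with a query string would slip through. A sketch of a slightly more tolerant check, where image_extension() is a hypothetical helper and the sample sources are made up rather than taken from the target page:

```php
<?php
// Hypothetical helper: lower-cased file extension of a URL's path,
// ignoring any query string.
function image_extension($src)
{
    $path = parse_url($src, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

// Made-up sources standing in for $value->src.
$sources = ['covers/book1.jpg', 'img/cart.GIF', 'covers/book2.jpeg?size=small'];

// Keep only sources whose extension is on the allow-list.
$thumbnails = [];
foreach ($sources as $src) {
    if (in_array(image_extension($src), ['jpg', 'jpeg'], true)) {
        $thumbnails[] = $src;
    }
}
print_r($thumbnails);
```

Allow-listing jpg and jpeg, rather than excluding gif, also keeps out png buttons and other non-thumbnail images.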

If I get extra time later on, or my laziness about getting back to my project does not subside, I will modify this script to be able to send search queries to rediff.com. That's a lot easier than the project I am currently working on.
