I want to use the function preg_match_all to capture data from a web page.
------------------------------------------------------------------------------------------
int preg_match_all ( string $pattern , string $subject , array &$matches [, int $flags [, int $offset ]] )
Searches subject for all matches to the regular expression given in pattern and puts them in matches in the order specified by flags.

After the first match is found, the subsequent searches are continued on from end of the last match.
------------------------------------------------------------------------------------------
For example, if I want to capture the text between
<p id="review_45616996">
and
</p>

how should I write the expression?

<p.*>(.*)</p>

<p.*> matches the opening tag
(.*) is a capturing group for the enclosed content
</p> matches the end tag

It doesn't work, can you take a look at my code?

#!/usr/bin/php
<?php

// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');

// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source

// Open the connection with the options specified
$cr = curl_exec($ch);


// Run your regular expression against the source to pull what you want, you can use external programs to format the html for easier parsing if you want before you scan it.
preg_match_all("<p.*>(.*)</p>",$cr,$pm,PREG_SET_ORDER);

// So you can see what you found
print_r($pm);

// Display the results again :D
foreach($pm as $pv) echo $pv[1] . "\r\n";

?>

The result:
-----------------------------------------------------------------------------
d7010702@cslinux1:~/PHP> ./curl.php
X-Powered-By: PHP/4.1.0
Content-type: text/html

<br>
<b>Warning</b>: Unknown modifier '(' in <b>/home/cs07/d7010702/PHP/curl.php</b> on line <b>15</b><br>
Array
(
)
d7010702@cslinux1:~/PHP>
-----------------------------------------------------------------------------

OK, I know Perl-style regexes need a delimiter such as '/' at the beginning and the end.
With that, the result is every string matching <p>*</p>,
but I only want whatever is between <p id="review_[Any Number]"> and </p>.
How should I write that?
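For readers hitting the same "Unknown modifier '('" warning, here is a minimal sketch of what seems to be going on. PHP's preg functions expect the pattern to be wrapped in delimiters, and a leading "<" is treated as a bracket-style delimiter that closes at the first ">". Everything after that is read as modifiers, and "(" is not one. The sample string $html is a hypothetical stand-in for the downloaded page.

```php
<?php
// Hypothetical sample input standing in for the downloaded page.
$html = '<p id="review_45616996">Great hotel</p>';

// No explicit delimiters: "<" and the first ">" are taken as the delimiter
// pair, so "(.*)</p>" is parsed as a list of modifiers and compilation fails.
// The "@" only hides the warning so the return value can be inspected.
$bad = @preg_match_all('<p.*>(.*)</p>', $html, $m1, PREG_SET_ORDER);
var_dump($bad); // bool(false)

// Same expression wrapped in "~" delimiters; the "/" in "</p>" then needs
// no escaping.
$good = preg_match_all('~<p.*>(.*)</p>~', $html, $m2, PREG_SET_ORDER);
echo $good, "\n";      // 1
echo $m2[0][1], "\n";  // Great hotel
```

Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter; "~" or "#" is often more convenient than "/" when the pattern contains closing HTML tags.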

I also found that I don't know how to display the array...
How is it structured, say with PREG_SET_ORDER?
I am pretty sure it is a multi-dimensional array,
but not all of its elements are useful to me...
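A small sketch of the array layout may help here. It compares PREG_SET_ORDER with the default PREG_PATTERN_ORDER on a hypothetical two-paragraph string; the sample data and the variable names $set and $byGroup are assumptions made for illustration.

```php
<?php
// Hypothetical sample input; the real page is not reproduced here.
$html = '<p id="review_1">First</p><p id="review_2">Second</p>';
$pattern = '~<p id="review_(\d+)">(.*?)</p>~';

// PREG_SET_ORDER: one sub-array per match.
// Inside each: [0] => full matched text, [1] => group 1, [2] => group 2.
preg_match_all($pattern, $html, $set, PREG_SET_ORDER);
echo $set[0][0], "\n"; // <p id="review_1">First</p>
echo $set[0][1], "\n"; // 1
echo $set[1][2], "\n"; // Second

// PREG_PATTERN_ORDER (the default): one sub-array per group.
// $byGroup[0] holds all full matches, $byGroup[2] holds group 2 of every match.
preg_match_all($pattern, $html, $byGroup, PREG_PATTERN_ORDER);
print_r($byGroup[2]); // the two captured texts, First and Second
```

With PREG_SET_ORDER a foreach over the result visits one match at a time, which is usually the easier shape when each match has several groups that belong together.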

Try this. The page contains two <h2 class="center"> elements, which will be extracted and shown.

<pre>
<?php
 
// Return a handle to a curl connection to the site you want to pull info from
//$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');
$ch = curl_init('http://www.kledingbank-limburg.nl/page/home');
 
// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source
 
// Open the connection with the options specified
$cr = curl_exec($ch);

// Run your regular expression against the source to pull what you want, you can use external programs to format the html for easier parsing if you want before you scan it.
preg_match_all("~<h2.*?>(.*?)</h2>~", $cr, $pm, PREG_SET_ORDER);
 
// Display the results again :D
foreach($pm as $pv) 
	echo $pv[1] . '<br/><br/>';
?>
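The code above uses ".*?" rather than ".*". As a rough illustration of why that matters, the sketch below runs both forms against a hypothetical single-line string with two headings; the sample data is an assumption, not taken from the real page.

```php
<?php
// Hypothetical two-heading sample on a single line.
$html = '<h2 class="center">One</h2><h2 class="center">Two</h2>';

// Greedy: the first ".*" runs as far as it can, so the whole line becomes a
// single match and only the last heading ends up in the capture group.
preg_match_all('~<h2.*>(.*)</h2>~', $html, $greedy, PREG_SET_ORDER);
echo count($greedy), ' ', $greedy[0][1], "\n"; // 1 Two

// Lazy: ".*?" stops at the first place the rest of the pattern can match,
// so each heading is found separately.
preg_match_all('~<h2.*?>(.*?)</h2>~', $html, $lazy, PREG_SET_ORDER);
echo count($lazy), ' ', $lazy[0][1], ' ', $lazy[1][1], "\n"; // 2 One Two
```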

I tried your code and applied it to mine, and it returns a little more than I need:
it also returns plain <p>, but I only want <p id="XXXXX">.
Can you teach me how to write one, instead of telling me the answer?

http://perldoc.perl.org/perlretut.html is a good overview of Perl regular expression syntax.

If you're looking for something that matches the format <p id="DIGITS HERE"> BLAH BLAH </p>,

Then the regular expression would be:
"<p\s*id=\"(\d*)\"\s*>(.*)</p>"

A brief explanation of this:

\s means any whitespace character, which is useful in case the HTML author put tabs or extra spaces inside the tag. The * character means zero or more of the preceding character (or set of characters), which in the first case is the \s.
\" escapes the double quote inside the double-quoted PHP string.
(\d*) The parentheses form a capturing group, which lets you access the ID later if you need it. \d matches any digit; if your IDs contain more than digits, you can use the '.' character, which matches any character except a newline.

It is not working... it didn't return anything.
First, I just tried your regex: <b>Warning</b>: Unknown modifier '(' in <b>/home/cs07/d7010702/PHP/curl.php</b> on line <b>17</b><br>

Then I backslashed the </p>, so it became <\/p>; same result: <b>Warning</b>: Unknown modifier '(' in <b>/home/cs07/d7010702/PHP/curl.php</b> on line <b>17</b><br>

Then I delimited the whole regex: "/<p\s*id=\"(\d*)\"\s*>(.*)<\/p>/"
It didn't return anything:
X-Powered-By: PHP/4.1.0
Content-type: text/html

Array
(
)

The code is here:

#!/usr/bin/php
<?php

// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');

// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source

// open file
$myFile = "testFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");

// Open the connection with the options specified
$cr = curl_exec($ch);

preg_match_all("/<p\s*id=\"(\d*)\"\s*>(.*)<\/p>/", $cr, $pm, PREG_SET_ORDER);

print_r($pm);


fclose($fh);

?>
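One likely reason the delimited pattern above compiles but finds nothing: the id quoted in the original question is "review_45616996", and (\d*) placed directly after id=" cannot get past the "review_" prefix. A second possible cause, which is only an assumption about the real page, is review text that spans several lines, since "." does not match a newline unless the "s" modifier is set. The sketch below shows both effects on a hypothetical sample string.

```php
<?php
// Hypothetical sample modelled on the id format quoted in the question;
// the review text spans two lines on purpose.
$html = "<p id=\"review_45616996\">Great hotel,\nwould stay again.</p>";

// \d* directly after id=" cannot match the "review_" prefix: zero matches.
$strict = preg_match_all('/<p\s*id="(\d*)"\s*>(.*)<\/p>/', $html, $m1, PREG_SET_ORDER);
echo $strict, "\n"; // 0

// Spell out the prefix, and add the "s" modifier so "." also matches newlines.
$fixed = preg_match_all('~<p\s+id="review_(\d+)"\s*>(.*?)</p>~s', $html, $m2, PREG_SET_ORDER);
echo $fixed, "\n";     // 1
echo $m2[0][1], "\n";  // 45616996
```

Single-quoting the PHP string also avoids having to escape every double quote inside the pattern.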

I normally try my regexes in RegexBuddy from JGSoft. You can download a trial. It helps you build and test your regex. Maybe it's best you start from there.

I downloaded RegexBuddy and made a regex.
The regex passes the test case <p id="23534534">I love s</p>,
but when I paste it into the program,
it turns out: Unknown modifier '(' in ........

Help!!

Paste the line of code. Did you compare it with mine?

Here is the code:

#!/usr/bin/php
<?php

// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');

// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source

// open file
$myFile = "testFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");

// Open the connection with the options specified
$cr = curl_exec($ch);

preg_match_all("<p id=\"(\d*)\">(.*)<\/p>", $cr, $pm, PREG_SET_ORDER);

print_r($pm);


fclose($fh);

?>
preg_match_all("~<p id=\"(\d*)\">(.*)<\/p>~", $cr, $pm, PREG_SET_ORDER);

nothing returned.......

<?php
// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');
 
// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source
 
// Open the connection with the options specified
$cr = curl_exec($ch);

preg_match_all('~<p id=".*?(\d*)">(.*?)</p>~', $cr, $pm, PREG_SET_ORDER);
 
print_r($pm); 
?>

You can see the output here: http://www.pritaeas.net/public/daniweb/regex2/
Then view source in your browser.

IT WORKS!! Thanks a lot!!
And what is the structure of the array?
How can I access only the plain-text result (without the tags)?

ok, I got it by doing this:

$i=0;
foreach($pm as $pv){
        print_r("\r\nThe ");
        print_r($i);
        print_r(" entry:\r\n");
        print_r($pv[2]);
        $i++;
}
foreach ($pm as $match)
  echo $match[2] . '<hr/>';
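
If the captured text itself still contains markup such as <br/> or <b>, one possible way to reduce it to plain text is sketched below. The $pm array here is hypothetical sample data in the same PREG_SET_ORDER shape as above, and the clean-up steps are an assumption about what "plain text" should mean for this page.

```php
<?php
// Hypothetical matches in the PREG_SET_ORDER shape used in this thread.
$pm = [
    ['<p id="review_1">Nice &amp; quiet<br/>room</p>', '1', 'Nice &amp; quiet<br/>room'],
    ['<p id="review_2">Great <b>view</b></p>', '2', 'Great <b>view</b>'],
];

$texts = [];
foreach ($pm as $i => $match) {
    // Turn <br> into a space first so the words on either side stay separate.
    $text = preg_replace('~<br\s*/?>~i', ' ', $match[2]);
    // strip_tags() removes the remaining markup; html_entity_decode()
    // converts entities such as &amp; back to plain characters.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    $texts[] = trim($text);
    echo "Entry $i: ", $texts[$i], "\n";
}
```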