How to write a regular expression in PHP?

Question

gunbuster363 0 Junior Poster

15 Years Ago

I want to use the function preg_match_all to capture data from a web page.
------------------------------------------------------------------------------------------
int preg_match_all ( string $pattern , string $subject , array &$matches [, int $flags [, int $offset ]] )
Searches subject for all matches to the regular expression given in pattern and puts them in matches in the order specified by flags .

After the first match is found, the subsequent searches are continued on from end of the last match.
------------------------------------------------------------------------------------------
For example, if I want to capture the text between an opening <p> tag and its closing </p> tag, how should I write the expression?


pritaeas 2,194 ¯\_(ツ)_/¯

15 Years Ago

<p.*>(.*)</p>
<p.*> matches the opening tag, (.*) is a capturing group for the content in between, and </p> matches the end tag.
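A minimal sketch of what the capturing group returns, assuming a made-up one-line sample string in $html (it is not taken from the page in the question). PHP's preg functions expect delimiters around the pattern, so ~ is used for that here.

```php
<?php
// Hypothetical sample input, made up for illustration.
$html = '<p class="intro">Hello world</p>';

// '~' delimits the pattern; (.*) is the capturing group.
if (preg_match('~<p.*>(.*)</p>~', $html, $m)) {
    echo $m[0], "\n"; // the whole match, tags included
    echo $m[1], "\n"; // only the text captured by (.*)
}
```

Index 0 always holds the full match, and each pair of parentheses adds one more index after it.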


pritaeas 2,194 ¯\_(ツ)_/¯

15 Years Ago

Try this. The page contains two <h2 class="center"> elements, which will be extracted and shown.

<?php
 
// Return a handle to a curl connection to the site you want to pull info from
//$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');
$ch = curl_init('http://www.kledingbank-limburg.nl/page/home');
 
// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source
 
// Open the connection with the options specified
$cr = curl_exec($ch);

// Run your regular expression against the source to pull what you want, you can use external programs to format the html for easier parsing if you want before you scan it.
preg_match_all("~<h2.*?>(.*?)</h2>~", $cr, $pm, PREG_SET_ORDER);
 
// Display the results again :D
foreach($pm as $pv) 
	echo $pv[1] . '<br/><br/>';
?>

chaines51 0 Junior Poster in Training

15 Years Ago

http://perldoc.perl.org/perlretut.html

This is a good overview of Perl regular expression syntax.

If you're looking for something that matches the format <p id="NUMBER">BLAH BLAH</p>,

then the regular expression would be:
"<p\s*id=\"(\d*)\"\s*>(.*)</p>"

A brief explanation of this:

\s matches any whitespace character, which is useful in case some HTML author put tabs in the middle of the tags. The * character means zero or more of the preceding character (or set of characters), which in the first case is the \s.
\" escapes the " so that it can appear inside a double-quoted string.
(\d*) the parentheses form a capturing group, which lets you access this ID later if you need to. \d matches any digit; if your IDs are more than just numbers, you can use the '.' character, which matches any character.
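As a rough illustration of the pieces described above, here is a sketch run against a made-up sample string (the id value, the tab, and the text are assumptions for the demo). The pattern is wrapped in ~ delimiters, which preg_match requires, and written in single quotes so the double quotes need no escaping.

```php
<?php
// Hypothetical sample input: a tab before "id" and a space before ">".
$html = "<p\tid=\"42\" >BLAH BLAH</p>";

// \s* absorbs the stray whitespace, (\d*) captures the numeric id,
// and (.*) captures the text between the tags.
$pattern = '~<p\s*id="(\d*)"\s*>(.*)</p>~';

if (preg_match($pattern, $html, $m)) {
    echo "id:   {$m[1]}\n";
    echo "text: {$m[2]}\n";
}
```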

pritaeas 2,194 ¯\_(ツ)_/¯

15 Years Ago

I normally try my regexes in RegexBuddy from JGSoft. You can download a trial. It helps you build and test your regex. Maybe it's best you start from there.

pritaeas 2,194 ¯\_(ツ)_/¯

15 Years Ago

Paste the line of code. Did you compare it with mine?


gunbuster363 0 Junior Poster · Answer 1 · 2009-11-10T21:21:40+00:00

It doesn't work, can you take a look at my code?

#!/usr/bin/php
<?php

// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');

// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source

// Open the connection with the options specified
$cr = curl_exec($ch);


// Run your regular expression against the source to pull what you want, you can use external programs to format the html for easier parsing if you want before you scan it.
preg_match_all("<p.*>(.*)</p>",$cr,$pm,PREG_SET_ORDER);

// So you can see what you found
print_r($pm);

// Display the results again :D
foreach($pm as $pv) echo $pv[1] . "\r\n";

?>

The result:
-----------------------------------------------------------------------------
d7010702@cslinux1:~/PHP> ./curl.php
X-Powered-By: PHP/4.1.0
Content-type: text/html

Warning: Unknown modifier '(' in /home/cs07/d7010702/PHP/curl.php on line 15 
Array
(
)
d7010702@cslinux1:~/PHP>
-----------------------------------------------------------------------------

gunbuster363 0 Junior Poster · Answer 2 · 2009-11-10T21:41:05+00:00

OK, I now know that a Perl-style regex needs a '/' at the beginning and at the end.
With that, the result is any string that matches,
but I only want the text between the opening and closing tags.
How should I write that?

gunbuster363 0 Junior Poster · Answer 3 · 2009-11-11T00:17:04+00:00

I found that I don't know how to display the array.
How is it structured, say with PREG_SET_ORDER?
I am pretty sure it is a multi-dimensional array,
but not all of its elements are useful to me.

gunbuster363 0 Junior Poster · Answer 4 · 2009-11-11T06:32:09+00:00

I tried your code and applied it to mine, and it returns a little more than I need:
it also returns , but I only want 
Can you teach me how to write one, instead of just telling me the answer?

gunbuster363 0 Junior Poster · Answer 5 · 2009-11-11T14:09:54+00:00

It is not working; it didn't return anything.
First, I tried your regex as is: Warning: Unknown modifier '(' in /home/cs07/d7010702/PHP/curl.php on line 17

Then I backslash-escaped the </p> so that it became <\/p>, with the same result: Warning: Unknown modifier '(' in /home/cs07/d7010702/PHP/curl.php on line 17

Then I delimited the whole regex: "/<p\s*id=\"(\d*)\"\s*>(.*)<\/p>/"
This didn't return anything:
X-Powered-By: PHP/4.1.0
Content-type: text/html

Array
(
)

The code is here:

#!/usr/bin/php
<?php

// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');

// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source

// open file
$myFile = "testFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");

// Open the connection with the options specified
$cr = curl_exec($ch);

preg_match_all("/<p\s*id=\"(\d*)\"\s*>(.*)<\/p>/", $cr, $pm, PREG_SET_ORDER);

print_r($pm);


fclose($fh);

?>

gunbuster363 0 Junior Poster · Answer 6 · 2009-11-11T15:39:22+00:00

I downloaded RegexBuddy and built a regex.
The regex passes the test case of I love s
but when I paste it into the program,
it turns out: Unknown modifier '(' in ........

help!!

gunbuster363 0 Junior Poster · Answer 7 · 2009-11-11T18:18:22+00:00

Here is the code:

#!/usr/bin/php
<?php

// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');

// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source

// open file
$myFile = "testFile.txt";
$fh = fopen($myFile, 'a') or die("can't open file");

// Open the connection with the options specified
$cr = curl_exec($ch);

preg_match_all("<p id=\"(\d*)\">(.*)<\/p>", $cr, $pm, PREG_SET_ORDER);

print_r($pm);


fclose($fh);

?>

pritaeas 2,194 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 8 · 2009-11-11T18:44:20+00:00

preg_match_all("~<p id=\"(\d*)\">(.*)<\/p>~", $cr, $pm, PREG_SET_ORDER);
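The reason for the "Unknown modifier '('" warning, sketched on a made-up sample string: PHP takes the first character of the pattern as its delimiter, and < pairs with > as a bracket-style delimiter. Without explicit delimiters the pattern therefore ends at the first >, and the ( that follows is read as a modifier. The sample input and variable names below are assumptions for the demo.

```php
<?php
// Hypothetical sample input, made up for illustration.
$html = '<p id="7">some text</p>';

// No explicit delimiters: '<' ... '>' is taken as the delimiter pair,
// so everything after the first '>' is parsed as modifiers.
// '@' hides the warning; the call returns false for an invalid pattern.
$bad = @preg_match('<p id="(\d*)">(.*)</p>', $html, $m1);
var_dump($bad);

// With '~' as the delimiter, the '/' in </p> needs no escaping.
$good = preg_match('~<p id="(\d*)">(.*)</p>~', $html, $m2);
var_dump($good);
print_r($m2);
```

Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter; ~ is convenient for HTML because it rarely occurs in markup.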

gunbuster363 0 Junior Poster · Answer 9 · 2009-11-11T18:52:02+00:00

preg_match_all("~<p id=\"(\d*)\">(.*)<\/p>~", $cr, $pm, PREG_SET_ORDER);

nothing returned.......

pritaeas 2,194 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 10 · 2009-11-11T19:11:53+00:00

<?php
// Return a handle to a curl connection to the site you want to pull info from
$ch = curl_init('http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html');
 
// Set some options for the connection
curl_setopt($ch,CURLOPT_HEADER,0); // Don't return header information, although, this can be handy ;)
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Give us the page source
 
// Open the connection with the options specified
$cr = curl_exec($ch);

preg_match_all('~<p id=".*?(\d*)">(.*?)</p>~', $cr, $pm, PREG_SET_ORDER);
 
print_r($pm); 
?>

You can see the output here: http://www.pritaeas.net/public/daniweb/regex2/
Then view source in your browser.
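Two details changed compared with the earlier attempts: the id may start with non-digit characters (hence the .*? before the digits), and the content group is lazy, (.*?) instead of (.*). A small sketch of the difference, using a made-up sample string with two paragraphs on one line (the id format is an assumption):

```php
<?php
// Hypothetical sample input: ids carry a text prefix before the digits.
$html = '<p id="review_1">first</p><p id="review_2">second</p>';

// Greedy (.*) runs on to the LAST </p>, so both paragraphs become one match.
preg_match_all('~<p id=".*?(\d*)">(.*)</p>~', $html, $greedy, PREG_SET_ORDER);

// Lazy (.*?) stops at the FIRST </p>, giving one match per paragraph.
preg_match_all('~<p id=".*?(\d*)">(.*?)</p>~', $html, $lazy, PREG_SET_ORDER);

echo count($greedy), "\n";
echo count($lazy), "\n";
echo $lazy[1][1], ' => ', $lazy[1][2], "\n";
```

By default the dot does not match a newline, so a paragraph that spans several lines will not match unless the s modifier is added after the closing delimiter.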

gunbuster363 0 Junior Poster · Answer 11 · 2009-11-11T19:36:40+00:00

IT WORKS!! Thank you very much!!
What is the structure of the array?
How can I access only the plain-text result (without the tags)?

gunbuster363 0 Junior Poster · Answer 12 · 2009-11-11T19:45:19+00:00

ok, I got it by doing this:

$i=0;
foreach($pm as $pv){
        print_r("\r\nThe ");
        print_r($i);
        print_r(" entry:\r\n");
        print_r($pv[2]);
        $i++;
}

pritaeas 2,194 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 13 · 2009-11-11T19:46:56+00:00

foreach ($pm as $match)
  echo $match[2] . '<hr/>';
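To show why index 2 is the one to read, here is a sketch of the array layout for both flag values, run on a made-up sample string (the input and variable names are assumptions). strip_tags() is used at the end to drop any markup left inside the captured text.

```php
<?php
// Hypothetical sample input, made up for illustration.
$html = '<p id="1">one <b>bold</b></p><p id="2">two</p>';
$pattern = '~<p id="(\d*)">(.*?)</p>~';

// PREG_SET_ORDER: one entry per match.
// Inside each entry, [0] is the whole match, [1] and [2] are the groups.
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);

// PREG_PATTERN_ORDER (the default): one entry per group,
// each holding that group's value from every match.
preg_match_all($pattern, $html, $cols, PREG_PATTERN_ORDER);

print_r($sets[0]); // whole match, id, inner text of the first paragraph
print_r($cols[2]); // inner text of every paragraph

foreach ($sets as $match) {
    echo strip_tags($match[2]), "\n"; // plain text only
}
```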