
Parsing data with Regex

Hi guys!

I've been using PHP for fun for a while, and now I'm interested in playing with some scraping. I know regex is the way to go. So I'm trying to scrape a page on 4chan. I want to grab the images and the title of the thread each image belongs to.
So here's the URL I'm trying to scrape: http://boards.4chan.org/p/
It's a photography board, and the idea is to grab all of the images and know the title of the thread each one came from. If you look at the page, you can see a bunch of threads, each with the title in blue followed by the poster ("anonymous") in green.

Right now, this grabs all of the images:

preg_match_all('#<img[^>]*>#i',
    $html,
    $posts, // receives the matches: $posts[$i][0] is the i-th <img> tag
    PREG_SET_ORDER // one sub-array per match, instead of one per capture group
);

But now I want to grab the title I described above. How in the world do I do that? The hard part is that an image does not necessarily have a title next to it, because only the first post of a thread carries the title.

I want to be able to say:
$posts[0] to get the <img> tag
$posts[1] to grab the title of the thread the image is from

Thanks!

2 contributors, 3 replies, discussion span 9 hours, last updated 8 months ago, 4 views
Ghost
Posting Whiz
354 posts since Aug 2004
Reputation Points: 12
Solved Threads: 2
Skill Endorsements: 0

Make sure you have permission to do this.

diafol
Keep Smiling
Moderator
10,672 posts since Oct 2006
Reputation Points: 1,632
Solved Threads: 1,514
Skill Endorsements: 57

Very true. I'm just doing this for development purposes... I want to see if I can get some data from them.

Ghost

OK, no prob - just make sure your site isn't live or, if it is, that it isn't publicly accessible - just in case. Obviously it's your call, but some companies can be really protective of their data.

For your issue, I think you need to use a DOM/XML parser.

This can be done server-side or client-side, although server-side would be better IMO. There are many such ready-made classes available, and I do think that using one of these will be the easiest course of action. Rolling your own can be time-intensive and fraught with errors - easy to miss stuff. One such class:

http://simplehtmldom.sourceforge.net/

I have not used this one, but it ranks well in Google. If you're serious about this facility, putting a few different classes through their paces would be of benefit.
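
To give an idea of what the DOM approach looks like, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of the class linked above. The sample markup and the class names "thread" and "subject" are made up for illustration; the real page will use its own structure, so the XPath expressions would need adjusting after inspecting its source. It builds an array where each entry holds the <img> tag and the title of the thread it sits in (or null when the thread has no title).

```php
<?php
// Hypothetical markup: the class names "thread" and "subject" are assumptions
// made for this sketch, not the real structure of any particular site.
$html = <<<HTML
<div class="thread">
  <span class="subject">Sunset shots</span>
  <img src="a.jpg">
  <div class="reply"><img src="b.jpg"></div>
</div>
<div class="thread">
  <img src="c.jpg">
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // scraped HTML is rarely valid; keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$posts = array();

foreach ($xpath->query('//div[@class="thread"]') as $thread) {
    // Look up the thread title once per thread container; it may be missing.
    $subject = $xpath->query('.//span[@class="subject"]', $thread)->item(0);
    $title = $subject ? trim($subject->textContent) : null;

    // Every image inside the thread, including those in replies, gets that title.
    foreach ($xpath->query('.//img', $thread) as $img) {
        $posts[] = array($doc->saveHTML($img), $title);
    }
}

print_r($posts);
```

With this layout, $posts[$i][0] is the <img> tag and $posts[$i][1] is the thread title. Because the title is resolved from the enclosing thread element rather than from the text next to the image, images in replies still get the right title, which is the part that is awkward to express with a single regex.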

diafol

© 2013 DaniWeb® LLC