Hi guys!

I've been using PHP for fun for a while, and now I'm interested in playing with some scraping. I figure regex is the way to go. I'm trying to scrape a page of 4chan and grab each image along with the title of the thread it belongs to.
So here's the URL I'm trying to scrape: http://boards.4chan.org/p/
It's a photography board. If you look at the page, you can see a bunch of threads, each with the title in blue, then the poster ("anonymous") in green.

Right now, this grabs all of the images:

preg_match_all('#<img[^>]*>#i',
    $html,
    $posts, // will contain the matches, one entry per <img> tag
    PREG_SET_ORDER // groups the results per match, so $posts[0] is the first match
);

But now I want to grab the title I described above. How in the world do I do that? The hard part is that not every picture has a title of its own: only the first post of a thread carries one, and the replies don't.

I want to be able to say:
$posts[0] to get the <img> tag
$posts[1] to grab the title of the thread the image is from

Thanks!

diafol

Make sure you have permission to do this.

Very true. I'm just doing this for development purposes... I want to see if I can get some data from them.

diafol

OK, no prob - just make sure your own site isn't live, or if it is, that it isn't publicly accessible - just in case. Obviously it's your call, but some companies can be really protective of their data.

For your issue, I think you need to use a DOM/XML parser.

This can be done server-side or client-side, although server-side would be better IMO. There are many ready-made classes available, and I think using one of them will be the easiest course of action. Rolling your own can be time-intensive and error-prone - it's easy to miss stuff. One such class:

http://simplehtmldom.sourceforge.net/

I have not used this one, but it ranks well in Google. If you're serious about this facility, putting a few different classes through their paces would be of benefit.
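
If you'd rather not pull in a third-party class, PHP's built-in DOMDocument and DOMXPath can do the same kind of job. Here is a rough sketch of the idea, not tested against the real page. The sample markup, the class names "thread" and "subject", and the function name extractPosts are all made up for illustration, so you would need to look at the actual HTML and adjust the XPath queries. It walks each thread, reads the title once, and attaches it to every image inside that thread, so replies get the title too.

```php
<?php
// Hypothetical sample markup: the real page will use different tags and classes.
$html = <<<HTML
<div class="thread">
  <span class="subject">Film scans</span>
  <img src="a.jpg" alt="">
  <div class="reply"><img src="b.jpg" alt=""></div>
</div>
<div class="thread">
  <img src="c.jpg" alt="">
</div>
HTML;

// Returns a list of [imgTag, threadTitle] pairs, one per image.
function extractPosts(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $posts = [];

    // Assumption: each thread sits in its own <div class="thread">.
    foreach ($xpath->query('//div[@class="thread"]') as $thread) {
        // Assumption: the title, if any, is in <span class="subject">.
        $subject = $xpath->query('.//span[@class="subject"]', $thread)->item(0);
        $title = $subject ? trim($subject->textContent) : '';

        // Every image in the thread inherits the thread's title.
        foreach ($xpath->query('.//img', $thread) as $img) {
            $posts[] = [$doc->saveHTML($img), $title];
        }
    }

    return $posts;
}

$posts = extractPosts($html);
```

With this layout, $posts[$i][0] is the <img> tag and $posts[$i][1] is the title of the thread it came from (an empty string when the thread has no title), which is close to the shape you get from PREG_SET_ORDER.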