
I want to be able to fetch the HTML of many different sites and parse it to pull out the article text. It can't be specific to one site, since I want to use it for many sites. What would be the best way of doing this?


First, make sure you have permission to use data from these sites.
Once that's cleared up you have a number of options.
Those sites may offer an API/XML/RSS/REST interface for extracting data. If so, use that - it will be more reliable than trying to scrape data from a page.
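For instance, reading an RSS 2.0 feed with PHP's built-in SimpleXML looks roughly like this (the feed URL is just a placeholder - use one you actually have permission to consume):

    <?php
    // Rough sketch: pull title/link/description from an RSS 2.0 feed.
    $feedUrl = 'https://example.com/feed.xml';
    $rss = simplexml_load_file($feedUrl);
    if ($rss === false) {
        die('Could not load the feed.');
    }
    foreach ($rss->channel->item as $item) {
        echo htmlspecialchars((string) $item->title) . "\n";
        echo htmlspecialchars((string) $item->link) . "\n";
        echo htmlspecialchars((string) $item->description) . "\n\n";
    }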
Otherwise, you'll be looking at file_get_contents or cURL. BUT beware - this is a potential security issue unless you lock down and sanitise all the data you pull in before displaying or storing it.
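A rough, site-agnostic sketch of that approach is below. The URL and the "grab every <p> inside <article>" heuristic are only assumptions - real pages vary a lot, which is exactly why a feed or API is preferable:

    <?php
    // Fetch the raw HTML with cURL.
    $ch = curl_init('https://example.com/some-article');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        die('Request failed.');
    }

    // Parse it and pull text from <article>, falling back to all <p> tags.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//article//p');
    if ($nodes->length === 0) {
        $nodes = $xpath->query('//p');
    }
    $text = '';
    foreach ($nodes as $p) {
        $text .= trim($p->textContent) . "\n";
    }
    echo htmlspecialchars($text);       // escape before output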
Extracting data from remote sites on every page load will slow your page down significantly, so think about how many requests you really NEED to make (caching helps - see the sketch below). Also, regarding images: even if a site gives you permission to use its data, it may not allow you to use its images, since the licensing rights to those may not be held by the site owner. Take care and check the small print for syndication. Some RSS feeds that I'm allowed to reproduce on my site stipulate that only a certain number of articles (and then only the abstracts) may be reproduced.
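A very simple way to cut down on remote requests is to cache the fetched data in a local file and only refetch after it goes stale - something along these lines (the file path and one-hour lifetime are just example values):

    <?php
    // Crude file cache: refetch the remote data at most once per hour.
    function fetchWithCache($url, $cacheFile, $maxAgeSeconds = 3600) {
        if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
            return file_get_contents($cacheFile);
        }
        $data = file_get_contents($url);   // or swap in the cURL version above
        if ($data !== false) {
            file_put_contents($cacheFile, $data);
        }
        return $data;
    }

    $xml = fetchWithCache('https://example.com/feed.xml', __DIR__ . '/feed-cache.xml');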



I get the title of the article along with a small description from an RSS feed, but it does not contain the full article.


OK. Still, check that you're allowed to reproduce it on your site. The right to private consumption of an RSS feed isn't the same as the right to display it on your site.
