
I want to be able to fetch the HTML of many different sites and parse it to pull out the article text. It can't be specific to one site, since I want to use it for many sites. What would be the best way of doing this?


First, make sure you have permission to use data from these sites.
Once that's cleared up you have a number of options.
Those sites may offer an API/XML/RSS/REST interface for extracting data. If so, use that - it will be more reliable than trying to scrape data from a page.
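For instance, reading an RSS 2.0 feed with PHP's built-in SimpleXML looks roughly like this (the feed URL is just a placeholder - use one you actually have permission to consume):

    <?php
    // Rough sketch: pull title/link/description from an RSS 2.0 feed.
    $feedUrl = 'https://example.com/feed.xml';
    $rss = simplexml_load_file($feedUrl);
    if ($rss === false) {
        die('Could not load the feed.');
    }
    foreach ($rss->channel->item as $item) {
        echo htmlspecialchars((string) $item->title) . "\n";
        echo htmlspecialchars((string) $item->link) . "\n";
        echo htmlspecialchars((string) $item->description) . "\n\n";
    }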
Otherwise, you'll be looking at file_get_contents or cURL. BUT beware - this is a potential security issue unless you lock down and sanitise all the data you pull in before displaying or storing it.
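A rough, site-agnostic sketch of that approach is below. The URL and the "grab every <p> inside <article>" heuristic are only assumptions - real pages vary a lot, which is exactly why a feed or API is preferable:

    <?php
    // Fetch the raw HTML with cURL.
    $ch = curl_init('https://example.com/some-article');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        die('Request failed.');
    }

    // Parse it and pull text from <article>, falling back to all <p> tags.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//article//p');
    if ($nodes->length === 0) {
        $nodes = $xpath->query('//p');
    }
    $text = '';
    foreach ($nodes as $p) {
        $text .= trim($p->textContent) . "\n";
    }
    echo htmlspecialchars($text);       // escape before output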
Extracting data from remote sites on every page load will slow your page down significantly, so think about how many requests you really NEED to make (caching helps - see the sketch below). Also, regarding images: even if a site gives you permission to use its data, it may not allow you to use its images, since the licensing rights to those may not be held by the site owner. Take care and check the small print for syndication. Some RSS feeds that I'm allowed to reproduce on my site stipulate that only a certain number of articles (and then only the abstracts) may be reproduced.
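A very simple way to cut down on remote requests is to cache the fetched data in a local file and only refetch after it goes stale - something along these lines (the file path and one-hour lifetime are just example values):

    <?php
    // Crude file cache: refetch the remote data at most once per hour.
    function fetchWithCache($url, $cacheFile, $maxAgeSeconds = 3600) {
        if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
            return file_get_contents($cacheFile);
        }
        $data = file_get_contents($url);   // or swap in the cURL version above
        if ($data !== false) {
            file_put_contents($cacheFile, $data);
        }
        return $data;
    }

    $xml = fetchWithCache('https://example.com/feed.xml', __DIR__ . '/feed-cache.xml');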



I get the title of the article along with a small description from an RSS feed, but it does not contain the full article.


OK. Still, check that you're allowed to reproduce it on your site. The right to private consumption of an RSS feed isn't the same as the right to display it on your site.
