I want to be able to get the HTML of many different sites and parse it to pull out the article text. It can't be specific to one site, since I want to use it for many sites. What would be the best way of doing this?


diafol

First, make sure you have permission to use data from these sites.
Once that's cleared up you have a number of options.
Those sites may offer an API, an XML/RSS feed, or a REST endpoint for extracting data. If so, use that - it will be more reliable than trying to scrape data from a page.
Otherwise, you'll be looking at file_get_contents or cURL. BUT beware - this is a potential security issue unless you lock down and sanitise all the data properly (there's a minimal sketch after this post).
Extracting data from remote sites will slow down your page significantly, so think about how many requests you really NEED to make. BTW, even if a site gives you permission to use its data, that may not extend to images, whose licensing rights may not be held by the site owner. Take care and check the small print for syndication. Some RSS feeds that I'm allowed to reproduce on my site stipulate that only a certain number of articles (and then only the abstracts) may be reproduced.
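
For the cURL route, something along these lines is a minimal sketch. The URL is a placeholder and the "grab every <p> tag" heuristic is an assumption - real sites will each need their own rules for finding the article body:

<?php
// Sketch: fetch a remote page with cURL and pull out paragraph text with
// DOMDocument. Placeholder URL; naive <p>-tag heuristic.
$url = 'https://example.com/some-article'; // placeholder

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // don't let a slow site hang your page
$html = curl_exec($ch);

if ($html === false) {
    die('Fetch failed: ' . curl_error($ch));
}
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; suppress warnings
$doc->loadHTML($html);
libxml_clear_errors();

// Naive extraction: collect the text of every <p> element.
$paragraphs = array();
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

// Always escape scraped content before echoing it back out.
echo htmlspecialchars(implode("\n\n", $paragraphs), ENT_QUOTES, 'UTF-8');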

I get the title of the article along with a small description from an RSS feed, but it doesn't contain the full article. Roughly what I'm doing is below.
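
Something like this (the feed URL is just a placeholder), which shows why I only end up with the title and a short summary per item:

<?php
// Sketch: read an RSS feed with SimpleXML. Most feeds only carry a title,
// link and short description per item, not the full article body.
$feedUrl = 'https://example.com/feed.rss'; // placeholder

$rss = simplexml_load_file($feedUrl);
if ($rss === false) {
    die('Could not load feed');
}

foreach ($rss->channel->item as $item) {
    echo 'Title: '   . htmlspecialchars((string) $item->title) . "\n";
    echo 'Link: '    . htmlspecialchars((string) $item->link) . "\n";
    // Usually just an abstract; the full text stays on the publisher's site.
    echo 'Summary: ' . htmlspecialchars((string) $item->description) . "\n\n";
}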

diafol

OK, still, check that you're allowed to reproduce it on your site. The right to consume an RSS feed privately isn't the same as the right to display it on your site.
