Hello everyone, I have read all the discussions about "web scraping" here on the DaniWeb forum, but I didn't find a solution to my problem.
I have to extract the "title" and "content" of news items from a website. After reading a lot of tutorials, I wrote these lines:

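<?php
libxml_use_internal_errors(true);           // collect warnings from imperfect markup
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.php.net');   // load() expects well-formed XML, not HTML
$titles = $dom->getElementsByTagName('h2');
// Print the text of every <h2> element on the page
foreach ($titles as $title) {
    echo $title->nodeValue . "<br/>";
}
?>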

Everything works fine and prints the content of every "h2". However, when I need to scrape other elements from the page, I tried creating another variable called $content and adding a second loop, but it doesn't work.
I think this is not the best way to build a web scraper for the URL I have to scrape, so I am asking whether someone could point me to a tutorial to understand everything better, or suggest a PHP library that is easy to use. I have also read the tutorial on www.php.net and googled around, but I still have some doubts.
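One way to pull several kinds of elements in a single pass is to query the parsed document with DOMXPath instead of calling getElementsByTagName once per tag. The sketch below is only an illustration: the markup, the "news" class name, and the h2/p structure are assumptions standing in for the real page, which is not shown in this thread.

```php
<?php
// Hypothetical markup standing in for the real news page.
$html = <<<HTML
<html><body>
  <div class="news"><h2>First title</h2><p>First body.</p></div>
  <div class="news"><h2>Second title</h2><p>Second body.</p></div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);              // for a live page, loadHTMLFile($url) could be used instead
$xpath = new DOMXPath($dom);

$articles = [];
// One iteration per news block; title and content are looked up relative to it.
foreach ($xpath->query('//div[@class="news"]') as $block) {
    $titleNode   = $xpath->query('.//h2', $block)->item(0);
    $contentNode = $xpath->query('.//p', $block)->item(0);
    $articles[] = [
        $titleNode   ? trim($titleNode->textContent)   : '',   // index 0 = title
        $contentNode ? trim($contentNode->textContent) : '',   // index 1 = content
    ];
}

print_r($articles);
```

Querying relative to each block keeps a title paired with its own content, which two independent loops over all h2 and all p elements would not guarantee.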


cURL is another way of doing it.

I know, but there aren't any useful tutorials (or I didn't search properly) that explain the basics and how to use it. Can you suggest something?

What exactly do you want to extract?

I want to extract the title and the content of news items from a news website. I used the simple_html_dom library and have finished part of the scraper for my project. I extracted the title and content of every news item, but the content also includes the link to the full article and the comments. Everything is saved in an array with two elements: index 0 = title and index 1 = content. How can I change the content at index 1 of the array and remove the part of the HTML with the link and the comments?
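A possible approach, sketched here with PHP's built-in DOMDocument rather than simple_html_dom, is to delete the unwanted nodes from the content block before reading its text. The fragment and the "read-more" and "comments" class names are hypothetical; the real selectors depend on the site's markup.

```php
<?php
// Hypothetical content fragment; the class names below are assumptions.
$contentHtml = '<div class="content"><p>Article text.</p>'
             . '<a class="read-more" href="/full">Read the full article</a>'
             . '<div class="comments">3 comments</div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($contentHtml);
$xpath = new DOMXPath($dom);

// Collect the unwanted nodes first, then remove them from their parents.
$unwanted = iterator_to_array(
    $xpath->query('//a[@class="read-more"] | //div[@class="comments"]')
);
foreach ($unwanted as $node) {
    $node->parentNode->removeChild($node);
}

// index 0 = title, index 1 = content, as described in the post
$article = ['Some title', $contentHtml];
$article[1] = trim($dom->getElementsByTagName('div')->item(0)->textContent);

echo $article[1];
```

Removing nodes from the tree tends to be more robust than cutting the string with str_replace or a regular expression, because it does not depend on the exact text of the link or the comments.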

Thanks for all the kind replies. I will have a look at the suggested links and write back soon.

Member Avatar for diafol

It may be obvious, but ensure you have permission to do this. Some companies get a bit narked with people scraping their info - unless they have a facility (API/REST/XML etc) specifically for this purpose.

Thanks for the information, diafol. I have already written to the webmaster to inform him about the scraping done by my project. He has authorized everything and only asked me to credit the website with a "link", which I will of course put at the bottom of my website and in the documentation.

Hi again, I built the scraper for my project successfully, including the part that writes part of the array to a file.
Now I have another issue, and I will try to be brief and clear. I have an array structured like this:

$articles[] = array(
    'type' => "Kind of news",
    'label' => " ",
    'title' => $post->children(0)->plaintext,   // closing quote after title was missing
    'content' => $post->children(2)->plaintext,
    'hasPoliticians' => "",
    'hasPerson' => " ",
    'fact' => " "
);

The values of 'title' and 'content' are, as already said, taken from the website.
The title and content mention some politicians, some facts, and some other people. What I need is a function that reads every element of the 'articles' array and matches the title and content against a list (or an array) of persons, politicians, and facts. If a name or fact appears in the 'content' of an article, it has to be inserted into the 'hasPoliticians' element of the array entry for the news item where it was found. The same goes for persons and facts.
Can someone please help me find the correct way?
Regards
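The thread does not show the solution that was eventually used. As a minimal sketch of the matching described above, assuming plain case-insensitive substring matching is good enough, something like the following could fill the three fields. The function name findMentions, the lists, and the sample article are all hypothetical.

```php
<?php
/**
 * Return every entry of $names that occurs in $text (case-insensitive).
 * Hypothetical helper, not part of any library.
 */
function findMentions(string $text, array $names): array
{
    $found = [];
    foreach ($names as $name) {
        if (stripos($text, $name) !== false) {
            $found[] = $name;
        }
    }
    return $found;
}

// Hypothetical reference lists and a sample article in the structure shown above.
$politicians = ['Jane Doe', 'John Roe'];
$persons     = ['Alice Example'];
$facts       = ['election', 'budget'];

$articles = [[
    'type' => 'Kind of news',
    'label' => '',
    'title' => 'Jane Doe presents the budget',
    'content' => 'Alice Example commented on the plan by Jane Doe.',
    'hasPoliticians' => '',
    'hasPerson' => '',
    'fact' => '',
]];

// Write the matches back into the entry at the same index.
foreach ($articles as $i => $article) {
    $text = $article['title'] . ' ' . $article['content'];
    $articles[$i]['hasPoliticians'] = implode(', ', findMentions($text, $politicians));
    $articles[$i]['hasPerson']      = implode(', ', findMentions($text, $persons));
    $articles[$i]['fact']           = implode(', ', findMentions($text, $facts));
}

print_r($articles[0]);
```

Substring matching can produce false positives (a short name may match inside a longer word), so a word-boundary check with preg_match might be preferable for real data.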

Solved :)

@furianera;

Could you post your code? I need a scraper that deals with multiple tags.
