Scrape information from a web site

Question

furianera 0 Newbie Poster

12 Years Ago

Hello everyone, I have read all the discussions about "web scraping" here on the DaniWeb forum, but I didn't find a solution to my problem.
I have to extract the "title" and "content" of news items from a website. After reading a lot of tutorials, I wrote these lines:

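<?php
// Parse the page as HTML; load() expects well-formed XML and fails on most real pages.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.php.net');
$titles = $dom->getElementsByTagName('h2');
for ($i = 0; $i < $titles->length; $i++) {
    echo $titles->item($i)->nodeValue . "<br/>";
}
?>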

Everything works fine, printing the content of every "h2". However, when I needed to scrape other elements from the page, I tried creating another variable called $content and adding a new foreach loop, but it doesn't work.
I think this is not the best way to build a web scraper for the URL that I have to scrape, so I would like to ask whether someone could point me to a tutorial to understand everything better, or suggest a PHP library that is easy to use. I have also read the tutorials on www.php.net and googled around, but I still have some doubts.
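
One possible way to read several elements per news item is to select each news block first and then query the title and content relative to it. This is only a sketch: the helper name extractNews() and the markup (a div with class "news" holding an h2 and a p) are assumptions made for illustration, not the structure of the real site.

```php
<?php
// Sketch only: extractNews() and the "news" class are hypothetical.
function extractNews(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate sloppy real-world HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $news = [];
    // One iteration per news block; title and content are read relative to it.
    foreach ($xpath->query('//div[@class="news"]') as $block) {
        $title = $xpath->query('.//h2', $block)->item(0);
        $content = $xpath->query('.//p', $block)->item(0);
        $news[] = [
            'title' => $title ? trim($title->nodeValue) : '',
            'content' => $content ? trim($content->nodeValue) : '',
        ];
    }
    return $news;
}

// Static sample markup, so the sketch runs without network access.
$sample = '<html><body>'
    . '<div class="news"><h2>First title</h2><p>First body</p></div>'
    . '<div class="news"><h2>Second title</h2><p>Second body</p></div>'
    . '</body></html>';

foreach (extractNews($sample) as $item) {
    echo $item['title'] . ': ' . $item['content'] . "\n";
}
```

Keeping the title and content of one item in the same array entry avoids having two separate loops whose indexes must be kept in step.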

All 11 Replies

urtrivedi 276 Nearly a Posting Virtuoso

12 Years Ago

cURL is another way of doing this.
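
As a rough illustration of the cURL route, the page can be downloaded into a string and then handed to whatever parser is in use. The helper name fetchPage(), the timeout and the user-agent string are assumptions chosen for this sketch.

```php
<?php
// Sketch only: fetchPage() is a hypothetical helper name.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_TIMEOUT => 15,                 // give up after 15 seconds
        CURLOPT_USERAGENT => 'MyScraper/1.0',  // assumption: identify your client
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

A call such as `$html = fetchPage('http://www.php.net');` would then return the raw HTML, which could be passed to DOMDocument::loadHTML().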

urtrivedi 276 Nearly a Posting Virtuoso

12 Years Ago

You may try these two links; I hope they help.

http://php-html.sourceforge.net/html2text.php

http://www.chuggnutt.com/html2text

diafol

12 Years Ago

It may be obvious, but ensure you have permission to do this. Some companies get a bit narked with people scraping their info - unless they have a facility (API/REST/XML etc) specifically for this purpose.

furianera 0 Newbie Poster · Answer 1 · 2012-09-02T14:38:35+00:00

I know, but there aren't any useful tutorials (or I didn't search properly) for understanding the basics and how to proceed. Can you suggest something?

urtrivedi 276 Nearly a Posting Virtuoso · Answer 2 · 2012-09-02T14:40:35+00:00

What exactly do you want to extract?

furianera 0 Newbie Poster · Answer 3 · 2012-09-03T09:22:27+00:00

I want to extract the title and the content of news items from a news website. I used the simple_html_dom library and have finished part of the scraper for my project. I extracted the title and the content of all the news items, but the content also includes the link to the full article and the comments. Everything is saved in an array with 2 elements: index 0 = title and index 1 = content. How can I change the content at index 1 of the array to remove the part of the HTML with the link and the comments?
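
One way this could be approached is to parse the content fragment, delete the unwanted nodes, and keep the remaining text. The sketch below uses PHP's built-in DOMDocument rather than simple_html_dom, and the class names "read-more" and "comments", like the helper name cleanContent(), are assumptions; the real selectors depend on the site's markup.

```php
<?php
// Sketch only: cleanContent() and both class names are hypothetical.
function cleanContent(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment so it has a single known root element.
    $dom->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $unwanted = $xpath->query('//a[@class="read-more"] | //div[@class="comments"]');
    // Snapshot the matches before modifying the tree.
    foreach (iterator_to_array($unwanted) as $node) {
        $node->parentNode->removeChild($node);
    }
    // Return only the text that is left inside the wrapper.
    return trim($xpath->query('//div[@id="root"]')->item(0)->textContent);
}

// Array laid out as described above: index 0 = title, index 1 = content.
$article = [
    'Budget news',
    '<p>Budget approved.</p><a class="read-more" href="#">Read all</a>'
        . '<div class="comments">3 comments</div>',
];
$article[1] = cleanContent($article[1]);
echo $article[1] . "\n"; // prints "Budget approved."
```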

furianera 0 Newbie Poster · Answer 4 · 2012-09-03T13:16:44+00:00

Thanks for all the kind replies. I will have a look at the suggested links and write back soon.

furianera 0 Newbie Poster · Answer 5 · 2012-09-03T23:03:12+00:00

Thanks for the information, diafol. I have already written to the webmaster to inform him about the scraping done by my project. He has authorized everything, asking only that I cite the website with a link, which I will of course put at the bottom of my website and in the documentation.

furianera 0 Newbie Poster · Answer 6 · 2012-09-07T02:26:26+00:00

Hi again, I successfully built the scraper for my project, including the part that writes part of the array to a file.
Now I have another issue, and I will try to be brief and clear. I have an array structured like this:

$articles[] = array(
    'type' => "Kind of news",
    'label' => " ",
    'title' => $post->children(0)->plaintext,
    'content' => $post->children(2)->plaintext,
    'hasPoliticians' => "",
    'hasPerson' => " ",
    'fact' => " "
);

All the values of 'title' and 'content' are, as already said, taken from the website.
The titles and contents mention some politicians, some facts and some persons. What I need is a function that reads every element of the 'articles' array and matches the title and content against a list (or an array) of persons, politicians and facts. If a name or fact appears in the 'content' of an article, the name found has to be inserted into the 'hasPoliticians' element of the array entry for the news item where it was found. The same goes for persons and facts.
Can someone please help me find the correct way?
Regards
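
A minimal sketch of such a matching function, assuming the lists are plain arrays of strings and a case-insensitive substring match is good enough. The function names findMatches() and tagArticles() and all sample names are invented for illustration; the thread does not show the solution that was eventually used.

```php
<?php
// Sketch only: findMatches(), tagArticles() and the sample data are hypothetical.
function findMatches(string $text, array $terms): array
{
    $found = [];
    foreach ($terms as $term) {
        // stripos() is a case-insensitive substring search.
        if (stripos($text, $term) !== false) {
            $found[] = $term;
        }
    }
    return $found;
}

function tagArticles(array $articles, array $politicians, array $persons, array $facts): array
{
    foreach ($articles as $i => $article) {
        // Search the title and the content together.
        $text = $article['title'] . ' ' . $article['content'];
        $articles[$i]['hasPoliticians'] = implode(', ', findMatches($text, $politicians));
        $articles[$i]['hasPerson'] = implode(', ', findMatches($text, $persons));
        $articles[$i]['fact'] = implode(', ', findMatches($text, $facts));
    }
    return $articles;
}

$articles = [[
    'type' => 'Kind of news',
    'label' => '',
    'title' => 'Mario Rossi presents the budget',
    'content' => 'Anna Bianchi commented on the election results.',
    'hasPoliticians' => '',
    'hasPerson' => '',
    'fact' => '',
]];

$tagged = tagArticles($articles, ['Mario Rossi', 'Luigi Verdi'], ['Anna Bianchi'], ['election', 'strike']);
print_r($tagged[0]);
```

Because this is a substring match, a short term such as "Rossi" would also match "Rossini"; a preg_match() with word boundaries is one possible refinement if that matters.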

furianera 0 Newbie Poster · Answer 7 · 2012-09-17T09:30:56+00:00

Solved :)

Squidge 101 Newbie Poster · Answer 8 · 2012-09-17T17:39:20+00:00

@furianera,

Could you post your code? I need a scraper that deals with multiple tags.
