
Scrape information from a web site

Hello everyone, I read all the discussions about "web scraping" here on the DaniWeb forum, but I didn't find a solution to my problem.
I have to extract the "title" and "content" of news items from a website. After reading a lot of tutorials, I wrote these lines:

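<?php
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parse warnings quietly
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.php.net'); // loadHTMLFile() parses HTML; load() expects well-formed XML
$title = $dom->getElementsByTagName('h2');
// Print the text of every <h2> element
for ($i = 0; $i < $title->length; $i++) {
    echo $title->item($i)->nodeValue . "<br/>";
}
?>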

Everything works fine and prints the content of every "h2". However, when I need to scrape other elements from the page, I tried creating another variable called $content and adding a second loop, but it doesn't work.
I don't think this is the best way to build a web scraper for the URL I have to scrape, so I'd like to ask whether someone could point me to a tutorial to understand everything better, or suggest an easy-to-use PHP library. I also read the documentation on www.php.net and googled around, but I still have some doubts.
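
For readers following along, here is a minimal sketch of one way to pull two fields per news item with PHP's built-in DOM extension and DOMXPath. The markup, the "news" and "summary" class names and the $articles array are assumptions made up for illustration; the real page will need its own XPath expressions.

```php
<?php
// Sample markup standing in for the real page (hypothetical structure and class names).
$html = <<<HTML
<html><body>
  <div class="news">
    <h2>First headline</h2>
    <p class="summary">First summary text.</p>
  </div>
  <div class="news">
    <h2>Second headline</h2>
    <p class="summary">Second summary text.</p>
  </div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($html);              // for a live page: $dom->loadHTMLFile($url)
$xpath = new DOMXPath($dom);

$articles = [];
// Visit each news block once and read both fields relative to it,
// so a title is always paired with its own content.
foreach ($xpath->query('//div[@class="news"]') as $block) {
    $titleNode   = $xpath->query('.//h2', $block)->item(0);
    $contentNode = $xpath->query('.//p[@class="summary"]', $block)->item(0);
    $articles[] = [
        'title'   => $titleNode ? trim($titleNode->textContent) : '',
        'content' => $contentNode ? trim($contentNode->textContent) : '',
    ];
}

print_r($articles);
```

Looping over the enclosing block, rather than running one loop for all titles and a second for all contents, keeps each title next to its own content even when an item is missing one of the two.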

furianera

cURL is another way of doing it.
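
A rough sketch of what the cURL route might look like, assuming the PHP cURL extension is installed. The function names fetchHtml and extractTagText and the user-agent string are hypothetical, chosen only for this example; cURL handles the download and the DOM extension still does the parsing.

```php
<?php
/**
 * Fetch a URL with cURL and return the body, or null on failure.
 */
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_TIMEOUT        => 10,     // give up after 10 seconds
        CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // hypothetical identifier
    ]);
    $body = curl_exec($ch);
    // Treat transport errors and HTTP 4xx/5xx responses as failures.
    $ok = $body !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) < 400;
    return $ok ? $body : null;
}

/**
 * Return the trimmed text of every element with the given tag name.
 */
function extractTagText(string $html, string $tag): array
{
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $out = [];
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Usage (makes a network request, so it is left commented out):
// $html = fetchHtml('http://www.php.net');
// if ($html !== null) {
//     print_r(extractTagText($html, 'h2'));
// }
```

Separating the download from the parsing means the parsing step can be tried on a saved copy of the page without hitting the site on every run.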

urtrivedi

I know, but there aren't any useful tutorials (or I didn't search properly) that explain the basics and how to use it. Can you suggest something?

furianera

What exactly do you want to extract?

urtrivedi

I want to extract the title and the content of each news item from a news website. I used the simple_html_dom library and have finished part of the scraper for my project. I extracted the title and the content of every news item, but the content also includes the link to the full article and the comments. Everything is saved in an array with two elements: index 0 = title and index 1 = content. How can I change the content at index 1 of the array and remove the part of the HTML containing the link and the comments?
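
One possible approach, sketched here with PHP's built-in DOM extension rather than simple_html_dom: remove the unwanted child nodes from the content element before reading its text. The markup and the "post", "content", "more" and "comments" class names are assumptions for illustration only.

```php
<?php
// Hypothetical markup: one news block whose content also holds a
// "read more" link and a comments box that should not be kept.
$html = <<<HTML
<div class="post">
  <h2>Budget approved</h2>
  <div class="content">
    The council approved the budget.
    <a class="more" href="/news/1">Read the full article</a>
    <div class="comments">3 comments</div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);   // tolerate imperfect HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$content = $xpath->query('//div[@class="content"]')->item(0);

// Copy the matches into an array before removing them,
// so removal never happens in the middle of an iteration.
$unwanted = iterator_to_array(
    $xpath->query('.//a[@class="more"] | .//div[@class="comments"]', $content)
);
foreach ($unwanted as $node) {
    $node->parentNode->removeChild($node);
}

$article = [
    trim($xpath->query('//h2')->item(0)->textContent), // index 0 = title
    trim($content->textContent),                       // index 1 = content
];

print_r($article);
```

Removing the nodes from the tree, instead of cutting strings out of the extracted text, keeps working when the link text or the comment count changes.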

furianera
urtrivedi

Thanks for all the kind replies, I will have a look at the suggested links. I'll write back soon.

furianera

It may be obvious, but ensure you have permission to do this. Some companies get a bit narked with people scraping their info - unless they have a facility (API/REST/XML etc) specifically for this purpose.

diafol

Thanks for the information, diafol. I have already written to the webmaster to inform him about the scraping done by my project. He has authorized everything, asking only that I credit the website with a link, which I will of course put at the bottom of my website and in the documentation.

furianera

Hi again, I built the scraper for my project successfully, including the part that writes part of the array to a file.
Now I have another issue, and I'll try to be brief and clear. I have an array structured like this:

$articles[] = array(
    'type'           => "Kind of news",
    'label'          => " ",
    'title'          => $post->children(0)->plaintext,
    'content'        => $post->children(2)->plaintext,
    'hasPoliticians' => "",
    'hasPerson'      => " ",
    'fact'           => " "
);

All the 'title' and 'content' values are, as already mentioned, taken from the website.
The titles and contents mention some politicians, some facts and some people. I need a function that reads every element of the 'articles' array and matches the title and content against a list (or an array) of people, politicians and facts. If a politician's name appears in the 'content' of an article, it has to be inserted into the 'hasPoliticians' element of the array entry for the news item where it was found. The same goes for people and facts.
Can someone please help me find the correct way?
Regards
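
A minimal sketch of the matching step described above, assuming the lists are plain PHP arrays and that the matches are stored as a comma-separated string. The function names findNames and tagArticles and all sample data are hypothetical.

```php
<?php
/**
 * Return the names from $names that occur in $text
 * (case-insensitive, whole-word match).
 */
function findNames(string $text, array $names): array
{
    $found = [];
    foreach ($names as $name) {
        // preg_quote() escapes characters that are special in a regex.
        $pattern = '/\b' . preg_quote($name, '/') . '\b/iu';
        if (preg_match($pattern, $text) === 1) {
            $found[] = $name;
        }
    }
    return $found;
}

/**
 * Fill each key listed in $lists (e.g. 'hasPoliticians') for every article
 * by searching its title and content against the corresponding list.
 */
function tagArticles(array $articles, array $lists): array
{
    foreach ($articles as $i => $article) {
        $text = $article['title'] . ' ' . $article['content'];
        foreach ($lists as $key => $names) {
            $articles[$i][$key] = implode(', ', findNames($text, $names));
        }
    }
    return $articles;
}

// Hypothetical sample data; real lists would come from a file or database.
$articles = [
    ['title' => 'Rossi meets mayor', 'content' => 'Mario Rossi discussed the flood with Anna Bianchi.',
     'hasPoliticians' => '', 'hasPerson' => '', 'fact' => ''],
    ['title' => 'Local fair opens', 'content' => 'No well-known names attended.',
     'hasPoliticians' => '', 'hasPerson' => '', 'fact' => ''],
];
$lists = [
    'hasPoliticians' => ['Mario Rossi', 'Luigi Verdi'],
    'hasPerson'      => ['Anna Bianchi'],
    'fact'           => ['flood', 'election'],
];

$tagged = tagArticles($articles, $lists);
print_r($tagged);
```

Storing the matches as an array instead of a joined string would be an equally reasonable choice if the values are processed further rather than printed.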

furianera
Question Answered as of 8 Months Ago by urtrivedi and diafol

Solved :)

furianera

@furianera;

Could you post your code? I need a scraper that deals with multiple tags.

Squidge
