data grabbing from html sites

Question

a1eio 16 Junior Poster

19 Years Ago

Hi,
I'd like to create something that basically grabs information from websites. However, I haven't any experience with urllib (apart from very basic page reading), and the issue is that the page I want to grab data from checks to see if another page is connected to it. Or, to phrase it better: it checks whether it is in the correct iframe in relation to an iframe next to it. It's hard to explain, but any help or pointers are appreciated.

html-css python

4 Contributors
8 Replies
343 Views
9 Months Discussion Span
Latest Post 18 Years Ago Latest Post by metabo_man

shanenin 0 Posting Whiz in Training

19 Years Ago

I just wrote a little script that searches for and grabs BitTorrent files. One of the versions needs to read page source.

Quote: "... the page i want to grab data from checks to see if another page is connected to it ..."

When it comes to the web, I am not too knowledgeable, so I'm not sure what that means. If you just want to grab the source from a website and put it into a string to then parse out the needed info, this will get you started:

import urllib2  # Python 2 standard library module

url = 'http://google.com'
# urlopen returns a file-like response object for the page
page = urllib2.urlopen(url)
# read() returns the whole page source as a string, so it can be manipulated
page_string = page.read()

Now, using different string methods, you can parse out any needed data.
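For readers on Python 3, where urllib2 no longer exists as a separate module (its functionality lives in urllib.request), a minimal sketch of the same fetch-then-parse idea might look like this. The helper names fetch_page and extract_title, the 10-second timeout, and the UTF-8 fallback are illustrative assumptions, not part of the script above.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Return the body of `url` decoded to a str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Use the charset the server declared; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def extract_title(html):
    """Pull the text between <title> and </title> using plain string methods.

    Assumes a lowercase <title> tag without attributes; returns None otherwise.
    """
    start = html.find("<title>")
    if start == -1:
        return None
    end = html.find("</title>", start)
    if end == -1:
        return None
    return html[start + len("<title>"):end].strip()
```

Something like extract_title(fetch_page('http://google.com')) would then give the page title. For anything beyond a simple, well-known marker like this, a real HTML parser is usually less fragile than string searching.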

Edit, added later:

I just reread your post; you seem to need to do more than I just explained. Sorry I don't have something more useful to tell you.

shanenin 0 Posting Whiz in Training

19 Years Ago

Could you give me the URL of a site you are trying to get data from, and explain what kind of data you need to find? Are you trying to get certain URLs that link to other sites?

a1eio 16 Junior Poster · Answer 1 · 2005-10-09T23:17:53+00:00

Don't worry, it's a start. At least now I can read the sites, so thanks for the help :)
Gotta start somewhere.

a1eio 16 Junior Poster · Answer 2 · 2005-10-10T22:59:48+00:00

It's a game.
I run out of ideas and frequently ask my friends for things to code. This time a friend said there is a game site he plays (I now play it too ;) ), but the layout is rubbish, so he wanted me to code something that grabs the data from certain pages and puts it in a Tkinter GUI-style table. That way it's easier to work out the numbers: how many of this you need to train, how many of that you need to buy, etc.
*back to the site*
The site only shows this 'end of turn' page (and others) if it is still 'connected' to the sidebar on the left-hand side; if it's not, it redirects you to an error page. So I want to know how to trick a webpage into thinking it is being viewed how it should be (with its main page and toolbar down the side), instead of being opened on its own or by a program.
It's hard to explain.
And if you want to view the site you would have to register, etc.,
unless of course you're interested in text-based, resource-handling-style web games.

*EDIT: www.aarcsoft.com, then click on 'games', then click on the top game, then choose the server you want to connect to, then register and play. Simple as that, really.
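One common way a site decides whether a page is being viewed "inside" its frameset is to look at the session cookie and the Referer header sent with the request. Whether this particular game site works that way is an assumption; a check done by JavaScript in the browser would not be affected by this at all, and logging in is not covered here. Under that assumption, a rough Python 3 sketch is to keep cookies across requests and send the frameset page as the referrer. The URLs, the User-Agent value, and the helper names are all hypothetical.

```python
import http.cookiejar
import urllib.request

# Hypothetical URLs; replace them with the real frameset and content pages.
FRAMESET_URL = "http://example.com/game/index.php"
PAGE_URL = "http://example.com/game/end_of_turn.php"


def make_session_opener():
    """Build an opener that remembers cookies between requests, like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def framed_request(page_url, frameset_url):
    """Create a request that claims to have been loaded from the frameset page."""
    return urllib.request.Request(
        page_url,
        headers={
            "Referer": frameset_url,
            # Some sites reject the default Python user agent; this value is illustrative.
            "User-Agent": "Mozilla/5.0 (compatible; example-script)",
        },
    )


def fetch_framed_page(opener, page_url, frameset_url, timeout=10):
    """Load the frameset first (to pick up cookies), then the inner page."""
    opener.open(frameset_url, timeout=timeout).close()
    with opener.open(framed_request(page_url, frameset_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

If the error page still comes back, comparing the headers a real browser sends for the inner frame against the ones sent here is the usual next step.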

shanenin 0 Posting Whiz in Training · Answer 3 · 2005-10-11T00:37:05+00:00

thanks for the nice explanation. I don't have any ideas, but maybe someone else will.

metabo_man 0 Newbie Poster · Answer 4 · 2006-07-23T16:48:37+00:00

hello all

hi all,

Same thing, same problem. I guess we have the same interests: I also want to grab some data out of an existing site, a forum.

First off, I have to explain something: I have to grab some data out of a phpBB forum in order to do some field research. I need the data out of a forum that is run by a user community, so that I can analyze the discussions.
Nothing harmful, nothing bad, nothing serious or dangerous. But the issue is that I have to get the data. So how?
Well, we could think of some automation that runs through the forums with WWW::Mechanize and gets all the data:

http://search.cpan.org/search?query=WWW%3A%3AMechanize++LWP%2FHTTP&mode=all
But no, I don't need everything. I need some individual threads to analyze (about 400 to 600 threads).

Some examples:
http://www.phpbb.com/phpBB/viewtopic.php?t=415990
http://www.phpbb.com/phpBB/viewtopic.php?t=415980
http://www.phpbb.com/phpBB/viewtopic.php?t=415970
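Since the example links differ only in the t= parameter, the list of pages to visit can be generated from a list of topic IDs. A small Python sketch, assuming the real forum uses the same viewtopic.php?t=ID pattern (the function name thread_urls is made up for illustration):

```python
from urllib.parse import urlencode


def thread_urls(base_url, topic_ids):
    """Build one viewtopic URL per topic ID."""
    return [f"{base_url}?{urlencode({'t': topic_id})}" for topic_id in topic_ids]


urls = thread_urls(
    "http://www.phpbb.com/phpBB/viewtopic.php",
    [415990, 415980, 415970],
)
```

Long threads are usually split over several pages, so each thread may need more than one request; how the paging parameter looks depends on the forum and is not shown here.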

By the way, these are only examples, not from the real forum that is of interest.
I need the data in an almost full and complete format. So I need all the data, like:
username
forum
thread
topic
text of the posting, and so on.
How to do that?
I need some kind of grabbing tool. Can I do it with that kind of tool? And how do I solve the issue of storing the data in the local MySQL database?
Well, you see, that is tricky work, and I am pretty sure that I will get help here. So for any and all help I am very, very thankful.

Many, many thanks in advance.

Ethno-researcher

By the way: for the automation I suggest looking at WWW::Mechanize, as it encapsulates many of the lower-level web automation tools provided by Perl. In my opinion we *will not* find better web automation tools in any language. The LWP/HTTP suite of modules is extremely powerful.
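For the storage question, a minimal sketch of the table layout and the insert step follows. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the table name, column names, and helper functions are assumptions for illustration. Each post is taken to be a (username, text) pair that some earlier scraping step has already produced.

```python
import sqlite3

# Hypothetical schema covering the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    id        INTEGER PRIMARY KEY,
    forum     TEXT NOT NULL,
    thread_id INTEGER NOT NULL,
    topic     TEXT NOT NULL,
    username  TEXT NOT NULL,
    body      TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the posts table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_posts(conn, forum, thread_id, topic, posts):
    """Insert (username, body) pairs for one thread; returns the number of rows."""
    rows = [(forum, thread_id, topic, user, body) for user, body in posts]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO posts (forum, thread_id, topic, username, body) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

With a MySQL driver the structure would be much the same, though DB-API drivers for MySQL typically use %s rather than ? as the placeholder. Passing values as parameters, instead of building the SQL string by hand, keeps quotes inside forum posts from breaking the statement.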

bumsfeld 413 Nearly a Posting Virtuoso · Answer 5 · 2006-07-24T16:52:33+00:00

"Beautiful Soup" is an HTML/XML parser for Python that can turn even poorly written markup code into a parse tree, so you can extract information.

Download the free program and documentation from:
http://www.crummy.com/software/BeautifulSoup/
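Beautiful Soup is a separate download. To show the general extract-fields-from-markup idea without installing anything, here is a sketch that swaps in the standard library's html.parser instead; note that it is event-based and does not build a parse tree the way Beautiful Soup does. The class names "name" and "postbody" are hypothetical, so the real forum's HTML has to be inspected and the constants adjusted. Nested span tags inside a post body are not handled.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect (username, text) pairs from a simplified thread page."""

    # Hypothetical class names; adjust them to the real markup.
    AUTHOR_CLASS = "name"
    BODY_CLASS = "postbody"

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.posts = []      # finished (username, text) tuples
        self._field = None   # "author", "body", or None when outside both
        self._author = ""
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css = dict(attrs).get("class")
        if css == self.AUTHOR_CLASS:
            self._field, self._chunks = "author", []
        elif css == self.BODY_CLASS:
            self._field, self._chunks = "body", []

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._field is None:
            return
        text = "".join(self._chunks).strip()
        if self._field == "author":
            self._author = text
        else:
            self.posts.append((self._author, text))
        self._field = None


def extract_posts(html):
    """Parse one page of HTML and return its list of (username, text) pairs."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Calling extract_posts(page_string) on a downloaded thread page would then return one tuple per posting, ready to be analyzed or stored.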

metabo_man 0 Newbie Poster · Answer 6 · 2006-07-25T00:32:12+00:00

Hello, and many, many thanks.

I'm guessing that this can help me.

Well, I look forward to learning more about it.

Thanks in advance,
meta