Extract from an HTML file

Hi,
I'm trying to extract certain things from a web page. The website is TVRage.com, and the example I'm using at the moment is the Warehouse 13 episode list. So far I've managed to get the title of the show using this code:

#!/usr/bin/env python

import urllib

def save_page(site="http://www.tvrage.com/Warehouse_13/episode_list"):
	mypath = site
	mylines = urllib.urlopen(mypath).readlines() 
	
	f = open('temp2.txt', 'w')
	for item in mylines:	
		f.write(item)

	f.close()

def find_title(temp="temp2.txt"):
	f = open(temp, "r")
	site = f.read()
	f.close()
	
	search1 = "<title>"
	search2 = " (Episode"
	starter = site.find(search1)
	ender   = site.find(search2)
	#print "Starts at %s and ends at %s" % (starter, ender) Just gives the indexes
	print site[(starter+19):ender]

Now I'm trying to get the episode numbers, dates, and titles. The only problem is that I can't figure out how to extract them from the HTML. So far I've tried this code, to no effect:

def find_episodes(temp="temp2.txt"):
	f = open(temp, "r")
	site = f.read()
	f.close()
	
	for line in site:
		if '/Warehouse_13/episodes/1064905360' in line:
			print line
		else:
			print "We got nothing."

Any suggestions would help tremendously.

theweirdone

Like this?

#!/usr/bin/env python

import urllib

def save_page(site="http://www.tvrage.com/Warehouse_13/episode_list"):
    mypath = site
    f = open('temp2.txt', 'w')
    for item in urllib.urlopen(mypath).readlines():    
        f.write(item)
    f.close()

def find_title(temp="temp2.txt"):
    f = open(temp)
    site = f.readlines()
    f.close()
    title = None  # stays None if no line contains a <title> tag
    for item in site:
        if '<title>' in item:
            # Keep only the text between <title> and </title>.
            before_html, tag_before, rest_html = item.partition('<title>')
            title, tag_after, after_html = rest_html.partition('</title>')
            break
    print 'Title:', title

def find_episodes(temp="temp2.txt"):
    f = open(temp)
    site = f.readlines()
    f.close()
    # Markup that sits just before each episode id on the page.
    marker = '''onmouseover="showToolTip2(event,'View Trailer');return false;" onmouseout="hideToolTip2();" ></a> <a href='/Warehouse_13/episodes/'''
    for item in site:
        if marker in item:
            before_html, tag_before, rest_html = item.partition(marker)
            title, tag_after, after_html = rest_html.partition('</a> </td>')
            # Drop the first 12 characters (apparently the episode id,
            # e.g. 1064905360, plus the closing '> of the href).
            print 'Episodes:', title[12:]

save_page()
find_title()
find_episodes()
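
For what it's worth, the original find_episodes prints "We got nothing." over and over because f.read() returns the whole file as one string, and looping over a string visits one character at a time, so the long path can never be "in" a single character. readlines(), as used above, yields whole lines instead. A minimal sketch of the difference, written for Python 3 and using a made-up two-line page (the episode title and the surrounding markup are assumptions, not the real TVRage HTML):

```python
# Hypothetical stand-in for the saved page; the real HTML will differ.
page = (
    "<td>1x01</td>\n"
    "<a href='/Warehouse_13/episodes/1064905360'>Pilot</a>\n"
)
needle = "/Warehouse_13/episodes/1064905360"

# Iterating over a string yields single characters, so nothing ever matches.
char_hits = [ch for ch in page if needle in ch]

# Iterating over lines (what readlines() gives you) finds the link.
line_hits = [line for line in page.splitlines() if needle in line]

print(len(char_hits))
print(line_hits)
```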

Happy coding.
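
An alternative to matching on the exact surrounding markup is to let an HTML parser find the links, which tends to survive small changes in the page layout. The following is only a sketch under assumptions: it is written for Python 3 (in Python 2 the module is named HTMLParser), the class name EpisodeLinkParser is made up for illustration, and the sample markup, second episode id and titles are invented rather than taken from the real page.

```python
from html.parser import HTMLParser


class EpisodeLinkParser(HTMLParser):
    """Collect (href, link text) pairs for links whose href contains a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.prefix in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up sample markup; the real page layout is an assumption.
sample = """
<tr><td>1x01</td><td><a href='/Warehouse_13/episodes/1064905360'>Pilot</a> </td></tr>
<tr><td>1x02</td><td><a href='/Warehouse_13/episodes/1064905361'>Resonance</a> </td></tr>
<a href='/other/page'>Unrelated</a>
"""

parser = EpisodeLinkParser("/Warehouse_13/episodes/")
parser.feed(sample)
for href, text in parser.links:
    print(href, "->", text)
```

On the real page you would feed the parser the contents of temp2.txt instead of the sample string, and probably filter out links with empty text.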

Beat_Slayer

Hello

I am totally new to Python and would like to develop a script.

These are my requirements:

1) I want to extract a number from a web page and constantly monitor it for changes.
2) Once the number changes, the script should compare it with the number extracted earlier.
3) If the new number is greater, the script should trigger some tasks (e.g. something like an API call to interface with another script).

Please see the attached HTML image to understand more.
Thank you very much in advance.

Attachments HTML.JPG 23.03KB
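
The three requirements above could be sketched roughly as follows. This is a sketch under assumptions, not a finished script: it is written for Python 3, the function names extract_number and monitor are made up, the default pattern simply grabs the first run of digits (the real pattern depends on the markup in the attached image), and the page download is replaced by canned strings so the example runs offline.

```python
import re
import time


def extract_number(html, pattern=r"(\d+)"):
    """Return the first integer matched by pattern, or None if absent."""
    match = re.search(pattern, html)
    return int(match.group(1)) if match else None


def monitor(fetch, on_increase, polls, interval=0.0):
    """Poll fetch() `polls` times; call on_increase(old, new) when the number grows."""
    last = None
    for _ in range(polls):
        current = extract_number(fetch())
        if current is not None:
            # Requirement 2 and 3: compare with the earlier value, act on growth.
            if last is not None and current > last:
                on_increase(last, current)
            last = current
        time.sleep(interval)
    return last


# Stand-in for downloading the page: replays canned HTML snippets.
pages = iter(["<b>5</b>", "<b>5</b>", "<b>7</b>", "<b>6</b>"])
events = []
final = monitor(lambda: next(pages), lambda old, new: events.append((old, new)), polls=4)
print(events, final)
```

In a real script, fetch would download the page (for example with urllib), on_increase would call the other script, and interval would be set to something polite such as 60 seconds.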
Aung Myat
