How do you manipulate text files?

import urllib2
dload = urllib2.Request('http://tvnz.co.nz/content/tv2_listings_data/portable_listings_skin')
text_file = urllib2.urlopen(dload)
text_file_read = text_file.read()

print text_file_read

f = open('site.txt', 'w')
f.write(text_file_read) 
f.close()

This returns the page and saves it as a text file, but how would I do the following things?
-Delete lines of text, for example lines 1-10.
-Return/delete everything between '<body>' and '</body>', etc.
-Search for and delete all '<b>' and '</b>' tags, etc.

I've read about 'parsers' but I'm not quite sure how to use them. Either way, wouldn't doing it this way be easier?

-Thanks

Deleting by line number is very easy (put this after your own code):

f = open('out.txt', 'w')
for i, l in enumerate(open('site.txt', 'r')):
    if i >= 10:  # enumerate counts from 0, so this skips lines 1-10
        f.write(l)
f.close()

To delete part of the text, you can use a regular expression (this example erases the body section):

import re
pat=re.compile("<body.*</body>", re.DOTALL) # re.DOTALL is to include \n
f = open ('out2.txt', 'w')
f.write(pat.sub("", text_file_read))
f.close()

You can also extract the content of that section with the same kind of pattern (have a look at the re module).
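As a rough sketch of extracting rather than deleting, a capturing group can pull out whatever sits between the body tags. The `page` string and the names `body_pat` and `body_text` are made up for illustration; in the code above you would pass `text_file_read` instead. This assumes the page has a single body section.

```python
import re

# Hypothetical sample standing in for text_file_read
page = ("<html><head><title>t</title></head>\n"
        "<body class='x'>\nHello\n<b>world</b>\n</body></html>")

# (.*?) is a non-greedy group holding everything between the two body tags.
# re.DOTALL lets "." match newlines; re.IGNORECASE also accepts <BODY>.
body_pat = re.compile(r"<body[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)

match = body_pat.search(page)
body_text = match.group(1) if match else ""
print(body_text)
```

If no body tag is found, `body_text` is simply an empty string.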

You can also parse the file, but if the part you want to erase doesn't depend on any condition on its content, it is probably simpler to do it this way :)

To parse HTML files, you can use the HTMLParser module.
For XML or XHTML files, you can use the xml.sax or xml.dom modules.
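Here is a minimal sketch of what using HTMLParser might look like for stripping every tag and keeping only the text. The class name `TextOnly` and the helper `strip_tags` are hypothetical names, not part of the module. The import is written so it works both with the Python 2 module name used in this thread and with its Python 3 location, `html.parser`.

```python
try:
    from html.parser import HTMLParser   # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 name used in this thread


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text found between tags
        self.chunks.append(data)


def strip_tags(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<p>Some <b>bold</b> text</p>"))
```

Overriding handle_starttag and handle_endtag in the same class would let you react to specific tags instead of dropping them all.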

Also, for XML:
http://www.devshed.com/c/a/Python/Working-with-XML-Documents-and-Python/
http://www.devshed.com/c/a/Python/Parsing-XML-with-SAX-and-Python/

http://www.google.fr/search?hl=fr&q=python+xml+tutorial&btnG=Rechercher&meta=

And for string manipulation:
http://www.devshed.com/c/a/Python/String-Manipulation/

Cheers :) The XML parsing tutorial was very helpful.
Is there a way to find keywords and delete them?
For example, look for '<title>' and delete every '<title>' in the text?

You can use simple string functions:

html = """\
<html>
<head>
   <title>Ordered List Example</title>
</head>
<body bgcolor="#FFFF00" text="#FF0000">
<OL STYLE = "font-family: comic sans ms; font-size:10pt">
   <LI>Religious tolerance</LI>
   <LI>Exact estimate</LI>
   <LI>Military Intelligence</LI>
   <LI>Passive aggression</LI>
   <LI>Tight slacks</LI>
   <LI>Business ethics</LI>
   <LI>Advanced BASIC</LI>
   <LI>Extinct Life</LI>
   <LI>Pretty ugly</LI>
   <LI>Genuine imitation</LI>
</OL>
</body>
</html>
"""

# erase just "<title>" 
print(html.replace("<title>", ""))

print('-'*50)

# erase "<title>" and "</title>"
print(html.replace("<title>", "").replace("</title>", ""))
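str.replace only matches the exact string, so it would miss '<B>' or a tag with attributes such as '<b class="x">'. If that matters, a regular expression is one possible alternative. The sketch below uses a hypothetical helper, `remove_tag`, and a made-up `sample` string. It removes the tags but keeps the text inside them.

```python
import re


def remove_tag(text, tag):
    """Delete every opening and closing occurrence of tag, keeping the text inside."""
    # </? matches both <b> and </b>; \b stops "b" from also matching "<body>";
    # [^>]* swallows any attributes up to the closing ">".
    pattern = re.compile(r"</?%s\b[^>]*>" % re.escape(tag), re.IGNORECASE)
    return pattern.sub("", text)


# Hypothetical sample input
sample = '<body bgcolor="#FFFF00"><B>Hi</B> <b class="x">there</b></body>'
print(remove_tag(sample, "b"))
```

Calling remove_tag(html, "title") on the example above would drop both '<title>' and '</title>' in one pass.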