How do you manipulate text files?

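import urllib2

# Download the page (the URL is left blank here)
dload = urllib2.Request('')
text_file = urllib2.urlopen(dload)
text_file_read = text_file.read()

print text_file_read

# Save the page to a text file
f = open('site.txt', 'w')
f.write(text_file_read)
f.close()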

This downloads the page and saves it as a text file, but how would I do the following things?
- Delete lines of text, for example lines 1-10.
- Return or delete everything between '<body>' and '</body>', etc.
- Search for and delete all '<b>' and '</b>' tags, etc.

I've read about 'parsers' but I'm not quite sure how to use them. Either way, wouldn't this way be easier?


For deleting lines by number, it is very easy (put this after your own code):

f = open('out.txt', 'w')
# enumerate() counts from 0, so i >= 10 skips lines 1-10
for i, l in enumerate(open('site.txt', 'r')):
    if i >= 10:
        f.write(l)
f.close()
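
If you would rather drop the lines from the downloaded string directly, without going through site.txt first, slicing the list of lines is one way to do it. This is only a sketch: drop_lines and the sample text are made-up names for illustration, not part of any library.

```python
def drop_lines(text, first, last):
    """Return text without lines first..last (1-based, inclusive)."""
    lines = text.splitlines(True)  # True keeps the line endings
    # Keep everything before `first` and everything after `last`
    return "".join(lines[:first - 1] + lines[last:])

# Hypothetical 12-line sample standing in for the downloaded page
sample = "".join("line %d\n" % n for n in range(1, 13))
trimmed = drop_lines(sample, 1, 10)
```

With the code from the question, you would call drop_lines(text_file_read, 1, 10) and write the result to a file.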

To delete part of the text, you can use a regexp (this example erases the body part):

import re
# re.DOTALL lets . match \n too; .*? stops at the first </body>
pat = re.compile("<body.*?</body>", re.DOTALL)
f = open('out2.txt', 'w')
f.write(pat.sub("", text_file_read))
f.close()

You can also get the content of that part this way (have a look at the re module).
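
To return the body instead of erasing it, a capturing group in the pattern should do. A rough sketch, assuming the page has a single body element; get_body, body_pat and sample_page are names invented for this example.

```python
import re

# Capture what sits between the opening and closing body tags.
# [^>]* skips attributes inside the opening tag; .*? is non-greedy.
body_pat = re.compile(r"<body[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)

def get_body(page):
    """Return the text inside <body>...</body>, or None if there is no body."""
    match = body_pat.search(page)
    if match is None:
        return None
    return match.group(1)

# Hypothetical page used only to try the function out
sample_page = "<html><body bgcolor='white'>\nHello\n</body></html>"
body = get_body(sample_page)
```

On the downloaded page this would be get_body(text_file_read).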

You can also parse your file, but if erasing a part doesn't depend on any condition on its content, it may be simpler to do it this way :)

To parse HTML files, you can use the HTMLParser module.
For XML or XHTML files, you can use the xml.sax or xml.dom modules.
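
As a small illustration of the HTMLParser route, here is a sketch that keeps only the text of a page and throws every tag away. TextOnly and strip_tags are hypothetical names; the try/except is only there because the module is called html.parser on Python 3.

```python
try:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread
except ImportError:
    from html.parser import HTMLParser  # Python 3 name of the same module

class TextOnly(HTMLParser):
    """Collect the text of a page and ignore every tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.parts.append(data)

def strip_tags(page):
    parser = TextOnly()
    parser.feed(page)
    parser.close()  # flush any text still buffered
    return "".join(parser.parts)

text = strip_tags("<p>Some <b>bold</b> text</p>")
```

To drop only some tags (say '<b>'), you would also override handle_starttag and handle_endtag and re-emit the tags you want to keep. Note that character references such as &amp; are not treated the same way on Python 2 and Python 3.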

Thanks jice, is there any site which shows this in depth?

also, for xml :

and for string manipulation

Cheers :), the XML parsing tutorial was very helpful.
Is there a way to find keywords and delete them?
E.g. it looks for '<title>' and deletes all '<title>'s in the text?

You can use simple string functions:

html = """\
   <title>Ordered List Example</title>
<body bgcolor="#FFFF00" text="#FF0000">
<OL STYLE = "font-family: comic sans ms; font-size:10pt">
   <LI>Religious tolerance</LI>
   <LI>Exact estimate</LI>
   <LI>Military Intelligence</LI>
   <LI>Passive aggression</LI>
   <LI>Tight slacks</LI>
   <LI>Business ethics</LI>
   <LI>Advanced BASIC</LI>
   <LI>Extinct Life</LI>
   <LI>Pretty ugly</LI>
   <LI>Genuine imitation</LI>
</OL>
</body>
"""

# erase just "<title>" 
print(html.replace("<title>", ""))


# erase "<title>" and "</title>"
print(html.replace("<title>", "").replace("</title>", ""))
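
replace() only matches the exact string, so it misses tags written in another case or carrying attributes (like the '<body bgcolor=...>' above). A regexp can cover those; this is just a sketch, and remove_tag and the sample string are names made up for the example.

```python
import re

def remove_tag(page, name):
    """Delete <name ...> and </name> markers but keep the text between them."""
    # </? matches opening and closing tags, \b stops "b" from matching "body",
    # [^>]* swallows any attributes
    pattern = re.compile(r"</?%s\b[^>]*>" % re.escape(name), re.IGNORECASE)
    return pattern.sub("", page)

# Hypothetical sample, shorter than the html string above
sample = '<title>Ordered List Example</title>\n<B class="x">bold</B>'
no_title = remove_tag(sample, "title")
no_bold = remove_tag(sample, "b")
```

With the string above, remove_tag(html, "li") would strip every '<LI>' and '</LI>' in one go.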