How can I extract a personal address from an HTML file?
After I read the source of the HTML file with the read() method, what pattern should I use to extract the address?
My current plan is to use the compile() method to define a pattern that matches the address, but what rule should that pattern follow?

E.g.: Unit 1,1 King St,Sydney NSW 2123

This may work.

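import re

# Read the whole HTML file into one string
with open("file.html") as f:
    data = f.read()

# Match the contents of every <address> element; DOTALL lets "." span line breaks
expr = re.compile(r"<address[^>]*>(.*?)</address>", re.DOTALL | re.IGNORECASE)
for match in expr.findall(data):
    print(match.strip())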
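
For the plain-text example in the question (Unit 1,1 King St,Sydney NSW 2123), one option is to anchor the pattern on the part that is most regular: the state abbreviation followed by a four-digit postcode. The sketch below assumes the layout "street part, suburb STATE postcode"; the name parse_address and the list of states are illustrative choices, not something from the thread, and real addresses will break this pattern in many ways.

```python
import re

# Assumed layout: "<street part>,<suburb> <STATE> <4-digit postcode>"
STATES = "NSW|VIC|QLD|SA|WA|TAS|NT|ACT"
ADDRESS_RE = re.compile(
    r"(?P<street>[^<>\n]*?\d[^<>\n]*?),\s*"     # street part, must contain a digit
    r"(?P<suburb>[A-Za-z][A-Za-z .'-]*?)\s+"    # suburb name, starts with a letter
    r"(?P<state>" + STATES + r")\s+"            # state abbreviation
    r"(?P<postcode>\d{4})\b"                    # four-digit postcode
)

def parse_address(text):
    """Return the address parts as a dict, or None if nothing matches."""
    match = ADDRESS_RE.search(text)
    return match.groupdict() if match else None

print(parse_address("Unit 1,1 King St,Sydney NSW 2123"))
```

Note that the street part simply takes everything before the last comma in front of the suburb, so any text that precedes the address on the same line ends up in it as well.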

Thanks, but it doesn't work for some HTML files.

For example, I want to extract the address from this HTML document: http://www.sc.iitb.ac.in/~bijnan/personal-details.htm
The address on that page reads:

Permanent Address :
B. Bandyopadhyay,
P.O. Kirnahar 731302,
Dist. Birbhum,
West Bengal, INDIA

What pattern should I use to match it?

It's easier with a concrete example ;-)
Here, there is no simple pattern that isolates the address.
You need to look at the source of the HTML page to find the elements that can help you identify the address.
Here, for example, I'd look for "Permanent Address" and take the text from the following line up to the next </tr> (everything between the two is the address).
Then you just have to clean the text by removing all the tags.

import re

infile = "personal-details.htm"
patternIN = "Permanent Address"  # marker after which we start keeping text
patternOUT = "</tr>"             # marker at which we stop keeping text
keepText = False                 # are we currently keeping lines?
address = ""                     # the address text collected so far

# Read the file line by line and keep the lines between the two markers
for line in open(infile):
    if keepText:
        address += line.strip()  # store the line without surrounding whitespace
        if patternOUT in line:   # end marker reached: stop keeping lines
            keepText = False
    if patternIN in line:        # start keeping from the next line on
        keepText = True

# Now clean up the collected text
rTags = re.compile("<.*?>")        # matches any tag
address = rTags.sub(":", address)  # replace each tag with ":" as a separator
                                   # (pick another character if the address can contain ":")
rSep = re.compile(":+")            # a run of one or more ":"
address = rSep.sub("\n", address)  # becomes a single line break
print(address)
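
The same idea can also be written as a function that works on the page source as one string, which makes it easier to try on small samples. This is only a sketch of a variant, not the code above: extract_after_label is a made-up name, the sample markup is invented for illustration (the real page source may differ), and tags are turned into line breaks directly instead of going through ":".

```python
import re

def extract_after_label(source, label, end_tag="</tr>"):
    """Return the text lines found between `label` and the next `end_tag`."""
    start = source.find(label)
    if start == -1:
        return []                      # label not on the page
    start += len(label)
    end = source.find(end_tag, start)
    if end == -1:
        end = len(source)              # no end tag: take the rest of the page
    chunk = source[start:end]
    chunk = re.sub(r"<[^>]*>", "\n", chunk)   # every tag becomes a line break
    lines = [line.strip(" \t:") for line in chunk.splitlines()]
    return [line for line in lines if line]   # drop empty lines

# Invented sample markup, only loosely modelled on the page in question
sample = """<tr>
<td><b>Permanent Address :</b></td>
<td>B. Bandyopadhyay,<br>
P.O. Kirnahar 731302,<br>
Dist. Birbhum,<br>
West Bengal, INDIA</td>
</tr>"""

for line in extract_after_label(sample, "Permanent Address"):
    print(line)
```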

Thanks very much, that is very helpful.
I have another question: if the address does not begin with "Permanent Address" and does not end with </tr>, this program cannot be used, so it only works in this one situation.
E.g. if the address begins with something like "Location", "Home", "Live in", etc., how can I extract it? Is it possible to write a program that extracts the contact address from any website?

I'm afraid not:
in HTML pages there is generally no reliable way to know where the address is, because pages rarely mark it with a special tag or anything similar. So you have to look at the sites you want to process and work out how the address can be identified on each of them.
I'd even say that sometimes it may be impossible (for example, without "Permanent Address", the page you gave me would have been very difficult to process).
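
In practice, "look at each site" tends to end up as a small table of per-site rules. The sketch below is one way to organise that, assuming Python 3 (for urllib.parse). Only the first rule comes from this thread; the example.org entry, the name SITE_RULES and the function extract_address are hypothetical, and the sample markup is invented.

```python
import re
from urllib.parse import urlparse

# host -> (start marker, end marker), written after inspecting each site by hand
SITE_RULES = {
    "www.sc.iitb.ac.in": ("Permanent Address", "</tr>"),
    "example.org": ("Location", "</div>"),   # hypothetical second site
}

def extract_address(url, source):
    """Return the address as one line of text, or None if no rule applies."""
    rule = SITE_RULES.get(urlparse(url).netloc)
    if rule is None:
        return None                    # unknown site: nothing reliable to go on
    start_marker, end_marker = rule
    start = source.find(start_marker)
    if start == -1:
        return None                    # the rule no longer matches the page
    start += len(start_marker)
    end = source.find(end_marker, start)
    chunk = source[start:end if end != -1 else len(source)]
    text = re.sub(r"<[^>]*>", " ", chunk)      # drop the tags
    return " ".join(text.split()).lstrip(": ") # collapse whitespace

page = "<tr><td>Permanent Address :</td><td>P.O. Kirnahar 731302,<br>Dist. Birbhum</td></tr>"
print(extract_address("http://www.sc.iitb.ac.in/~bijnan/personal-details.htm", page))
print(extract_address("http://unknown.example/", page))
```

Returning None for a site without a rule is deliberate: it reflects the point above that a page without a recognisable marker cannot be processed reliably.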

Thanks very much!

I don't know what's wrong with my code:

import re
import urllib
import urllib2

webURL = "http://www.sc.iitb.ac.in/~bijnan/personal-details.htm"  # the page to process
connect = urllib.urlopen(webURL)  # connect to the website
htmlDoc = connect.read()          # get the HTML document from the website

patternIN = "Permanent Address"  # marker after which we start keeping text
patternOUT = "</tr>"             # marker at which we stop keeping text
keepText = False                 # are we currently keeping lines?
address = ""                     # the address text collected so far

# Now, we read the document to keep the text
for line in htmlDoc:
    if keepText:
        address += line.strip()  # store the line without surrounding whitespace
        if patternOUT in line:   # end marker reached: stop keeping lines
            keepText = False
    if patternIN in line:        # start keeping from the next line on
        keepText = True

# Now clean up the collected text
rTags = re.compile("<.*?>")        # matches any tag
address = rTags.sub(":", address)  # replace each tag with ":" as a separator
rSep = re.compile(":+")            # a run of one or more ":"
address = rSep.sub("\n", address)  # becomes a single line break
print(address)

And what is the error?
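
Without the actual error message this is only a guess, but one thing stands out: connect.read() returns the whole page as a single string, and a loop like "for line in htmlDoc" over a string goes through it one character at a time, not one line at a time. The earlier version looped over a file object, which does yield lines. A single character can never contain "Permanent Address", so keepText would never become True and address would stay empty. The short sketch below shows the difference on a made-up string; splitting with splitlines() is one possible fix, not necessarily the only one. (In Python 3, read() on a URL returns bytes, which would also need decoding first.)

```python
# Made-up stand-in for the downloaded page
html_doc = "first line\nPermanent Address\nsecond line\n"

# Looping over a string yields single characters
first_pieces = [piece for piece in html_doc][:5]

# so the marker is never found this way
hits_by_char = [piece for piece in html_doc if "Permanent Address" in piece]

# splitlines() yields whole lines, which is what the loop expects
hits_by_line = [line for line in html_doc.splitlines() if "Permanent Address" in line]

print(first_pieces)
print(hits_by_char)
print(hits_by_line)
```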
