HTML parsing

Question

Enders_Game 13 Newbie Poster

14 Years Ago

***Edited: added second problem

I have this code. How do I make it stop once it has found the first tag that satisfies the if condition? break, myparser.close(), and return don't work.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):   
        if tag == 'a' and (attrs[0][1].find('downloads&showfile') != -1):
            print(attrs[0][1])
            print(attrs[1][1])
       

info = open('download.txt', 'r').read()
myparser = MyHTMLParser()
myparser.feed(info)

2.
I have this code. How do I make blahblah the name of the file I downloaded? The name of the file is not shown in the URL.

file = urllib.request.urlopen('http://beta.getdota.com/index.php?app=downloads&module=display&section=download&do=confirm_download&id=' + current)
output = open(blahblah ,'wb')
output.write(file.read())
output.close()

python

Edited 14 Years Ago by Enders_Game because: n/a

jcao219 18 Posting Pro in Training

14 Years Ago

After print(attrs[1][1]), maybe you could add the line self.reset()?

I am just guessing. I've never used HTMLParser before.

woooee 814 Nearly a Posting Maven

14 Years Ago

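Can you feed HTMLParser one record at a time until the data is found? It would depend on the data type, of course (I'm not an HTMLParser user either).

# assumes info is a list of records and that handle_starttag sets self.found = True on a match
ctr = 0
while not myparser.found and ctr < len(info):
    myparser.feed(info[ctr])
    ctr += 1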
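
The loop above relies on a found attribute that HTMLParser does not provide on its own. Here is a minimal sketch of the idea, assuming the subclass sets the flag itself and that the input is available as a list of lines. FlagParser, find_first_link and the sample markup are hypothetical names and data made up for illustration, not code from the thread.

```python
from html.parser import HTMLParser


class FlagParser(HTMLParser):
    """Hypothetical parser that sets self.found once a matching <a> is seen."""

    def __init__(self):
        super().__init__()
        self.found = False
        self.link = None

    def handle_starttag(self, tag, attrs):
        if self.found or tag != 'a':
            return  # ignore other tags and anything after the first match
        href = dict(attrs).get('href') or ''
        if 'downloads&showfile' in href:
            self.found = True
            self.link = href


def find_first_link(lines):
    """Feed the parser one line at a time and stop as soon as the flag is set."""
    parser = FlagParser()
    for line in lines:
        parser.feed(line)
        if parser.found:
            break
    return parser.link


# Made-up sample input standing in for the lines of download.txt.
sample = [
    '<p>intro</p>\n',
    '<a href="index.php?app=downloads&showfile=1" title="first">one</a>\n',
    '<a href="index.php?app=downloads&showfile=2" title="second">two</a>\n',
]
first_link = find_first_link(sample)
print(first_link)
```

Because feed() buffers incomplete markup between calls, the chunks do not have to line up with tag boundaries. Smaller chunks only mean the loop notices the flag sooner.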

Edited 14 Years Ago by woooee because: n/a

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 1 · 2010-04-11T12:52:04+00:00

I think the best solution is to raise a custom exception in handle_starttag when the if condition is met. Then you write:

try:
    parser.feed(info)
    parser.close()
except MyCustomError:
    pass

jcao219 18 Posting Pro in Training · Answer 2 · 2010-04-11T21:39:25+00:00

This is roughly what Gribouillis means:

from html.parser import HTMLParser

class FoundMatchException(Exception):
   def __init__(self, value):
       self.parameter = value
   def __str__(self):
       return repr(self.parameter)


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):   
        if tag == 'a' and (attrs[0][1].find('downloads&showfile') != -1):
            print(attrs[0][1])
            print(attrs[1][1])
            raise FoundMatchException("Stopping parser.")
       

info = open('download.txt', 'r').read()
myparser = MyHTMLParser()
try:
    myparser.feed(info)
    myparser.close()
except FoundMatchException:
    pass
finally:
    pass  # whatever cleanup code you want (a finally block cannot contain only a comment)

For your 2nd problem, I know how to do it in .NET, and I'm looking for a way to do that using the Python standard library.

EDIT:
does this work?

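import urllib.request

# current is the download id, as in the question
file = urllib.request.urlopen('http://beta.getdota.com/index.php?app=downloads&module=display&section=download&do=confirm_download&id=' + current)
# get_filename() reads the Content-Disposition header; it returns None if the server sends none
fname = file.info().get_filename()
output = open(fname, 'wb')
output.write(file.read())
output.close()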
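
get_filename() takes the name from the response's Content-Disposition header and returns None when the server does not send one, in which case open(fname, 'wb') would fail. Below is a hedged sketch of one possible fallback, assuming the last path component of the URL is an acceptable second choice. pick_filename, the default name 'download.bin' and the example.com URLs are assumptions made up for illustration. The demonstration builds the headers by hand, so it needs no network access.

```python
import os
import urllib.parse
from email.message import Message


def pick_filename(headers, url, default='download.bin'):
    """Choose a local file name for a downloaded response (hypothetical helper)."""
    name = headers.get_filename()  # from Content-Disposition, may be None
    if not name:
        # Fall back to the last path component of the URL.
        name = os.path.basename(urllib.parse.urlsplit(url).path)
    # Drop any directory parts so the name cannot point outside the current directory.
    name = os.path.basename(name.replace('\\', '/'))
    return name or default


# Hand-built headers stand in for what urlopen(...).info() would return.
with_header = Message()
with_header['Content-Disposition'] = 'attachment; filename="map_v1.zip"'
no_header = Message()

name_from_header = pick_filename(with_header, 'http://example.com/index.php?id=1')
name_from_url = pick_filename(no_header, 'http://example.com/files/patch.zip?id=2')
name_default = pick_filename(no_header, 'http://example.com/')
print(name_from_header, name_from_url, name_default)
```

With a real response, the call would look like pick_filename(file.info(), file.geturl()), since the object returned by info() is a subclass of email.message.Message.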

Enders_Game 13 Newbie Poster · Answer 3 · 2010-04-12T02:10:59+00:00

Thanks a lot for the responses, guys.

Here's the solution I found, if you guys are curious. I haven't gotten rid of the redundant code yet, though. I didn't post exactly what I wanted to do with the code in my original question, so sorry about that.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    x = None
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and (attrs[0][1].find('downloads&showfile') != -1):
            self.x = attrs[0][1]
    def get_info(self):
        return self.x

info = open('download.txt', 'r').read()
parser = MyHTMLParser()

# info is a str, so this loop feeds the parser one character at a time
for char in info:
    parser.feed(char)
    temp = parser.get_info()
    if temp is not None:
        break
print(temp)

Also, @jcao: your second solution worked perfectly, thank you very much ^^
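
The positional lookups attrs[0][1] and attrs[1][1] used in the parsers in this thread raise IndexError on an a tag that has no attributes, and they depend on href being the first attribute. Here is a minimal sketch of one way around that, assuming the link of interest is in href: it looks attributes up by name and lets the exception carry the match out of the parser. LinkFound, FirstLinkParser, first_download_link and the sample markup are hypothetical, not code from the thread.

```python
from html.parser import HTMLParser


class LinkFound(Exception):
    """Raised to stop parsing; carries the matching tag's attributes."""

    def __init__(self, attrs):
        super().__init__('matching link found')
        self.attrs = attrs


class FirstLinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)             # look attributes up by name
        href = attr_map.get('href') or ''  # missing or valueless href becomes ''
        if 'downloads&showfile' in href:
            raise LinkFound(attr_map)


def first_download_link(html):
    """Return the attributes of the first matching link, or None if there is none."""
    parser = FirstLinkParser()
    try:
        parser.feed(html)
        parser.close()
    except LinkFound as match:
        return match.attrs
    return None


# Made-up markup: two anchors that would break positional indexing, then two matches.
html = (
    '<a>bare anchor</a>'
    '<a name>valueless attribute</a>'
    '<a title="Map 1" href="index.php?app=downloads&showfile=1">first</a>'
    '<a title="Map 2" href="index.php?app=downloads&showfile=2">second</a>'
)
result = first_download_link(html)
print(result)
```

Raising inside the handler abandons the parser in mid-document, so this sketch creates a fresh parser for every call instead of reusing one.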