Hello

I'm building my own HTML parser in Python and have run into some problems.

First off, I'm using Python 3, so I can't use the old bundled SGMLParser or Beautiful Soup, and I could not find Windows binaries for lxml, so I'm rolling my own. It is for my master's thesis, so it's not wasted effort anyway. The parser will be used to parse pages I find with my crawler for statistical analysis.

What I use: regex. I found this beautiful regex (?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?> that works like a charm. I get every tag in the page and I track the start and end positions of the tag.
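
A minimal sketch of how the regex above can be used to collect tag names together with their start and end positions. The helper name find_tags and the sample string are made up for illustration; only the pattern itself comes from the post.

```python
import re

# The regex quoted above, written as a raw string so the backslashes
# reach the regex engine unchanged.
TAG_RE = re.compile(
    r"(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
)

def find_tags(html):
    """Return (name, start, end) for every tag the regex matches."""
    return [(m.group(1).lower(), m.start(), m.end())
            for m in TAG_RE.finditer(html)]

# Hypothetical input, not taken from the crawled pages
sample = '<p class="x">Hi <a href=/home>there</a></p>'
tags = find_tags(sample)
for name, start, end in tags:
    print(name, start, end, sample[start:end])
```

Group 1 holds the tag name (with a leading "/" for closing tags), and the match offsets give the positions that the rest of the approach relies on.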

The problem: I'm really not interested in whatever goes on between <script></script> tags. Since script tags cannot contain HTML, I thought it was just a matter of matching the start and end tags and removing whatever is in between. But it was not that easy. The biggest problem I face is JavaScript code that outputs script code itself!

An example:

document.write('<SCRIPT LANGUAGE=VBScript\> \n');
document.write('on error resume next \n');
document.write('ShockMode = (IsObject(CreateObject("ShockwaveFlash.ShockwaveFlash.6")))\n');
document.write('<\/SCRIPT\> \n');

My regex matches the <SCRIPT> tag inside document.write, but I really don't want that, especially since it doesn't match the escaped <\/SCRIPT\> closing tag, and that really messes up my parsing.
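
The mismatch can be reproduced with a short sketch like the one below, which runs the regex from the post over the four document.write lines. The variable names are illustrative; the point is that the escaped opening tag is matched while the escaped closing tag is not, because the character after "<" is a backslash rather than "/" or a word character.

```python
import re

TAG_RE = re.compile(
    r"(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
)

# The document.write lines from the example, kept verbatim in a raw string
js = r"""document.write('<SCRIPT LANGUAGE=VBScript\> \n');
document.write('on error resume next \n');
document.write('ShockMode = (IsObject(CreateObject("ShockwaveFlash.ShockwaveFlash.6")))\n');
document.write('<\/SCRIPT\> \n');"""

# Collect the full text of every tag the regex finds
matched = [m.group(0) for m in TAG_RE.finditer(js)]
print(matched)
```

Only one tag comes back, so a naive count of opening and closing script tags ends up unbalanced.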

Does anyone have any good ideas for how I can solve my problem?

And if someone spots any other problems I might run into with this method of parsing, I would love to be made aware of them :)


Best regards

Vidaj


Somebody mentioned this in another thread.

You can convert BeautifulSoup.py from Python25 to Python30 with 2to3.py and it will work with Python30. You can use this little utility program:

# convert a Python25 code file to a Python30 code file
# generates a backup file and overwrites the original
# file with the converted file
# to be safe copy the file to be converted into the
# working directory of this program

import subprocess

# the Python2x code file you want to convert ...
python2x_scriptfile = "BeautifulSoup.py"

subprocess.call([r"C:\Python30\Python.exe",
    r"C:\Python30\Tools\Scripts\2to3.py",
    "-w",
    python2x_scriptfile])


Ouch! My baddy!
BeautifulSoup needs sgmllib, which was removed in Python 3.
If you are smart, stay away from Python30 and use Python version 2.5.4, the most stable production-grade version.

Consider Python26 and Python30 experimental versions at best.

I found a solution :) It was quite obvious, but I just didn't see the answer earlier :P I solved it by doing a two-phase scan of the HTML. First I find all tags in the document. Then I compile a list of every <script> tag and a list of every </script> tag, and balance them based on their start and stop positions in the document: if a <script> tag comes before the previous <script> tag was closed with a </script>, I remove it, and the other way around if there are too many </script> tags.

Then I re-scan the new document without the JavaScript and grab all the tags, but this time I get no tags embedded in JavaScript, or any JavaScript at all :)
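
The two-phase scan described above could be sketched roughly as follows. This is not the poster's code: the function names strip_scripts and tag_names and both simplified patterns are assumptions made to keep the example short.

```python
import re

# Assumed, simplified patterns: one for script open/close tags only,
# one for any tag name.
SCRIPT_TAG_RE = re.compile(r"(?i)<(/?)script\b[^>]*>")
ANY_TAG_RE = re.compile(r"<(/?\w+)[^>]*>")

def strip_scripts(html):
    """Phase one: drop everything between balanced script tag pairs."""
    kept = []
    open_at = None   # start offset of the currently open <script>, if any
    last_end = 0
    for m in SCRIPT_TAG_RE.finditer(html):
        is_close = m.group(1) == "/"
        if not is_close and open_at is None:
            open_at = m.start()              # a real opening tag
        elif is_close and open_at is not None:
            kept.append(html[last_end:open_at])
            last_end = m.end()
            open_at = None
        # nested opening tags and stray closing tags are ignored
    kept.append(html[last_end:])             # an unclosed script is left in
    return "".join(kept)

def tag_names(html):
    """Phase two: rescan the cleaned document for tag names."""
    return [name.lower() for name in ANY_TAG_RE.findall(html)]

page = (r"<p>a</p><script>document.write('<SCRIPT LANGUAGE=VBScript\> ');"
        r"document.write('<\/SCRIPT\> ');</script><b>c</b>")
cleaned = strip_scripts(page)
print(cleaned)
print(tag_names(cleaned))
```

Walking the script tags in document order with a single "open" marker is one way to do the balancing; an opening tag seen while a script is already open is treated as script content.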

I was a little concerned about the speed of this, because I intend to process a whole lot of documents, but it turned out to be quite efficient. My test page is the front page of Norway's biggest tabloid (vg.no), and it's about 240 kilobytes. It takes 0.067 seconds to do the two-phase scan, and that's not half bad: roughly 15 pages per second.


The main reason I use Python 3 (apart from the goody feeling of living on the bleeding edge :P) is the multiprocessing package. It does wonders for my crawler, since I can run Python in parallel without having to worry about the GIL.

vidaj, wow! Thanks for letting us know why you were using Python30. Somebody mentioned the BeautifulSoup conversion in a thread here, but it seems to be BS.

Any chance you might want to show your code here?

I can post the code I have so far. It's not finished at all, but at least it's something to take a look at.

import re
import copy
import time

class HtmlParser(object):
    """
    Parser for HTML. 
    """
    
    # Raw string, so sequences like \/ and \w are not treated as string escapes
    tagRegex = re.compile(r"(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>")
    """
    Group 0 = the whole tag from <... to >
    Group 1 = the name of the tag
    
    Shamelessly taken from http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
    """
    
    def __init__(self, html):
        """
        Constructs the parser.
        Pass the html you want to parse as a string.
        """
        self.rawhtml = html
        self.links = {}      # Dictionary with all the html-tags. Index is the name of the tag in lowercase, value is a list of all the html-tags
        self.indices = []    # A sorted list of where all tags start in the html. 
        self.tagPos = {}     # A map where the tags index is mapped to the tag itself. Key = index
    
    def parse(self, html=None):
        """
        Parses tags from the html.
        """
        if html is None: self.html = self.rawhtml
        else: self.html = html
    
        self.parseTags()
        self.balanceTags()
        self.javascript = self.removeJavaScript()
        self.parseTags()
        
    def parseTags(self):
        """
        Parses all tags from self.html.
        """
        tagPos = {}
        indices = []
        links = {}
        for match in self.tagRegex.finditer(self.html):
            name = match.group(1).lower()
            value = (name, match.group(0), match.start(), match.end())
            indices.append(match.start())
            tagPos[match.start()] = value
            # Create the list for this tag name on first use, then append
            links.setdefault(name, []).append(value)
        
        indices.sort()
        self.links, self.tagPos, self.indices = links, tagPos, indices

    def balanceTags(self):
        """
        Balances tags
        """
        if 'script' in self.links: self.balanceTag('script')

    def balanceTag(self, tagname):
        """
        Tries to balance out the start and close tags of a specific tagname.
        A start tag that appears before the previous one was closed is
        dropped, and so is a close tag that has no open start tag.
        """
        closeName = '/{0}'.format(tagname)
        # .get() avoids a KeyError when a page has no tags of one kind
        starts = self.links.get(tagname, [])
        stops = self.links.get(closeName, [])

        # Walk all tags in document order and build new lists, instead of
        # removing items from a list while iterating over it.
        keptStarts, keptStops = [], []
        isOpen = False
        for tag in sorted(starts + stops, key=lambda t: t[2]):
            isStart = not tag[0].startswith('/')
            if isStart and not isOpen:
                keptStarts.append(tag)
                isOpen = True
            elif not isStart and isOpen:
                keptStops.append(tag)
                isOpen = False
        # A trailing start tag that is never closed has no partner; drop it
        if isOpen: keptStarts.pop()

        self.links[tagname] = keptStarts
        self.links[closeName] = keptStops
    
    def countStartTag(self, tagname):
        """
        Counts the number of start-tags with the specified name.
        """
        return len(self.links[tagname])
    
    def countEndTag(self, tagname):
        """
        Counts the number of closing-tags with the specified name. No '/' is needed in tagname.
        """
        return len(self.links["/{0}".format(tagname)])
    
    def reparse(self):
        """
        Reparses the html
        """
        self.parse(self.html)
    
    def getTags(self, tagname):
        """
        Returns a copy of the lists of all tags with the specified name.
        """
        return copy.copy(self.links[tagname])

    def removeJavaScript(self):
        """
        Removes javascript from the html
        """
        html = ""
        removed = ""
        # .get() avoids a KeyError on pages without any script tags
        startTags = self.links.get('script', [])
        stopTags = self.links.get('/script', [])
        
        lastStop = 0
        for start, stop in zip(startTags, stopTags):
            html += self.html[lastStop:start[2]]
            removed += self.html[start[2]:stop[3]]
            lastStop = stop[3]
        html += self.html[lastStop:]
        self.html = html
        return removed


if __name__ == '__main__':
    
    #htmlFileOnDisk = '<<insert name here>>'
    with open('VGNettForsiden.htm', 'r') as htmlFile:
        htmlString = htmlFile.read()
    
    before = time.time()
    parser = HtmlParser(htmlString)
    parser.parse()
    
    after = time.time()
    print("Parsing took {0} seconds.".format(after - before))
    
    
    
    a = parser.getTags('a')[1][1]
    print(a)

-Vidaj-
