I'm building my own HTML parser in Python and have run into some problems.
First off, I'm using Python 3, so I can't use the old bundled sgmllib parser or Beautiful Soup, and I could not find Windows binaries for lxml, so I'm rolling my own. It is for my master's thesis, so the effort isn't wasted anyway. The parser will be used to parse pages I find with my crawler for statistical analysis.
What I use: regex. I found this beautiful regex
(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?> that works like a charm. I get every tag in the page and I track the start and end positions of the tag.
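To make the approach concrete, here is a minimal sketch of the tag scan described above. It is not my exact code: the names `TAG_RE` and `find_tags` are made up for illustration, and the escaped quotes from the regex are dropped because the pattern is written as a raw Python string.

```python
import re

# The regex from above, written as a raw triple-quoted string so the
# quote characters inside the pattern need no escaping.
TAG_RE = re.compile(
    r"""(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)


def find_tags(html):
    """Return (tag_name, start, end) for every tag the regex matches.

    tag_name keeps the leading "/" for closing tags, e.g. "/p".
    """
    return [
        (m.group(1).lower(), m.start(), m.end())
        for m in TAG_RE.finditer(html)
    ]


if __name__ == "__main__":
    page = '<html><body><p class="x">Hi</p><br/></body></html>'
    for name, start, end in find_tags(page):
        print(name, start, end, page[start:end])
```

Each tuple holds the tag name plus the start and end offsets of the match, which is all the position tracking needs.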
The problem shows up with inline JavaScript like this:
document.write('<SCRIPT LANGUAGE=VBScript\> \n');
document.write('on error resume next \n');
document.write('ShockMode = (IsObject(CreateObject("ShockwaveFlash.ShockwaveFlash.6")))\n');
document.write('<\/SCRIPT\> \n');
My regex matches the <SCRIPT> tag inside the first document.write call, but I really don't want that, especially since it doesn't match the closing <\/SCRIPT\> tag. The unbalanced open tag really messes up my parsing.
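The mismatch can be reproduced with a short script. This is only a sketch; `TAG_RE` and `snippet` are illustrative names, and the pattern is the same regex written as a raw Python string.

```python
import re

# Same regex as above, as a raw string (no escaped quotes needed).
TAG_RE = re.compile(
    r"""(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)

# The JavaScript from the page, kept verbatim in a raw string so the
# backslashes stay literal.
snippet = r"""document.write('<SCRIPT LANGUAGE=VBScript\> \n');
document.write('on error resume next \n');
document.write('ShockMode = (IsObject(CreateObject("ShockwaveFlash.ShockwaveFlash.6")))\n');
document.write('<\/SCRIPT\> \n');"""

matches = [m.group(0) for m in TAG_RE.finditer(snippet)]

# Only the opening tag is found: the unquoted attribute value swallows the
# backslash before ">", while "<\/" never matches because the regex expects
# "/" or a word character directly after "<".
print(matches)
```

So the scan sees one opening SCRIPT tag and no closing tag for it.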
Does anyone have a good idea for how to solve this?
And if someone spots any other problems I might run into with this method of parsing, I would love to be made aware of them :)