Removing javascript from html
Thread Solved
Join Date: Jul 2007
Posts: 66
Reputation:
Solved Threads: 14
Hello
I'm building my own HTML parser in Python and have run into some problems.
First off, I'm using Python 3, so I can't use the old bundled SGMLParser or Beautiful Soup, and I could not find Windows binaries for lxml, so I'm rolling my own. It is for my master's thesis, so the effort is not wasted anyway. The parser will be used to parse pages I find with my crawler for statistical analysis.
What I use: regex. I found this beautiful regex
(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>
that works like a charm. I get every tag in the page, and I track the start and end positions of each tag.
The problem: I'm really not interested in whatever goes on between <script></script> tags. Since script tags cannot contain HTML, I thought it was just a matter of matching the start and end tags and removing whatever is in between. But it was not that easy. The biggest problem I face is JavaScript code that outputs JavaScript code itself!
An example:
    document.write('<SCRIPT LANGUAGE=VBScript\> \n');
    document.write('on error resume next \n');
    document.write('ShockMode = (IsObject(CreateObject("ShockwaveFlash.ShockwaveFlash.6")))\n');
    document.write('<\/SCRIPT\> \n');

My regex matches the <SCRIPT> tag in the first document.write call, but I really don't want that, especially since it doesn't match the escaped <\/SCRIPT\> tag in the last call, and that really messes up my parsing.
Anyone got any good ideas for what I can do to solve my problem?
And if someone spots any other problems I might run into with this method of parsing, I would love to be made aware of them.

Best regards
Vidaj
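The mismatch described in the post above can be reproduced with a small sketch. It uses the regex exactly as quoted, written as a raw string; `SAMPLE` and `tag_names` are hypothetical names made up for this illustration, and the sample page is a shortened version of the quoted JavaScript wrapped in a real script block.

```python
import re

# The tag regex quoted in the post, as a raw string.
TAG_REGEX = re.compile(
    r"(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
)

# Hypothetical test page: a script block that writes another script block.
SAMPLE = (
    "<script>\n"
    "document.write('<SCRIPT LANGUAGE=VBScript\\> \\n');\n"
    "document.write('<\\/SCRIPT\\> \\n');\n"
    "</script>\n"
)


def tag_names(html):
    """Return the lower-cased names of all tags the regex finds, in order."""
    return [m.group(1).lower() for m in TAG_REGEX.finditer(html)]


names = tag_names(SAMPLE)
print(names)
```

The regex finds two start tags but only one close tag, because the escaped `<\/SCRIPT\>` does not look like a tag to it. That imbalance is what breaks a naive "remove everything between start and end" pass.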
Somebody mentioned this in another thread.
You can convert BeautifulSoup.py from Python25 to Python30 with 2to3.py and it will work with Python30. You can use this little utility program:
    # convert a Python25 code file to a Python30 code file
    # generates a backup file and overwrites the original
    # file with the converted file
    # to be safe copy the file to be converted into the
    # working directory of this program

    import subprocess

    # the Python2x code file you want to convert ...
    python2x_scriptfile = "BeautifulSoup.py"

    subprocess.call([r"C:\Python30\Python.exe",
                     r"C:\Python30\Tools\Scripts\2to3.py",
                     "-w",
                     python2x_scriptfile])
No one died when Clinton lied.
Ouch! My baddy!
BeautifulSoup needs sgmllib, which is not included in Python30.
If you are smart, stay away from Python30 and use Python version 2.5.4, the most stable production-grade version.
Consider Python26 and Python30 experimental versions at best.
Last edited by sneekula; Apr 4th, 2009 at 1:41 pm.
I found a solution
It was quite obvious, but I just didn't see the answer earlier.
I solved it by doing a two-phase scan of the HTML. First I find all tags in the document. Then I compile a list of every <script> tag and a list of every </script> tag, and I balance them based on their start and stop positions in the document: if a <script> tag appears before the previous <script> tag has been closed with a </script>, I remove it from the list, and likewise I remove surplus </script> tags if there are too many of them.
Then I cut out everything between the balanced pairs, re-scan the new document without the JavaScript, and grab all the tags. This time I get no tags embedded in JavaScript, or any JavaScript at all.
I was a little concerned about the speed of this, because I intend to process a whole lot of documents, but it turned out to be quite efficient. My test page is the front page of Norway's biggest tabloid (vg.no), and it's about 240 kilobytes. It takes 0.067 seconds to do the two-phase scan, and that's not half bad: about 15 pages per second (1 / 0.067 ≈ 14.9).
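The two-phase scan described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's code: `SCRIPT_TAG`, `strip_scripts` and `page` are hypothetical names, phase one uses a simplified pattern that only looks for script tags, and phase two uses a name-only tag pattern instead of the full regex from the first post.

```python
import re

# Assumption: only <script ...> and </script> matter in phase one, so this
# pattern is much simpler than the full tag regex quoted in the thread.
SCRIPT_TAG = re.compile(r"(?i)<(/?)script\b[^>]*>")


def strip_scripts(html):
    """Return (html_without_scripts, removed_text) using balanced script tags."""
    kept, removed = [], []
    open_start = None  # offset of the <script> that is currently open, if any
    last_end = 0       # offset just past the last removed block
    for m in SCRIPT_TAG.finditer(html):
        is_close = m.group(1) == "/"
        if not is_close and open_start is None:
            open_start = m.start()  # the first start tag opens a block
        elif is_close and open_start is not None:
            kept.append(html[last_end:open_start])
            removed.append(html[open_start:m.end()])
            last_end = m.end()
            open_start = None
        # Otherwise: a start tag inside an open block, or a stray close tag.
        # Both are skipped, which is the balancing step.
    kept.append(html[last_end:])
    return "".join(kept), "".join(removed)


# Hypothetical page with a script that writes another script.
page = (
    "<p>a</p><script>"
    "document.write('<SCRIPT LANGUAGE=VBScript\\> \\n');"
    "document.write('<\\/SCRIPT\\> \\n');"
    "</script><p>b</p>"
)
clean, removed = strip_scripts(page)

# Phase two: scan the cleaned document for tags (name-only pattern).
print(re.findall(r"<(/?\w+)", clean))
```

Balancing in a single pass over the tags in document order avoids having to compare two separate lists index by index, at the cost of treating any start tag inside an open script block as script content.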
The main reason I use Python 3 (apart from the good feeling of living on the bleeding edge) is the multiprocessing package. It does wonders for my crawler, since I can do parallel Python without having to worry about the GIL.
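As a rough sketch of how the multiprocessing package can spread page parsing over several processes: `count_tags` and `pages` below are hypothetical stand-ins for the real parser and the crawler's output, and the pool size of 2 is an arbitrary choice for the example.

```python
import re
from multiprocessing import Pool


def count_tags(html):
    """Hypothetical per-page job: count the opening tags in one HTML string."""
    return len(re.findall(r"<\w+", html))


if __name__ == "__main__":
    # Hypothetical in-memory pages; a crawler would supply the real HTML.
    pages = ["<p>one</p>", "<div><a href='#'>two</a></div>", "plain text"]
    # Each worker is a separate process with its own interpreter and its own
    # GIL, so CPU-bound parsing of different pages can run in parallel.
    with Pool(processes=2) as pool:
        counts = pool.map(count_tags, pages)
    print(counts)
```

The job function is defined at module level so that worker processes can import it, and the `if __name__ == "__main__":` guard keeps the pool from being created again when a worker loads the module.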
I can post the code I have so far. It's not finished at all, but at least it's something to take a look at.
    import re
    import copy
    import time


    class HtmlParser(object):
        """ Parser for HTML. """

        tagRegex = re.compile(r"(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>")
        """
        Group 0 = the whole tag from <... to >
        Group 1 = the name of the tag
        Shamelessly taken from http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
        """

        def __init__(self, html):
            """ Constructs the parser. Pass the html you want to parse as a string. """
            self.rawhtml = html
            self.links = {}    # All the html-tags. Key is the lowercase tag name, value is a list of all tags with that name
            self.indices = []  # A sorted list of where all tags start in the html.
            self.tagPos = {}   # Maps a tag's start index to the tag itself. Key = index

        def parse(self, html=None):
            """ Parses tags from the html. """
            if html is None:
                self.html = self.rawhtml
            else:
                self.html = html
            self.parseTags()
            self.balanceTags()
            self.javascript = self.removeJavaScript()
            self.parseTags()

        def parseTags(self):
            """ Parses all tags from self.html. """
            tagPos = {}
            indices = []
            links = {}
            for match in self.tagRegex.finditer(self.html):
                name = match.group(1).lower()
                # Each tag is stored as (name, full text, start offset, end offset).
                value = (name, match.group(0), match.start(), match.end())
                indices.append(match.start())
                tagPos[match.start()] = value
                links.setdefault(name, []).append(value)
            indices.sort()
            self.links, self.tagPos, self.indices = links, tagPos, indices

        def balanceTags(self):
            """ Balances tags """
            if 'script' in self.links:
                self.balanceTag('script')

        def balanceTag(self, tagname):
            """
            Tries to balance out the start and close tags of a specific tagname.
            A start tag found while an earlier one is still open is dropped
            (i.e. a <script> inside a <script>), and so is a close tag that
            has no open start tag before it.
            """
            closename = '/{0}'.format(tagname)
            starts = self.links.get(tagname, [])
            stops = self.links.get(closename, [])
            keptStarts = []
            keptStops = []
            isOpen = False
            # Walk both lists in document order (index 2 is the start offset).
            for tag in sorted(starts + stops, key=lambda t: t[2]):
                if tag[0] == tagname and not isOpen:
                    keptStarts.append(tag)
                    isOpen = True
                elif tag[0] == closename and isOpen:
                    keptStops.append(tag)
                    isOpen = False
            if isOpen:
                # The last start tag was never closed, so drop it as well.
                keptStarts.pop()
            self.links[tagname] = keptStarts
            self.links[closename] = keptStops

        def countStartTag(self, tagname):
            """ Counts the number of start-tags with the specified name. """
            return len(self.links.get(tagname, []))

        def countEndTag(self, tagname):
            """ Counts the number of closing-tags with the specified name. No '/' is needed in tagname. """
            return len(self.links.get("/{0}".format(tagname), []))

        def reparse(self):
            """ Reparses the html """
            self.parse(self.html)

        def getTags(self, tagname):
            """ Returns a copy of the list of all tags with the specified name. """
            return copy.copy(self.links[tagname])

        def removeJavaScript(self):
            """ Removes javascript from the html and returns the removed text. """
            html = ""
            removed = ""
            # .get() keeps pages without any script tags from raising KeyError.
            startTags = self.links.get('script', [])
            stopTags = self.links.get('/script', [])
            lastStop = 0
            for start, stop in zip(startTags, stopTags):
                html += self.html[lastStop:start[2]]
                removed += self.html[start[2]:stop[3]]
                lastStop = stop[3]
            html += self.html[lastStop:]
            self.html = html
            return removed


    if __name__ == '__main__':
        #htmlFileOnDisk = '<<insert name here>>'
        with open('VGNettForsiden.htm', 'r') as file:
            htmlString = file.read()
        before = time.time()
        parser = HtmlParser(htmlString)
        parser.parse()
        after = time.time()
        print("Parsing took {0} seconds.".format(after - before))
        a = parser.getTags('a')[1][1]
        print(a)
-Vidaj-