In my current project, I need to write Python code to extract text from tons of pages grabbed from the web. By extraction, I mean stripping all tags and comments and, if possible, filtering out small sections like navigation links. The only thing left should be the lengthy paragraphs, if there are any.
1. About stripping the tags
I tried the html2text lib, but it stops when it encounters an error, like an ill-formatted tag, an unclosed tag, etc. I didn't figure out how to make it ignore such errors.
So I ended up using the old BeautifulSoup. My code snippet is:

    from BeautifulSoup import BeautifulSoup, Comment  # BeautifulSoup 3

    soup = BeautifulSoup(html)

    # Remove HTML comments
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()

    # Remove <script> and <style> blocks
    for tag in soup.findAll('script'):
        tag.extract()
    for tag in soup.findAll('style'):
        tag.extract()

    content = ''.join(soup.findAll(text=True))
So far it has worked fine with the dozen documents I tested, but I don't think it's robust enough. Do you know of any library that can do the job?
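In case it helps to show what I mean by tolerating broken markup: here is a rough sketch using only the standard library. `html.parser.HTMLParser` is lenient with malformed input (unclosed or ill-formed tags don't raise errors the way html2text did for me), so something like this might be a more robust baseline. The class and function names are just mine, not from any library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script>/<style> bodies.

    Comments are dispatched to handle_comment, which we deliberately
    do not override, so they are dropped automatically.
    """
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

The parser never raises on sloppy HTML; it just calls the handlers for whatever it can recognize, which is exactly the failure mode I want.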
2. About filtering out short text, like titles and list items
My solution is to split the whole document by '\n\n', and for each 'paragraph' I use a simple predicate to decide whether it's a lengthy paragraph (has at least one ',' and more than 200 chars):
text.find(',') != -1 and len(text) >= 200
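Wrapped into a runnable sketch, the split-and-filter approach above looks like this (the comma check and the 200-char threshold are just my current guesses, and the function names are mine):

```python
def is_long_paragraph(text):
    # Heuristic from above: a "real" paragraph contains at least
    # one comma and is at least 200 characters long.
    return ',' in text and len(text) >= 200

def extract_paragraphs(document):
    # Split on blank lines and keep only the chunks that pass
    # the predicate; everything else (nav links, titles, list
    # items) is assumed to be too short to qualify.
    return [p.strip() for p in document.split('\n\n')
            if is_long_paragraph(p.strip())]
```

Navigation bars and copyright footers tend to be short and comma-free, so they get dropped, but the thresholds clearly need tuning.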
Of course it didn't work very well. Do you have any suggestions?
Thanks a lot.