Hi all,

In my current project, I need to write Python code that extracts text from tons of pages grabbed from the web. By extraction, I mean stripping all tags and comments and, if possible, filtering out small sections like navigation links. The only thing that should be left is the lengthy paragraph text, if there is any.

1. About stripping the tags

I tried the html2text lib, but it stops when it encounters an error, like an ill-formatted tag, an unclosed tag, etc. I didn't figure out how to have it ignore such errors.
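For reference, this is roughly how I was calling it; the broad try/except is just a crude guess at skipping documents that break the parser, not something I actually got working:

    import html2text

    def strip_tags(html):
        try:
            return html2text.html2text(html)
        except Exception:
            # html2text raises on ill-formed markup; skip such
            # documents instead of crashing the whole run.
            return None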

So I ended up using the old BeautifulSoup. My code snippet is:

    from BeautifulSoup import BeautifulSoup, Comment

    soup = BeautifulSoup(html)
    # Strip all HTML comments
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()
    # Strip <script> and <style> elements so their contents
    # don't end up in the extracted text
    for tag in soup.findAll(['script', 'style']):
        tag.extract()
    # Join whatever text nodes are left
    content = ''.join(soup.findAll(text=True))

Currently, it works fine with the dozen documents I've tested, but I don't think it's robust enough. Do you know any library that can do the job?

2. About filtering out short text, like title, list items

My solution is to split the whole document by '\n\n', and for each 'paragraph' I use a simple predicate to determine whether it is a lengthy paragraph (has at least one ',' and at least 200 chars):

text.find(',') != -1 and len(text) >= 200
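If it helps, here is a sketch of the same split-and-filter idea with a slightly stronger heuristic; the 40-word threshold and the punctuation check are arbitrary guesses, not tested values:

    def is_paragraph(text, min_words=40):
        # Real prose tends to have many words and sentence-ending
        # punctuation; titles and navigation links usually don't.
        words = text.split()
        ends_like_prose = any(ch in text for ch in '.!?')
        return len(words) >= min_words and ends_like_prose

    paragraphs = [p.strip() for p in content.split('\n\n')
                  if is_paragraph(p.strip())]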

Of course it didn't work very well. Do you have any suggestions?

Thanks a lot.

It certainly can't exclude all the short sections.

> Currently, it works fine with the dozen documents I've tested, but I don't think it's robust enough. Do you know any library that can do the job?

BeautifulSoup is very robust; not many parsers are that good.
You also have lxml, which is good; it has BeautifulSoup and html5lib parsers built in.
lxml also has XPath.
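For example, a minimal sketch with lxml (Cleaner, text_content() and xpath() are all real lxml.html APIs; raw_html stands in for your fetched page):

    from lxml import html
    from lxml.html.clean import Cleaner

    # Cleaner drops scripts, styles and comments in one pass
    cleaner = Cleaner(scripts=True, style=True, comments=True)

    doc = cleaner.clean_html(html.fromstring(raw_html))

    # All text with tags stripped, or individual <p> blocks via XPath
    full_text = doc.text_content()
    paragraphs = [p.text_content() for p in doc.xpath('//p')]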
