Anybody know how to speed up beautifulsoup?

gunbuster363 0 Junior Poster

15 Years Ago

I don't understand the documentation

python

vegaseat 1,735 DaniWeb's Hypocrite

15 Years Ago

BeautifulSoup is a third-party module for Python 2 that lets you parse even badly formed HTML. What do you want to do with it?

vegaseat 1,735 DaniWeb's Hypocrite

15 Years Ago

If you have very large HTML documents you have the option to parse only selected parts of the document. Here is an example (Python2 code) ...

import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen("http://python.org").read()

# parse only the <a> tags
a_tag = SoupStrainer('a')
# create a list
a_tags = [tag for tag in BeautifulSoup(html, parseOnlyThese=a_tag)]

# show all the a_tag lines
for line in a_tags:
    print( line )

If you use Python 2, you can also try the Psyco module from:
http://psyco.sourceforge.net/
Psyco is a JIT compiler for i386 that turns Python code into native i386 machine code at run time rather than interpreting the bytecode. It can give speed improvements of 3 to 10 fold.

vegaseat 1,735 DaniWeb's Hypocrite

15 Years Ago

Then you do something like this ...

# search the http://python.org HTML for all <a> tags
# whose title attribute contains the word Python

import re
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen("http://python.org").read()

# parse only the <a> tags
a_tag = SoupStrainer('a')

html_atag = BeautifulSoup(html, parseOnlyThese=a_tag)

# find all titles that contain the word "Python"
title_py = html_atag.findAll(attrs={'title' : re.compile("Python")})
for line in title_py:
    print( line )

AutoPython 5 Junior Poster · Answer 1 · 2009-11-16T09:08:37+00:00

Your question doesn't match the title. Are you saying that you don't know how to use it, or that you know how to use it but want to make it faster?

gunbuster363 0 Junior Poster · Answer 2 · 2009-11-17T12:19:53+00:00

I am using BeautifulSoup, and it is too slow.
The documentation mentions a way to speed up the process, but I don't understand it.
See:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Improving%20Performance%20by%20Parsing%20Only%20Part%20of%20the%20Document

ov3rcl0ck 25 Junior Poster · Answer 3 · 2009-11-17T21:30:14+00:00

Recoding never hurts. Usually the first thing I do to optimize code is replace all range() functions with xrange(). Then I look for loops that can be replaced with lambda, then for if statements that can be changed to elif so the interpreter doesn't have to go over unnecessary if statements, then for regex functions that can be replaced with faster string functions. Finally, I see if I can compact the code into fewer lines and fewer characters to allow faster interpreting and make it more sleek and compact.
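Whichever of these tips you try, it helps to time the change rather than guess. Here is a minimal sketch (Python 3, standard library only) that compares a compiled regex search with a plain substring test. The names `TEXT`, `has_stories_regex`, `has_stories_str` and `compare` are made up for this illustration, and the timings will differ by machine and input.

```python
import re
import timeit

# Hypothetical test input: filler text with one match near the end.
TEXT = "lorem ipsum " * 1000 + "stories/2010/05/11/index.htm"
PATTERN = re.compile(r"stories")


def has_stories_regex(text):
    """Return True if the compiled regex finds 'stories' in text."""
    return PATTERN.search(text) is not None


def has_stories_str(text):
    """Return True if 'stories' occurs in text, using a plain substring test."""
    return "stories" in text


def compare(number=200):
    """Time both variants; return (regex_seconds, substring_seconds)."""
    regex_time = timeit.timeit(lambda: has_stories_regex(TEXT), number=number)
    str_time = timeit.timeit(lambda: has_stories_str(TEXT), number=number)
    return regex_time, str_time


if __name__ == "__main__":
    regex_time, str_time = compare()
    print("regex: %.6fs, substring: %.6fs" % (regex_time, str_time))
```

Both functions return the same answer for a fixed word like this, so the only question is which one is faster on your data, and the timer answers that.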

gunbuster363 0 Junior Poster · Answer 4 · 2009-11-18T12:59:12+00:00

This line returns a list instead of a BeautifulSoup object, so I cannot use findAll() on it:

a_tags = [tag for tag in BeautifulSoup(html, parseOnlyThese=a_tag)]

amrutraj 0 Newbie Poster · Answer 5 · 2010-05-11T04:49:48+00:00

Parsing a page with 8000+ urls with BeautifulSoup

this is the page

http://www.thehindubusinessline.com/cgi-bin/bl2002.pl?mainclass=03

this is my code

from urllib2 import URLError,urlopen
import re
from BeautifulSoup import BeautifulSoup, SoupStrainer

def gethtml(address):
    try:
        raw=urlopen(address)
        raw=raw.read()
    except URLError:
        raw='Error occurred'
    return raw
	
	
dat=gethtml("http://www.thehindubusinessline.com/cgi-bin/bl2002.pl?mainclass=03")
print 'got html'
a_tag=SoupStrainer('a')
html_atag = BeautifulSoup(dat, parseOnlyThese=a_tag)
print 'soup done'
linklist=html_atag.findAll('a',href=re.compile(r'stories'))

The last step, .findAll, takes forever. Is there a faster way to do it?

thanks.
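One possible alternative, if only the href values are needed: skip building a tree and collect the links in a single pass. This is a sketch using the standard library `html.parser` in Python 3, not BeautifulSoup. The class name `StoryLinkCollector`, the function `collect_story_links`, and the sample HTML are all invented for illustration. I have not measured it against the page above.

```python
from html.parser import HTMLParser


class StoryLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href contains a substring."""

    def __init__(self, needle="stories"):
        super().__init__()
        self.needle = needle
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value and self.needle in value:
                self.links.append(value)


def collect_story_links(html, needle="stories"):
    """Return matching href values in document order."""
    parser = StoryLinkCollector(needle)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup, only to show the call.
SAMPLE = """
<html><body>
<a href="/stories/one.htm">First story</a>
<a href="/about.htm">About</a>
<A HREF="/stories/two.htm">Second story</A>
<a name="top">No href here</a>
</body></html>
"""

if __name__ == "__main__":
    for link in collect_story_links(SAMPLE):
        print(link)
```

The substring test replaces `re.compile(r'stories')` here because the pattern is a fixed word. A regex would be needed again for anything more involved.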

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 6 · 2010-05-11T06:43:15+00:00

I do not know BeautifulSoup, and your URL did not work for me, but I tried your approach on another URL, picking out the <a> tags from the New York Times site by filtering with partition.

It seemed fast enough for me.

from urllib2 import URLError,urlopen
import re

def gethtml(address):
    try:
        raw=urlopen(address)
        raw=raw.read()
    except URLError:
        raw='Error occurred'
    return raw

## your url did not work for me, so I used one that did in order to test your program
dat=gethtml("http://www.nytimes.com/")
print 'got html'
print 'length',len(dat)
##a_tag=SoupStrainer('a')
##html_atag = BeautifulSoup(dat, parseOnlyThese=a_tag)
##print 'soup done'
##linklist=html_atag.findAll('a',href=re.compile(r'stories'))
print 'simple filter with partition'
print '-'*80
rest=dat
find=' '
while find:
    start,find1,rest=rest.partition('<a')
    i,find,rest= rest.partition('</a>')
    if 'india' in i.lower() or 'asia' in i.lower():
        print find1+i+find
print '-'*80
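For anyone on Python 3, the same partition idea can be wrapped in a small generator. This is a sketch only. The names `iter_anchor_chunks` and `filter_chunks` and the sample string are made up here. Note that a plain `'<a'` search also matches tags such as `<abbr>` and misses uppercase `<A>`, so this is a rough filter and not a parser.

```python
def iter_anchor_chunks(html):
    """Yield each '<a ...>...</a>' chunk found by plain string partitioning."""
    rest = html
    while True:
        _, opener, rest = rest.partition("<a")
        if not opener:
            return  # no more '<a' in the remaining text
        body, closer, rest = rest.partition("</a>")
        if not closer:
            return  # unterminated anchor: stop instead of yielding a fragment
        yield opener + body + closer


def filter_chunks(html, keywords):
    """Return the chunks that contain any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [
        chunk
        for chunk in iter_anchor_chunks(html)
        if any(k in chunk.lower() for k in lowered)
    ]


# Made-up sample markup, only to show the call.
SAMPLE = (
    '<p><a href="/asia/1.html">Asia news</a> and '
    '<a href="/sports.html">Sports</a> '
    '<a href="/w.html">India votes</a></p>'
)

if __name__ == "__main__":
    for chunk in filter_chunks(SAMPLE, ["india", "asia"]):
        print(chunk)
```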

amrutraj 0 Newbie Poster · Answer 7 · 2010-05-12T11:17:27+00:00

That website is closed for 2 hours every day, from 2:30 am to 4:30 am, for updating the news.

Thanks tonyjv, you really went the extra mile there. That was very informative. I am very new to Python (less than a week) and I am learning through hacking and duct taping, for now. And it's going great. BeautifulSoup is awesome for newbies.
There was a minor problem with my code. vegaseat's code cleared it up.

For now I am going to stick with BeautifulSoup. After this project I will dive into hardcore Python.