problem parsing webpage using BeautifulSoup

Question

hemant_rajput 0 Newbie Poster

12 Years Ago

Hi, I've used the BeautifulSoup module to parse the site and grab the img tag from it, but the problem is that BeautifulSoup does not return the whole content of the given URL when parsing. The missing content contains the image location I want to download:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup


#reading the webpage source
webpage = urlopen('http://www.santabanta.com/photos/aalesha/10066001.htm').read()

#putting all the webpage content into a variable named soup using BeautifulSoup
#(read() already returns a single string, so no join is needed)
soup = BeautifulSoup(webpage)
print soup

#finding all the img tags
imagelocation = soup.findAll('img')

#printing the img content
for i in imagelocation:
    print i

I want to extract the following link: "http://media1.santabanta.com/full5/indian celebrities(f)/aalesha/aalesha-1a.jpg". If you look at the source code of the webpage you will find the <img> tag at line no. 234, but it is not present after parsing with BeautifulSoup. When I call soup.prettify() I get the whole webpage parsed; otherwise some fields are missing. Can someone tell me what I'm doing wrong?

beautiful-soup images parse python


snippsat 661 Master Poster

12 Years Ago

I want to extract the following link "http://media1.santabanta.com/full5/indian celebrities(f)/aalesha/aalesha-1a.jpg".

The problem is that the link you want is loaded by JavaScript.
We can see the link in the downloaded text, so we can skip simulating JavaScript and use a regex instead (because BeautifulSoup can't find things inside JavaScript).

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re

webpage = urlopen('http://www.santabanta.com/photos/aalesha/10066001.htm')
soup = BeautifulSoup(webpage)
#print soup

bac_img = re.search(r"""backgroundImage="url\('(.*)'""", str(soup))
print bac_img.group(1)
#http://media1.santabanta.com/full1/Indian  Celebrities(F)/Aalesha/aalesha-1a.jpg

#Example of how to print an image location that is not loaded by JavaScript
'''
imagelocation = soup.findAll('img')
for imgTag in imagelocation:
    print imgTag['src']'''
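The same regex idea can be sketched in Python 3 with only the standard library. This is a rough illustration, not a tested scraper for the site: `SAMPLE_HTML` is a made-up stand-in for the inline script on the page (the real markup may differ), and `find_background_image` and `make_downloadable` are hypothetical helper names. The capture group is non-greedy, so it stops at the first closing quote, and the spaces in the path are percent-encoded so the URL can be passed on to a download call.

```python
import re
from urllib.parse import quote

# Made-up sample imitating the inline script; the real page may look different.
SAMPLE_HTML = """
<script>
document.getElementById('wall').style.backgroundImage="url('http://media1.santabanta.com/full1/Indian  Celebrities(F)/Aalesha/aalesha-1a.jpg')";
</script>
"""

# Non-greedy group: capture everything up to the first closing quote and parenthesis.
PATTERN = re.compile(r"""backgroundImage\s*=\s*"url\('(.*?)'\)""")


def find_background_image(html):
    """Return the URL inside backgroundImage="url('...')", or None if absent."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


def make_downloadable(url):
    """Percent-encode spaces while keeping ':', '/', '(' and ')' as they are."""
    return quote(url, safe=":/()")


url = find_background_image(SAMPLE_HTML)
print(url)
print(make_downloadable(url))
```

Returning None when nothing matches avoids the AttributeError that calling `.group(1)` on a failed search would raise.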


hemant_rajput 0 Newbie Poster · Answer 1 · 2011-11-12T16:43:18+00:00

Actually, the link that your code retrieves is not of the size I wanted. You are retrieving "http://media1.santabanta.com/full1/Indian Celebrities(F)/Aalesha/aalesha-1a.jpg" and I want "http://media1.santabanta.com/full5/indian celebrities(f)/aalesha/aalesha-1a.jpg". I'm actually going to iterate over every single wallpaper available on that site, so I can't just replace full1 with full5. Also, as I'm going to iterate with the next button, I want to retrieve the maximum size in which each image is available. I would also like to know whether the hyperlink in the next button is written in JavaScript or in plain HTML.

hemant_rajput 0 Newbie Poster · Answer 2 · 2011-11-14T17:56:41+00:00

Can anyone explain why I'm not able to get the following lines of code when reading the HTML page with urlopen? The source is taken from this URL: http://www.santabanta.com/photos/aalesha/10066001.htm. Below is the HTML code:

<tr><td align=center>
<img src="http://media1.santabanta.com/full2/indian  celebrities(f)/aalesha/aalesha-1a.jpg" id="wall" border="0" align="middle"  width="1000"  onload="Loading.style.display='none';sendthis.style.display='inline';error1.style.display='none';" onError="change_addr()" alt="Aalesha" title="Aalesha" />
<center><font style="padding-top: 5px;font-weight:bold;font-size:12pt;"> Aalesha </font></center>

</td></tr>
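If the downloaded HTML does contain that row, picking out the image is straightforward. Below is a minimal Python 3 sketch that uses the standard library's `html.parser` in place of BeautifulSoup (a deliberate swap so the example has no third-party dependency). `WallImageFinder`, `find_wall_image` and `ROW` are hypothetical names, and `ROW` is a shortened copy of the snippet above with the event handlers left out.

```python
from html.parser import HTMLParser


class WallImageFinder(HTMLParser):
    """Remember the src of the first <img> whose id attribute is "wall"."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        attributes = dict(attrs)
        if tag == "img" and attributes.get("id") == "wall" and self.src is None:
            self.src = attributes.get("src")


def find_wall_image(html):
    """Return the src of the img with id="wall", or None if there is none."""
    finder = WallImageFinder()
    finder.feed(html)
    return finder.src


# Shortened copy of the row quoted above.
ROW = '''<tr><td align=center>
<img src="http://media1.santabanta.com/full2/indian  celebrities(f)/aalesha/aalesha-1a.jpg" id="wall" border="0" align="middle"  width="1000" alt="Aalesha" title="Aalesha" />
</td></tr>'''

print(find_wall_image(ROW))
```

This only helps once the row is actually present in what the server sends back; it does not explain why urlopen receives a different page.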

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2011-11-14T18:06:11+00:00

I can't explain, but perhaps you could try to browse the site with the mechanize module which offers more browsing options.
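One browsing option that can be tried before reaching for mechanize is sending a browser-like User-Agent header, since some servers return different HTML to scripts than to browsers. Whether this site does so is only a guess, not something confirmed in this thread. The sketch below is Python 3 standard library code; `BROWSER_UA`, `build_request`, `fetch` and `contains_wall_image` are hypothetical names, and the header value is an arbitrary example string.

```python
from urllib.request import Request, urlopen

PAGE_URL = "http://www.santabanta.com/photos/aalesha/10066001.htm"

# Arbitrary browser-like value, chosen for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes using the browser-like request."""
    with urlopen(build_request(url), timeout=30) as response:
        return response.read()


def contains_wall_image(html_bytes):
    """Check whether the downloaded bytes include the img tag with id="wall"."""
    return b'id="wall"' in html_bytes
```

Comparing `contains_wall_image(fetch(PAGE_URL))` with the same check on a plain `urlopen(PAGE_URL).read()` would show whether the header makes any difference here.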
