4 Contributors · 11 Replies · 12 Views · 7 Years Discussion Span · Last Post by vegaseat

Here is a way to get the addresses of the images:

# extract the addresses of the images included in a web page
# you need the modules lxml and beautifulsoup
# (for linux, packages python-lxml and python-beautifulsoup)
# tested with python 2.6 
from lxml.html import soupparser
from urllib2 import urlopen

def gen_elements(tag, root):
    if root.tag == tag:
        yield root
    for child in root:
        for elt in gen_elements(tag, child):
            yield elt

def gen_img_src(url):
    content = urlopen(url).read()
    content = soupparser.fromstring(content)
    for elt in gen_elements("img", content):
        yield elt.attrib.get("src", None)

def main():
    url = "http://www.it.usyd.edu.au/about/people/staff/tomc.shtml"
    for src in gen_img_src(url):
        print(src)

if __name__ == "__main__":
    main()
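For readers without lxml, the same img-src scan can be done with nothing but the standard library's HTML parser. This is a rough sketch, not the poster's method: it uses Python 3's `html.parser` module (the module is named `HTMLParser` on Python 2) and feeds it a small made-up HTML string instead of a live URL:

```python
# sketch: collect img src attributes using only the standard library
# (Python 3's html.parser; on Python 2 the module is called HTMLParser)
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))

# hypothetical sample page, stands in for urlopen(url).read()
sample = '<html><body><img src="a.png"><p>text</p><img src="b.jpg"/></body></html>'
parser = ImgSrcCollector()
parser.feed(sample)
print(parser.sources)  # ['a.png', 'b.jpg']
```

To run it against a real page, you would read the page content with urllib and feed that string to the collector instead of `sample`.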
0

OK, that's an overly complex answer to a question which requires a simple one.

Right-click on the image and save... nothing more complex is required.

@Gribouillis, bear in mind that if the OP is asking how to source the original image, it makes me wonder whether he is even the owner of said image!


It doesn't work. When I run the code, it gave:
Traceback (most recent call last):
File "C:/Users/ALEXIS/Desktop/extactphoto.py", line 5, in <module>
from lxml.html import soupparser
ImportError: No module named lxml.html


> It doesn't work. When I run the code, it gave:
> Traceback (most recent call last):
> File "C:/Users/ALEXIS/Desktop/extactphoto.py", line 5, in <module>
> from lxml.html import soupparser
> ImportError: No module named lxml.html

As I said, you need the lxml module.
@kaninelupus: there are different ways to understand the question. I don't think my solution is complex.

0

Go here and download "Beautiful Soup version 3.1.0.1". This is a compressed .tar.gz file; to uncompress it on Windows you'll need 7zip from here. Right-click on BeautifulSoup.tar.gz and uncompress once, which should give you a file with the suffix .tar; uncompress a second time and you should get a folder BeautifulSoup. Then copy BeautifulSoup.py to the site-packages directory in your Python lib. It should work then.
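As a side note, assuming pip is available on your system, it can usually replace the manual unpacking; the package name here is the Python 2 era BeautifulSoup 3:

```shell
# install lxml and the old BeautifulSoup 3 package with pip
# (assumes pip is on PATH; on modern Python you would install beautifulsoup4 instead)
pip install lxml BeautifulSoup
```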


thx... how about BeautifulSoup, how do I install it?

http://www.crummy.com/software/BeautifulSoup/

Somewhat simple-minded, but you could try this and go from there ...

# retrieve the html code of a given website
# and check for potential image sources
# tested with Python 2.5.4

import urllib2

def extract(text, sub1, sub2):
    """
    extract a substring from text between the first
    occurrences of substrings sub1 and sub2
    """
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]


url_str = 'http://www.it.usyd.edu.au/about/people/staff/tomc.shtml'
fin = urllib2.urlopen(url_str)
html = fin.read()
fin.close()

#print(html)  # test
html = html.lower()

# keep slicing the html forward past each image tag found
while '<img src=' in html:
    s = extract(html, '<img src=', '/>')
    if not s:
        break
    print s
    # slice to potential next image
    pos = html.find(s) + len(s)
    html = html[pos:]
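The extract() helper above is just two splits; a quick standalone check of how it behaves, using Python 3 print syntax and a made-up sample string (note that it keeps the quotes and any trailing space around the src value):

```python
# quick check of the two-split extract() idea on a made-up string
def extract(text, sub1, sub2):
    """extract the substring of text between the first
    occurrences of substrings sub1 and sub2"""
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]

html = '<p>hi</p><img src="pic.png" /><p>bye</p>'
# everything between the first '<img src=' and the first '/>' after it
print(extract(html, '<img src=', '/>'))  # prints "pic.png" (quotes and trailing space included)
```

Stripping the result with s.strip(' "\'') would give the bare address.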

