Web scraping with Beautiful Soup (extracting images)

Question

bkjfdghiuds 0 Newbie Poster

12 Years Ago

Hello,

I am following this tutorial on how to scrape website information:

http://www.newthinktank.com/2010/11/pyt ... -scraping/

This is my code:
EDIT: do not post off site, moved here

#! /usr/bin/python

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

def cleanHtml(i):
    i = str(i) # Convert the Beautiful Soup Tag to a string
    bS = BeautifulSoup(i) # Pass the string to Beautiful Soup to strip out html


    # Find all of the text between paragraph tags and strip out the html
    i = bS.find('p').getText() 

    # Strip ampersand codes and WATCH:
    i = re.sub('&\w+;','',i)
    i = re.sub('WATCH:','',i)
    return i

def cleanHtmlRegex(i):
    i = str(i)
    regexPatClean = re.compile(r'<[^<]*?/?>')
    i = regexPatClean.sub('', i) 
    # Strip ampersand codes and WATCH:

    i = re.sub('&\w+;','',i)
    return re.sub('WATCH:','',i)


# Copy all of the content from the provided web page
webpage = urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500').read()

# Grab everything that lies between the title tags using a REGEX
titleString = "<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />"
patFinderTitle = re.compile(titleString)


# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)


# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print "\n"

The only parts of the code I've changed are these:

webpage = urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500').read()

titleString = '<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />'
patFinderTitle = re.compile(titleString)

I would like to create something that can extract all pictures from every page, but for now I'm just trying to pull any JPGs, and I can't figure it out.

Can someone help? I am new to coding.

python

Edited 12 Years Ago by TrustyTony because: Moved your code insite


TrustyTony 888 ex-Moderator

12 Years Ago

titleString = '<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />'

Use double quotes around the string so that it can contain the single quotes (I fixed this when moving your code here).

But listIterator is not defined. What is it supposed to be?

Maybe you meant:

# Print out the results to screen
for t in findPatTitle:
    print t # The title
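
A minimal sketch of both fixes, written in Python 3 syntax and run against a made-up HTML fragment rather than the live page (`sample_html` and its URLs are placeholders, not content from the forum). The pattern is wrapped in double quotes so the single quotes inside it need no escaping. As an extra precaution that is not part of the original code, `[^']*` stands in for `(.*)` so each match stops at the closing quote:

```python
import re

# Hypothetical markup imitating the forum's image tags; the URLs are placeholders.
sample_html = (
    "<span rel='lightbox'><img src='http://example.com/a.jpg' "
    "alt='Posted Image' class='bbc_img' /></span>\n"
    "<span rel='lightbox'><img src='http://example.com/b.png' "
    "alt='Posted Image' class='bbc_img' /></span>\n"
)

# Double quotes outside, single quotes inside: no escaping needed.
pattern = re.compile(
    r"<span rel='lightbox'><img src='([^']*)' alt='Posted Image' class='bbc_img' />"
)

image_urls = pattern.findall(sample_html)

# Iterate over the matches directly; no index variable is required.
for url in image_urls:
    print(url)
```

Because `findall` returns a plain list of the captured groups, looping over it directly replaces the undefined `listIterator`.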

At least when I ran this:

#! /usr/bin/python
import os
import webbrowser

import contextlib
from urllib import urlopen
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500').read()

# Capture the src attribute of each lightbox image using a regex
titleString = "<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />"
patFinderTitle = re.compile(titleString)

# Collect all of the matched image URLs in a list
findPatTitle = re.findall(patFinderTitle,webpage)

# Download each picture, save it under its base name and open it
for t in findPatTitle:
    pic_name = os.path.basename(t)

    # urlopen takes no file mode; a second argument would be sent as POST data
    with contextlib.closing(urlopen(t)) as pic:
        p = pic.read()
        if len(p) > 1000: # Skip very small responses
            print t # The image URL
            with open(pic_name, 'wb') as of:
                of.write(p)
            webbrowser.open(pic_name)

The pictures downloaded OK (responses of 1000 bytes or less are skipped and not saved).
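
The size check in the code above can be isolated into a small helper. This is only a sketch in Python 3 syntax: `save_if_large`, `min_bytes`, and the sample URLs are names invented for illustration, and the bytes are passed in directly so nothing is actually downloaded.

```python
import os
import tempfile
from urllib.parse import urlsplit


def save_if_large(url, data, directory, min_bytes=1000):
    """Write data into directory under the URL's file name if it is big enough.

    Returns the path written, or None when the payload is min_bytes or
    smaller (the same threshold idea as `len(p) > 1000` in the post).
    """
    if len(data) <= min_bytes:
        return None
    # Use only the path component so a query string does not end up in the name.
    file_name = os.path.basename(urlsplit(url).path)
    path = os.path.join(directory, file_name)
    with open(path, "wb") as out_file:
        out_file.write(data)
    return path


# Usage with placeholder bytes instead of a real download.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_if_large(
        "http://example.com/pics/big.jpg?size=full", b"x" * 2000, tmp_dir
    )
    skipped = save_if_large("http://example.com/pics/tiny.gif", b"x" * 10, tmp_dir)
    saved_name = os.path.basename(saved)
    saved_exists = os.path.exists(saved)
```

Taking the file name from the URL's path rather than the whole URL is a design choice of this sketch; the posted code calls `os.path.basename` on the full URL, which also works when there is no query string.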

By the way, you are not actually using BeautifulSoup, nor the functions you defined. Instead you are running a regular expression search over the HTML, which is considered a bad, bad thing.

Edited 12 Years Ago by TrustyTony


snippsat 661 Master Poster · Answer 1 · 2012-05-16T16:40:58+00:00

The tutorial about BeautifulSoup is not very good.
Regex is not needed here; let BeautifulSoup do the job.
Regex is a poor fit for HTML, although you can sometimes mix in a little regex for cleanup.
But here it's not needed.

To get all the pictures from this link:

from BeautifulSoup import BeautifulSoup
import urllib2

url = urllib2.urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500')
soup = BeautifulSoup(url)
links = soup.findAll('img', src=True)
for link in links:
    print link['src']
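
For readers without BeautifulSoup installed, roughly the same idea can be sketched with Python 3's standard-library `html.parser`. This is a substitute technique, not what the post above uses. `ImageSrcCollector`, `sample_html`, and `base_url` are illustrative names, and the markup is a made-up fragment rather than the real page. The last step keeps only JPG files, since that was the original goal:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also reach this method,
        # because the default handle_startendtag delegates to it.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical markup; quoting style and attribute order vary on purpose.
sample_html = """
<img src='/uploads/one.jpg' alt='Posted Image' class='bbc_img' />
<img class="bbc_img" src="http://example.com/two.PNG">
<img alt="no source here">
<img src="http://example.com/three.JPG" />
"""
base_url = "http://example.com/forum/topic"  # placeholder page address

collector = ImageSrcCollector()
collector.feed(sample_html)

# Resolve relative paths against the page URL, then keep only JPG files.
all_images = [urljoin(base_url, src) for src in collector.sources]
jpg_images = [u for u in all_images if u.lower().endswith((".jpg", ".jpeg"))]

for image_url in jpg_images:
    print(image_url)
```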

Why is regex not good with HTML/XML? I usually post this link:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
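
To make the point concrete, here is a small Python 3 sketch of two ways the pattern from this thread can go wrong. The HTML fragments and URLs are invented for the demonstration:

```python
import re

# The pattern used earlier in this thread.
pattern = re.compile(
    r"<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />"
)

# Case 1: two images on one line. The greedy (.*) stretches to the last
# possible closing text, so one match swallows both tags.
one_line = (
    "<span rel='lightbox'><img src='http://example.com/a.jpg' "
    "alt='Posted Image' class='bbc_img' /></span>"
    "<span rel='lightbox'><img src='http://example.com/b.jpg' "
    "alt='Posted Image' class='bbc_img' /></span>"
)
greedy_matches = pattern.findall(one_line)
print(len(greedy_matches))

# Case 2: the same kind of image, but with double-quoted attributes in a
# different order. The pattern no longer matches anything.
reordered = (
    '<span rel="lightbox"><img class="bbc_img" '
    'src="http://example.com/c.jpg" alt="Posted Image" /></span>'
)
reordered_matches = pattern.findall(reordered)
print(reordered_matches)
```

An HTML parser reads attributes by name, so neither the quoting style nor the attribute order affects it.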