Web scraping with Beautiful Soup (extracting images)

 

Hello,

I am following this tutorial on how to scrape website information:

http://www.newthinktank.com/2010/11/pyt ... -scraping/

This is my code:
EDIT: do not post off site, moved here

#! /usr/bin/python

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

def cleanHtml(i):
    i = str(i) # Convert the Beautiful Soup Tag to a string
    bS = BeautifulSoup(i) # Pass the string to Beautiful Soup to strip out html


    # Find all of the text between paragraph tags and strip out the html
    i = bS.find('p').getText() 

    # Strip ampersand codes and WATCH:
    i = re.sub('&\w+;','',i)
    i = re.sub('WATCH:','',i)
    return i

def cleanHtmlRegex(i):
    i = str(i)
    regexPatClean = re.compile(r'<[^<]*?/?>')
    i = regexPatClean.sub('', i) 
    # Strip ampersand codes and WATCH:

    i = re.sub('&\w+;','',i)
    return re.sub('WATCH:','',i)


# Copy all of the content from the provided web page
webpage = urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500').read()

# Grab everything that lies between the title tags using a REGEX
titleString = "<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />"
patFinderTitle = re.compile(titleString)


# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)


# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print "\n"

The only parts of the code I've changed are these:

webpage = urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500').read()

titleString = '<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />'
patFinderTitle = re.compile(titleString)

I would like to create something that can extract all the pictures from every page, but for now I'm just trying to pull any JPGs, and I can't seem to figure it out.

Can someone help? I am new to coding.

 

titleString = '<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />'

Use double quotes to include the single quotes inside the string (I fixed it when moving your code here)
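
To spell out the quoting fix, here is a small sketch (written in Python 3 syntax; the variable names and the sample line are made up for illustration). A single-quoted literal ends at the first inner single quote, so the pattern either goes inside double quotes or has its inner quotes escaped:

```python
import re

# Double quotes outside, so the single quotes inside are ordinary characters.
pattern_a = "<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />"

# Equivalent: keep single quotes outside and escape the inner ones.
pattern_b = '<span rel=\'lightbox\'><img src=\'(.*)\' alt=\'Posted Image\' class=\'bbc_img\' />'

# Both spellings produce the same string, so they compile to the same regex.
assert pattern_a == pattern_b

# Hypothetical one-line sample of the markup, invented for this example.
sample = "<span rel='lightbox'><img src='http://example.com/a.jpg' alt='Posted Image' class='bbc_img' /></span>"
print(re.findall(pattern_a, sample))
```

The double-quoted form is usually easier to read, which is why it was used when the code was moved here.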

But listIterator is not defined, what is it?

Maybe

# Print out the results to screen
for t in findPatTitle:
    print t # The title

At least when I ran this:

#! /usr/bin/python
import os
import webbrowser

import contextlib
from urllib import urlopen
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500').read()

# Grab the src of every lightbox image using a regex
titleString = "<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />"
patFinderTitle = re.compile(titleString)

# Store all of the image URLs found in a list
findPatTitle = re.findall(patFinderTitle,webpage)

# Download each picture, save it and open it
for t in findPatTitle:
    pic_name = os.path.basename(t)

    # urlopen's second argument is POST data, not a file mode, so pass only the URL
    with contextlib.closing(urlopen(t)) as pic:
        p = pic.read()
        if len(p) > 1000: # skip tiny responses, which are probably not real pictures
            print t # The image URL
            with open(pic_name, 'wb') as of:
                of.write(p)
            webbrowser.open(pic_name)

the pictures loaded OK (anything of 1000 bytes or less is skipped, so very short responses are never saved).

By the way, you are not actually using BeautifulSoup, nor the functions you defined; you are doing a regular expression search on HTML, which is considered a bad, bad thing.
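
One concrete way the regex approach can go wrong, as a sketch only: it assumes two image tags end up on the same line of the page source, and the sample markup below is invented (Python 3 syntax). The greedy `(.*)` then swallows everything between the first `src='` and the last matching tail:

```python
import re

# Invented sample: two image tags on a single line of HTML.
html = ("<span rel='lightbox'><img src='a.jpg' alt='Posted Image' class='bbc_img' /></span>"
        "<span rel='lightbox'><img src='b.jpg' alt='Posted Image' class='bbc_img' /></span>")

# Same pattern as in the thread (greedy), and a non-greedy variant of it.
greedy = re.compile(r"<span rel='lightbox'><img src='(.*)' alt='Posted Image' class='bbc_img' />")
lazy = re.compile(r"<span rel='lightbox'><img src='(.*?)' alt='Posted Image' class='bbc_img' />")

greedy_hits = greedy.findall(html)
lazy_hits = lazy.findall(html)

print(len(greedy_hits))  # a single match that runs across both tags
print(lazy_hits)         # the two src values, one per tag
```

Even the non-greedy version still depends on the exact attribute order and quoting, so it stops matching as soon as the markup changes slightly. A parser does not have that problem.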

 

The tutorial about BeautifulSoup is not so good.
Regex is not needed here; let BeautifulSoup do the job.
Regex is a poor fit for parsing HTML, although you can sometimes mix a little in to do some cleaning.
But here it's not needed.

To get all the pictures from this link:

from BeautifulSoup import BeautifulSoup
import urllib2

url = urllib2.urlopen('http://supertalk.superfuture.com/index.php?/topic/95817-2-for-1-pics/page__st__500')
soup = BeautifulSoup(url)
links = soup.findAll('img', src=True)
for link in links:
    print link['src']
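
Since the original goal was to pull only the JPGs, the `src` values can be filtered after the parser has collected them. The following is a rough sketch in Python 3, using only the standard library; `jpg_urls` is a hypothetical helper name, and the sample values stand in for whatever `link['src']` returns on the real page:

```python
from urllib.parse import urljoin, urlparse

def jpg_urls(srcs, base_url):
    """Return absolute URLs from srcs whose path ends in .jpg or .jpeg."""
    found = []
    for src in srcs:
        absolute = urljoin(base_url, src)        # resolve relative src values
        path = urlparse(absolute).path.lower()   # ignore any ?query part
        if path.endswith(('.jpg', '.jpeg')):
            found.append(absolute)
    return found

# Made-up src values standing in for [link['src'] for link in links].
page = 'http://example.com/forum/topic/1'
srcs = ['/uploads/a.JPG', 'http://example.com/b.png', 'pics/c.jpeg?size=big']
print(jpg_urls(srcs, page))
```

Checking the extension on the parsed path rather than on the raw string keeps URLs with a query string from being missed. It is only a filename check, though, so an image served without a `.jpg` extension would still be skipped.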

Why is regex not good with HTML/XML? I usually post this link:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
