I made a scraper for a web site, but I'm having problems running my code...

#!/usr/bin/env python

from bs4 import BeautifulSoup
import urllib2
import re

# Get the links...

html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()

links = re.findall(r'''<a\s+.*?href=['"](.*?)['"].*?(?:</a|/)>''', html, re.I)

links_range = links[6:len(links)]


# Scrape and append the output...
f = open("test.html", "a")

for link in links_range:
    html = urllib2.urlopen('http://www.blah.fi/' + link).read()
    soup = BeautifulSoup(open(html))
    content = soup.find(id="content") 
    f.write(content.encode('utf-8') + '<hr>')


f.close()

Here is the error...

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
IOError: [Errno 36] File name too long: '\xef\xbb\xbf<!DOCTYPE html PUBLIC "...

If I remove the 'for' loop and run a single instance of a page, it runs correctly.
What does the error mean?


The error message doesn't make sense to me, as it only references line 3:

Traceback (most recent call last):
File "<stdin>", line 3, in <module>
IOError: [Errno 36] File name too long: '\xef\xbb\xbf<!DOCTYPE html PUBLIC "...

which is

from bs4 import BeautifulSoup

Is BeautifulSoup installed correctly?

I installed it via the Ubuntu package manager. I can get other output from it, for example...

import urllib2
from bs4 import BeautifulSoup

f = open("test.html", "a")
html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()
soup = BeautifulSoup(open(html))
content = soup.find(id="content")
f.write(content.encode('utf-8') + '<hr>')
f.close()

I'm not really sure how to test a Python library for successful installation, though.

The following works for me. Perhaps there is no id="content" for the site you use.

import urllib2
from bs4 import BeautifulSoup

f = open("test.html", "a")
html = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(html)
content = soup.find(id="csi")
print "content", content
f.write(content.encode('utf-8') + '<hr>')
f.close()


You are doing some strange stuff.
Can you get urllib2 to work for this site at all?

import urllib2

url = "http://www.blah.fi/"
read_url = urllib2.urlopen(url).read()
print read_url #403 error

This site is blocking the use of urllib; I had to use Requests to get the source code.
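
If you want to stay with urllib2, one thing worth trying (just a sketch, not tested against that site) is to send a browser-like User-Agent header, since many servers return 403 for the default Python-urllib agent string:

import urllib2

url = "http://www.blah.fi/"
# Identify as a regular browser; some servers block the default
# "Python-urllib/x.y" User-Agent with a 403.
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print urllib2.urlopen(request).read()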

You can use BeautifulSoup to get all the links; there is no need to use a regex.

import requests
from bs4 import BeautifulSoup

url = "http://www.blah.fi/"

url_read = requests.post(url)
soup = BeautifulSoup(url_read.content)
links = soup.find_all('a', href=True)
for link in links:
    print link['href']

Is urllib2.urlopen('http://www.blah.fi/' + link).read() really what you want?

It will give you output like this.

>>> 'http://www.blah.fi/' + 'http://v-reality.info/' + '<hr>'
'http://www.blah.fi/http://v-reality.info/<hr>'
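
If some of the hrefs are relative and some are absolute, urlparse.urljoin handles both cases; a quick sketch using the same placeholder URLs:

>>> from urlparse import urljoin
>>> urljoin('http://www.blah.fi/', 'http://v-reality.info/')  # absolute href wins
'http://v-reality.info/'
>>> urljoin('http://www.blah.fi/', 'asdf.html')  # relative href gets resolved
'http://www.blah.fi/asdf.html'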

soup = BeautifulSoup(open(html)) is not the way to do it; the normal way is:

url = urllib2.urlopen("http://www.blah.fi/")
soup = BeautifulSoup(url)
tags = soup.find_all('a')  # find whatever tags you want

IOError: [Errno 36] File name too long:
The error is pretty clear, but I do not understand why it points to line 3. A filename cannot be longer than 255 characters, and open(html) hands the whole downloaded page source to open() as if it were a filename, which is what triggers the error.
Look at the output from content = soup.find(id="content")

print content
print type(content)
print repr(content)
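
Putting the pieces together, here is a minimal sketch of how the whole loop could look (blah.fi and id="content" are the placeholders from the first post, and the [6:] slice mirrors your links_range):

import urllib2
from bs4 import BeautifulSoup

base = 'http://www.blah.fi/'
# Collect the hrefs with BeautifulSoup instead of a regex.
front = urllib2.urlopen(base + 'asdf.html').read()
links = [a['href'] for a in BeautifulSoup(front).find_all('a', href=True)]

f = open("test.html", "a")
for link in links[6:]:
    page = urllib2.urlopen(base + link).read()
    # Parse the downloaded string directly; open(page) would treat the
    # whole page source as a filename, which is what raised Errno 36.
    soup = BeautifulSoup(page)
    content = soup.find(id="content")
    if content is not None:  # skip pages that have no id="content" element
        f.write(content.encode('utf-8') + '<hr>')
f.close()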

I'm not really sure how to test a Python library for successful installation, though.

>>> import bs4
>>> bs4.__version__
'4.1.0'
>>> from bs4 import BeautifulSoup
>>> print BeautifulSoup.__doc__
# You get the description if it works


Is urllib2.urlopen('http://www.blah.fi/' + link).read() really what you want?

Lol, no. That's just a fake site. I didn't want to mention the real site I'm scraping ;)

You were totally right about this part...

url = urllib2.urlopen("http://www.blah.fi/")
soup = BeautifulSoup(url)
tags = soup.find_all('a')  # find whatever tags you want

I've been piecing together bits from various tutorials, and somehow it looked like it was working when I tried small snippets out.

Thanks for your help.
