I made a scraper for a web site, but I'm having problems running my code...

#!/usr/bin/env python

from bs4 import BeautifulSoup
import urllib2
import re

# Get the links...

html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()

links = re.findall(r'''<a\s+.*?href=['"](.*?)['"].*?(?:</a|/)>''', html, re.I)

links_range = links[6:len(links)]


# Scrape and append the output...
f = open("test.html", "a")

for link in links_range:
    html = urllib2.urlopen('http://www.blah.fi/' + link).read()
    soup = BeautifulSoup(open(html))
    content = soup.find(id="content") 
    f.write(content.encode('utf-8') + '<hr>')


f.close()

Here is the error...

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
IOError: [Errno 36] File name too long: '\xef\xbb\xbf<!DOCTYPE html PUBLIC "...

If I remove the 'for' loop and run a single instance of a page, it runs correctly.
What does the error mean?


The error message doesn't make sense to me, as it only references line 3:

Traceback (most recent call last):
File "<stdin>", line 3, in <module>
IOError: [Errno 36] File name too long: '\xef\xbb\xbf<!DOCTYPE html PUBLIC "...

which is

from bs4 import BeautifulSoup

Is BeautifulSoup installed correctly?

I installed it via the Ubuntu package manager. I can get other output from it, for example...

import urllib2
from bs4 import BeautifulSoup

f = open("test.html", "a")
html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()
soup = BeautifulSoup(open(html))
content = soup.find(id="content")
f.write(content.encode('utf-8') + '<hr>')
f.close()

I'm not really sure how to test a Python library for successful installation, though.

The following works for me. Perhaps there is no id="content" for the site you use.

import urllib2
from bs4 import BeautifulSoup

f = open("test.html", "a")
html = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(html)
content = soup.find(id="csi")
print "content", content
f.write(content.encode('utf-8') + '<hr>')
f.close()


You are doing some strange stuff.
Can you get urllib2 to work for this site at all?

import urllib2

url = "http://www.blah.fi/"
read_url = urllib2.urlopen(url).read()
print read_url #403 error

This site is blocking the use of urllib; I had to use Requests to get the source code.
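
If you want to stay with urllib2, one thing worth trying (just a sketch, not tested against that site) is to send a browser-like User-Agent header, since many servers return 403 for the default Python-urllib agent string:

import urllib2

url = "http://www.blah.fi/"
# Identify as a regular browser; some servers block the default
# "Python-urllib/x.y" User-Agent with a 403.
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print urllib2.urlopen(request).read()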

You can use BeautifulSoup to get all the links; there is no need to use a regex.

import requests
from bs4 import BeautifulSoup

url = "http://www.blah.fi/"

url_read = requests.post(url)
soup = BeautifulSoup(url_read.content)
links = soup.find_all('a', href=True)
for link in links:
    print link['href']

Is urllib2.urlopen('http://www.blah.fi/' + link).read() really what you want?

It will give you output like this.

>>> 'http://www.blah.fi/' + 'http://v-reality.info/' + '<hr>'
'http://www.blah.fi/http://v-reality.info/<hr>'
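
If some of the hrefs are relative and some are absolute, urlparse.urljoin handles both cases; a quick sketch using the same placeholder URLs:

>>> from urlparse import urljoin
>>> urljoin('http://www.blah.fi/', 'http://v-reality.info/')  # absolute href wins
'http://v-reality.info/'
>>> urljoin('http://www.blah.fi/', 'asdf.html')  # relative href gets resolved
'http://www.blah.fi/asdf.html'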

soup = BeautifulSoup(open(html)) is not the way to do it; the normal way is:

url = urllib2.urlopen("http://www.blah.fi/")
soup = BeautifulSoup(url)
tags = soup.find_all('a')  # find whatever tags you want

IOError: [Errno 36] File name too long:
The error is pretty clear, but I do not understand why it points to line 3. A filename cannot be longer than 255 characters, and open(html) hands the whole downloaded page source to open() as if it were a filename, which is what triggers the error.
Look at the output from content = soup.find(id="content")

print content
print type(content)
print repr(content)
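
Putting the pieces together, here is a minimal sketch of how the whole loop could look (blah.fi and id="content" are the placeholders from the first post, and the [6:] slice mirrors your links_range):

import urllib2
from bs4 import BeautifulSoup

base = 'http://www.blah.fi/'
# Collect the hrefs with BeautifulSoup instead of a regex.
front = urllib2.urlopen(base + 'asdf.html').read()
links = [a['href'] for a in BeautifulSoup(front).find_all('a', href=True)]

f = open("test.html", "a")
for link in links[6:]:
    page = urllib2.urlopen(base + link).read()
    # Parse the downloaded string directly; open(page) would treat the
    # whole page source as a filename, which is what raised Errno 36.
    soup = BeautifulSoup(page)
    content = soup.find(id="content")
    if content is not None:  # skip pages that have no id="content" element
        f.write(content.encode('utf-8') + '<hr>')
f.close()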

I'm not really sure how to test a Python library for successful installation, though.

>>> import bs4
>>> bs4.__version__
'4.1.0'
>>> from bs4 import BeautifulSoup
>>> print BeautifulSoup.__doc__
# You get the description if it works


Is urllib2.urlopen('http://www.blah.fi/' + link).read() really what you want?

Lol, no. That's just a fake site. I didn't want to mention the real site I'm scraping ;)

You were totally right about this part...

url = urllib2.urlopen("http://www.blah.fi/")
soup = BeautifulSoup(url)
tags = soup.find_all('a')  # find whatever tags you want

I've been piecing together bits from various tutorials, and somehow it looked like it was working when I tried small snippets out.

Thanks for your help.
