Hello, me again :)
With this code:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = urllib2.urlopen('http://www.python.org').read()
>>> soup = BeautifulSoup(url)
>>> links = soup('a')
>>> print links

This prints a list of links to the terminal. I want to write the list to a text file, so I tried this:

>>> with open('python-links.txt.', 'w') as f:
...     f.write(links)

But there was an error:

  File "<stdin>", line 2, in <module>
TypeError: expected a character buffer object
What is the problem? How can I fix that?

One more question. The list looks like this (I'm copying only a small part of it):

[<a href="#content" title="Skip to content">Skip to content</a>, <a id="close-python-network" class="jump-link" href="#python-network" aria-hidden="true">
<span aria-hidden="true" class="icon-arrow-down"><span>&#9660;</span></span> Close
                </a>, <a href="/" title="The Python Programming Language" class="current_item selectedcurrent_branch selected">Python</a>, <a href="/psf-landing/" title="The Python Software Foundation">PSF</a>,

So how can I put each link on its own line?
I tried this:

>>> text = '\n'.join(links)

But I got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string, Tag found

How can I do that?

All 3 Replies

Python complains because the file's write() method needs a string argument, and links is a list of Tag objects. The correct approach here is to extract the values of the href attributes, which contain the link targets. If you just want to write an arbitrary object to the file, you can use write(str(anything)).
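
Both errors in the question have the same cause: write() and '\n'.join() accept only strings, and the list holds Tag objects. Here is a minimal sketch of the str() conversion, one link per line. FakeTag is a hypothetical stand-in for a BeautifulSoup Tag so that the snippet runs without any download, and the temp-directory path is also just an assumption for the example; with the real links list, the same str(link) conversion should apply.

```python
import os
import tempfile


class FakeTag(object):
    """Hypothetical stand-in for a BeautifulSoup Tag: not a string, but str() works on it."""

    def __init__(self, href, text):
        self.href = href
        self.text = text

    def __str__(self):
        return '<a href="{}">{}</a>'.format(self.href, self.text)


links = [FakeTag('/', 'Python'), FakeTag('/psf-landing/', 'PSF')]

# '\n'.join(links) would raise TypeError because the items are not strings.
# Converting each item with str() first gives join() what it expects.
text = '\n'.join(str(link) for link in links)

# write() also wants a single string, so pass the joined text, not the list.
path = os.path.join(tempfile.gettempdir(), 'python-links.txt')
with open(path, 'w') as f:
    f.write(text + '\n')
```

This writes the whole tag for each link. To store only the link target, take the href attribute from each tag instead of calling str() on it.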

Use the new bs4; do not use the old BeautifulSoup package.
Do not call read(); BeautifulSoup detects the encoding and converts to Unicode itself.

As mentioned, you need to take out the href attributes,
and you must learn to study a web page with Firebug or Chrome DevTools.
Then you will see that you only need the addresses that have an href attribute and start with http.

from bs4 import BeautifulSoup  # Use bs4
import urllib2

url = urllib2.urlopen('http://www.python.org')  # Do not call read()
soup = BeautifulSoup(url, 'html.parser')  # Name the parser explicitly
with open('python-links.txt', 'w') as f:
    for link in soup.find_all('a'):
        href = link.get('href')  # None if the tag has no href attribute
        # Keep only absolute links, one per line
        if href and href.startswith('http'):
            f.write('{}\n'.format(href))

Thank you @Grebouillis.

Thank you @snippsat.
