I have written a script for scraping a web site, and it works fine. What does not work is writing the results to a text file with the write function.

I am trying to run this:

import BeautifulSoup, urllib2, re, time
import codecs

path='C:/Users/Me/Documents/Python'

outfile=open(r'C:/Users/Steinar/Documents/Python/Vegvesen/vegresultat.txt', 'a')

start_url = "http://www.vegvesen.no/Om+Statens+vegvesen/Aktuelt/Offentlig+journal?dokumenttyper=&dato=10.02.2003&journalenhet=&utforSok=S%C3%B8k&submitButton=S%C3%B8k"

datos = (
'01.11.2008',
'02.11.2008',

)

for dato in datos:
    search_url = "http://www.vegvesen.no/Om+Statens+vegvesen/Aktuelt/Offentlig+journal?dokumenttyper=&dato=%s&journalenhet=&utforSok=S%%C3%%B8k&submitButton=S%%C3%%B8k" % dato
    page = urllib2.urlopen(search_url)
    html = page.read()
    soup = BeautifulSoup.BeautifulSoup(html)
    divs = soup.findAll("div", {"class": "treff"})
    for div in divs:
        outfile.write (dato + '|' + div.p.contents[0])

# close the file once, after all dates have been processed
outfile.close()

I get this error message:

outfile.write (dato + '|' + div.p.contents[0])
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 25: ordinal not in range(128)

BeautifulSoup stores everything as unicode strings.
To write BeautifulSoup's unicode strings to a file, you have to encode() them into byte strings, and when you encode() a unicode string you have to specify a codec. Some examples of codecs are "ascii" and "utf-8".
See if this helps.

divs = soup.findAll("div", {"class": "treff"})
for div in divs:
    outfile.write(div.encode('utf-8', 'replace'))
    print div  # test print

outfile.close()

I could not get div.p.contents[0] to work.
Also, please use code tags when you post code.
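
To illustrate the point about codecs, here is a minimal sketch that needs neither BeautifulSoup nor the network. The sample string is made up; it just contains the same u'\u2013' (en dash) character that the traceback complains about. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Made-up sample text containing an en dash (u'\u2013'), like the scraped data.
sample = u"Svar \u2013 journal"

# The "ascii" codec cannot represent the en dash, so a strict encode fails.
try:
    sample.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# "utf-8" can represent every unicode character, so this succeeds.
utf8_bytes = sample.encode("utf-8")

# The 'replace' error handler swaps each unencodable character for '?'.
ascii_replaced = sample.encode("ascii", "replace")

print(ascii_failed)
print(repr(ascii_replaced))
```

So 'replace' only matters for codecs that cannot represent the character; with "utf-8" nothing gets replaced.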

I still can't find a way to write this to a file. I will post my script here with the print command, which works. What I am trying to find out is how I can rewrite this script so that it writes to a file instead of printing in the Python shell.


import BeautifulSoup, urllib2, re, time
import codecs

start_url = "http://www.vegvesen.no/Om+Statens+vegvesen/Aktuelt/Offentlig+journal?dokumenttyper=&dato=10.02.2003&journalenhet=&utforSok=S%C3%B8k&submitButton=S%C3%B8k"

datos = (
'01.11.2008',
'02.11.2008',

)

for dato in datos:
    search_url = "http://www.vegvesen.no/Om+Statens+vegvesen/Aktuelt/Offentlig+journal?dokumenttyper=&dato=%s&journalenhet=&utforSok=S%%C3%%B8k&submitButton=S%%C3%%B8k" % dato
    page = urllib2.urlopen(search_url)
    html = page.read()
    soup = BeautifulSoup.BeautifulSoup(html)
    divs = soup.findAll("div", {"class": "treff"})
    for div in divs:
        print dato + '|' + div.p.contents[0]

First open a file for writing

outFile = open('/home/figved/file.txt','w')

Then replace your print statement with this. I have never used BeautifulSoup so I don't know about the encoding part. I'm just making my best guess based on the code above. But you're asking how to write to a file so this should push you in the right direction.

outFile.write(dato + '|' + div.p.contents[0].encode('utf-8','replace'))
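
An alternative that avoids calling encode() on every string is to let the file object do the encoding. This is only a rough sketch, assuming Python 2.6 or later (io.open also exists on Python 3). The helper name append_results, the sample row and the temporary path are made up for illustration; in the real script the rows would come from the BeautifulSoup loop.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def append_results(path, rows):
    """Append (dato, text) pairs to path as UTF-8, one 'dato|text' line each."""
    # io.open with an encoding accepts unicode strings and encodes on write.
    with io.open(path, "a", encoding="utf-8") as outfile:
        for dato, text in rows:
            outfile.write(u"%s|%s\n" % (dato, text))


# Hypothetical demo data and a throwaway path, standing in for the real ones.
demo_path = os.path.join(tempfile.mkdtemp(), "vegresultat.txt")
append_results(demo_path, [(u"01.11.2008", u"Brev \u2013 svar")])

# Read the file back to check what was written.
with io.open(demo_path, encoding="utf-8") as infile:
    written = infile.read()
```

The with block also closes the file automatically, so there is no outfile.close() to misplace.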

Thanks, guys! I am probably getting closer.
But the line

outfile.write(dato + '|' + div.p.contents[0].encode('utf-8','replace'))

gives me this error message:
TypeError: encode() takes at most 2 arguments (3 given)


Any ideas?

I see that this is about "scraping a website". Do we not care whether that would infringe someone's copyright? Or at least whether it is ethically questionable? I don't want to sound straitlaced, but wouldn't it be better if we knew more about it before offering help with this?

When you don't want people to download your documents, you don't put them online, so no, I don't think it's questionable. Thousands of programs extract data from websites. Why would figved's program be more questionable than Mozilla Firefox?

Frankly, that is a bemusing view by itself. Copyrights don't cease to exist online (just because they are harder to enforce there).

Not that I'm a big fan of copyrights. But how about copylefts? About giving credit where credit is due?

My concern is not about downloading stuff, but about misrepresenting its origin. A web browser by itself does not obscure the source of the data. When you view a web site, you see their content under their identity, their "branding". Website scraping is a technology used to automatically capture data from all over the web in order to present it under a different identity and branding, mostly for the purpose of quickly building ad revenue at the expense of the original authors. That's definitely not only unethical, but also mostly illegal.

I'm not assuming figved's scraping purpose is abusive, but when I see "website scraping", my alarm bells go off. I just think some kind of reassuring statement about it would be welcome.

I think it's pure paranoia to suspect someone of stealing someone else's property just because he is extracting data from their web page. Why not jail people who post in the web development forums?

Gribouillis, you are being offensive, and I'm not going to take it! Who are you to "diagnose" paranoia? If I wanted to consult a psychologist, I would have asked one!

I could continue to argue on the subject, but I won't do that with you! I expect a certain minimum level of good behavior in discussions, and you are definitely not meeting it right now!

commented: minimum level +0

Gribouillis, that you have downvoted my reputation instead of correcting your disrespectful tone, or coming up with any substantiated arguments (of which you have brought up none so far on this matter), only proves how immature your level of interaction is! I'm sorry you think you must win an argument by sneaky methods like that. If you can't respect views other than yours, ask yourself if a public forum is the right place for you (no matter how many posts you make). Now go on and downvote me again!

I don't mean to win anything. I think this is a place to discuss Python programming, with people at different levels who like and learn this language. Hundreds of threads here are about extracting data from HTML sources, and I've never heard anyone suggest that we shouldn't help because of copyrights.

You are not fooling me! Downvoting my reputation on the basis of my views and positions, and labeling it "minimum level" (thereby suggesting my competence was low, which it clearly isn't), was a clear act of aggressiveness. It was the act of an enraged coward.

I have seen through your opportunistic game. Do you think I don't recognize what you are up to with arguments like comparing website scraping with Firefox? When I responded to it on a logical ground, you came up with "Why not jail people who post in the web development forums" (your disrespectful tone set aside). None of these had the slightest substance or contribution to the issue I brought in. You were not engaging in any constructive discussion, just "bashing" around with words.

For the record: Do you think you can misquote me and use that as an argument against me? I never said "we shouldn't help". I said we should be responsible about what we are helping with. Being the first to think about this makes me original. It doesn't make you right!

No, you can't fool me into believing you were not trying to "win a fight". Because you were. It was your decision to apply sneaky, unfair measures instead of helping me resolve my issue (which is why we are all here, aren't we?). And that is a real shame.

Now move on, and get happy with what you did!

lol

Hello people!

I work as an investigative reporter, and all I am trying to do is download some data from a government homepage, data which is already online. Quite a few journalists around the world use this technique when they struggle to get information that is already online delivered as a data file for researching their stories.

Good luck with your struggle!

That's all I need to know.

To circumvent the exception, you can omit 'replace'.

outfile.write (dato + '|' + div.p.contents[0].encode('utf8'))

What also works for me is

outfile.write (dato + '|' + str(div.p.contents[0]))

The extra parameter, 'replace', probably raises an exception because you are using a version of BeautifulSoup that has a glitch. (It works with my version, 3.0.7a.)

I'd also suggest adding a line break at the end of each entry, like

outfile.write (dato + '|' + str(div.p.contents[0]) + '\n')
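
A guess at what is behind the TypeError, not verified against the BeautifulSoup source: the message counts three arguments for a call that passes two, which is what Python reports when a method written in Python as encode(self, encoding) also counts self. The sketch below mimics that with a made-up class, NarrowEncodeText, which is not BeautifulSoup's real class. It then shows that converting to a plain unicode string first makes the two-argument encode() available again. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
text_type = type(u"")  # unicode on Python 2, str on Python 3


class NarrowEncodeText(text_type):
    """Hypothetical stand-in for a text node whose encode() takes one argument."""

    def encode(self, encoding="utf-8"):
        return text_type.encode(self, encoding)


node = NarrowEncodeText(u"Brev \u2013 svar")

# Passing an error handler fails, because the override does not accept it.
try:
    node.encode("utf-8", "replace")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Converting to the plain text type restores the built-in encode().
plain = text_type(node)
encoded = plain.encode("ascii", "replace")

print(error_message)
print(repr(encoded))
```

If this guess is right, unicode(div.p.contents[0]).encode('utf-8', 'replace') should behave the same way in the Python 2 script regardless of the BeautifulSoup version, but I have not tested that against the affected version.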