I am coding a web spider for research purposes and have run into an error I am uncertain about. I am fairly new to web programming and need a bit of guidance. I use http.client to open a connection, request a page, get the response, and read the response into a variable. Then, using HTMLParser, I attempt to feed() the variable to the parser, but am given this error:

Traceback (most recent call last):
File "C:\Users\snorris4\Desktop\FLOSSmoleSpiderSavannah\src\SavannahSpider.py", line 45, in <module>
main()
File "C:\Users\snorris4\Desktop\FLOSSmoleSpiderSavannah\src\SavannahSpider.py", line 41, in main
spider.feed(page)
File "C:\Python31\lib\html\parser.py", line 107, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly

Any help would be very much appreciated. Thank you.

'''
Created on May 26, 2009

@author: Steven Norris

This program runs as a spider for savannah.gnu.org to add information about
both the GNU projects and non-GNU projects to a database for further investigation.
'''
from html import parser
from http import client
import re

class SpiderSavannahProjectsList(parser.HTMLParser):
    
    
    check_links=[]
    
    def get_page(self, site, page):
        conn=client.HTTPConnection(site)
        conn.request("GET","http://"+site+page)
        resp=conn.getresponse()
        html_page=resp.read()
        return html_page
    
    def handle_starttag(self,tag,attrs):
        if tag=='a':
            link=attrs[0][1]
            if re.search(r'\.\./projects/',link)!=None:
                
                self.check_links.append(link)
    
    def add_to_database(self,links):
        for link in links:
            page=self.get_page('savannah.gnu.org',link[3:len(link)])
            #add page to database here.

def main():
    spider=SpiderSavannahProjectsList()
    page=spider.get_page('savannah.gnu.org','/search/?type_of_search=soft&words=%2A&type=1&offset=0&max_rows=400#results')
    print (page)
    spider.feed(page)
    for i in spider.check_links:
        print (i)
        
main()


You need to convert page to a string before passing it to feed(). Python 3 makes a clear separation between data and text, and you must now convert between them explicitly: feed() expects text (a str) and page is data (bytes). You do this by calling the decode method on page with a suitable encoding:

string = page.decode(charset)
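As a minimal sketch of that conversion, assuming the page happens to be UTF-8 (the sample bytes and the LinkCollector class are made up for illustration and are not part of the original spider):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

# Stand-in for what resp.read() returns: bytes, not str.
page = b'<html><body><a href="../projects/emacs">Emacs</a></body></html>'

text = page.decode('utf-8')   # assumption: the page is UTF-8 encoded
collector = LinkCollector()
collector.feed(text)          # feed() now receives a str, so no TypeError
print(collector.links)
```

Passing page itself to feed() instead of text reproduces the TypeError from the question.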

charset should be the encoding that the page was actually encoded with; otherwise, as soon as you meet any non-ASCII characters, decoding is very likely to fail or produce garbled text. You can get the encoding in three ways:

1. Inspect the gnu.org website and see whether it uses a common character set across all pages; if it does, just use that (your code won't port to other sites, though).

2. Inspect the HTTP header sent for the page, and use the charset it specifies. The problem is that the server sometimes lies about this, because a page may have a different charset than the one the server would normally use.

3. Use a common charset (like utf-8), then go back and start over if the page specifies a different charset in its HTML (not HTTP) header.

I find the best approach is a mixture of 2 and 3; that is, use the charset from the HTTP header by default, and start over with the appropriate one if the page specifies its own.
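A rough sketch of that mixture of 2 and 3 might look like the following. The decode_page function, its parameters, and the regular expression are assumptions made for illustration; a real page can declare its charset in ways this simple pattern misses. With http.client on current Python 3 versions, the header charset can usually be obtained from resp.headers.get_content_charset().

```python
import re

# Matches <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def decode_page(raw, header_charset=None, default='utf-8'):
    """Decode raw page bytes, preferring a charset declared in the HTML.

    header_charset is the charset from the HTTP Content-Type header,
    or None if the server did not send one (hypothetical parameter).
    """
    # Step 2: start with the HTTP header charset, else a common default.
    charset = header_charset or default
    text = raw.decode(charset, errors='replace')

    # Step 3: if the page declares a different charset, start over with it.
    match = META_CHARSET.search(raw[:4096])
    if match:
        declared = match.group(1).decode('ascii')
        if declared.lower() != charset.lower():
            try:
                text = raw.decode(declared, errors='replace')
            except LookupError:
                pass  # unknown charset name: keep the first decoding
    return text

# Made-up sample: a Latin-1 page whose server sent no charset header.
raw = ('<html><head><meta charset="iso-8859-1"></head>'
       '<body>caf\u00e9</body></html>').encode('iso-8859-1')
text = decode_page(raw, header_charset=None)
print(text)
```

The result of decode_page() is a str, so it can be passed straight to feed().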

I hope that wasn't too cryptic. Have a look at the code for this class (it's written for Python 2, not 3, but it will still help). It might help you understand how to do step 3:

http://www.tejerodgers.com/snippets/wp-content/uploads/2009/01/webpage.zip

The example code link that was provided is no longer available. I'm trying to write a Python spider using Python 3.1 and urllib with html.parser (the HTMLParser class), and am running into this same issue.

Does anyone have this example code, or a solution?
