Hi, I am trying to pull some data from a Web site: http://schoolfinder.com

The issue is that I want to use the advanced search feature, which requires logging into the Web site. I have a username and password, but I want to connect programmatically from Python. I have done data capture from the Web before, so the only new part for me is the authentication. I need cookies, as this login page indicates: http://schoolfinder.com/login/login.asp

I already know how to add POST/GET data to a request, but how do I deal with cookies and authentication? I have read a few articles without success:

urllib2:
http://www.voidspace.org.uk/python/articles/urllib2.shtml#id6

urllib2 Cookbook:
http://personalpages.tds.net/~kent37/kk/00010.html

basic authentication:
http://www.voidspace.org.uk/python/articles/authentication.shtml#id19

cookielib:
http://www.voidspace.org.uk/python/articles/cookielib.shtml

Is there some other resource I am missing? Is it possible that someone could set up a basic script that would allow me to connect to schoolfinder.com with my username and password? My username is "greenman", password is "greenman". All I need to know is how to access pages as if I had logged in through a Web browser.

Thank you very much.

All 4 Replies

The link "http://greenman:greenman@schoolfinder.com/" does not seem to log me into the Web site. Is that for basic authentication? I'm sure this Web site uses cookies somewhere, but I'm just not understanding how to deal with it.

I put your link into several browsers (Safari, Camino, and Firefox) and each shows me as not logged in. Thanks for the help, though.
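
For what it's worth, the user:pass@host form of a URL is just shorthand for HTTP basic authentication. If the site actually used basic auth, you could handle it in urllib2 with a password manager, roughly like the sketch below (the realm of None matches any realm the server reports). Since the site shows us as not logged in, though, it almost certainly uses a form-plus-cookie login instead, so this probably won't help here.

import urllib2

# sketch of HTTP basic authentication with urllib2, in case the site used it;
# a realm of None matches whatever realm the server sends back
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://schoolfinder.com/', 'greenman', 'greenman')
basic_opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
print basic_opener.open('http://schoolfinder.com/').read()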

If you resolve this issue, please post it here! I've been looking for the solution to the exact same problem.

If I find anything, I'll post it here as well. Thanks.

I was able to solve this problem after a lot more research and tinkering. Here is the solution for anyone interested.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import cookielib
import sys
import urllib
import urllib2

# the CookieJar stores any cookies the server sets; the HTTPCookieProcessor
# sends them back on every later request made through this opener
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://schoolfinder.com') # visit the site once so it can set its initial cookie

theurl = 'http://schoolfinder.com/login/login.asp' # the page the login form posts to
body = {'usr': 'greenman', 'pwd': 'greenman'} # field names taken from the login form
txdata = urllib.urlencode(body) # supplying encoded data turns the request into a POST
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'} # fake a user agent; some websites don't like automated exploration

try:
    req = urllib2.Request(theurl, txdata, txheaders) # create a request object
    handle = opener.open(req) # and open it to return a handle on the url
    HTMLSource = handle.read()
    f = open('test.html', 'w')
    f.write(HTMLSource)
    f.close()

except IOError, e:
    print 'We failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'We failed with error code - %s.' % e.code
    elif hasattr(e, 'reason'):
        print "The error object has the following 'reason' attribute :", e.reason
        print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
        sys.exit()

else:
    print 'Here are the headers of the page :'
    print handle.info() # handle.read() returns the page; handle.geturl() returns the true url of the page fetched (in case urlopen followed any redirects)
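
Once the login POST succeeds, the CookieJar inside the opener holds the session cookie, so any later request made through the same opener goes out as the logged-in user. For example (the advanced-search URL below is just a placeholder; substitute whatever page you actually want to scrape):

# reuse the same opener so the session cookie from the login is sent along;
# the URL here is only a placeholder for the page you actually want
search = opener.open('http://schoolfinder.com/search/advanced.asp')
print search.read()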