I want to crawl the posts on my girlfriend's Xanga and save them to my computer for easier reading,
but the site requires me to log in before I can view them.

I am wondering: can Python crawl a password-protected web page?
I already have the ID and password, because it is my own ID.

The login page looks like this, for example:
http://gunbuster363.xanga.com/XangaLock.aspx?user=gunbuster363&ReturnUrl=http://www.xanga.com/archives/2004/12

After entering the password and logging in, the page is:
http://gunbuster363.xanga.com/archives/2004/12/


Thanks all!

The concept is the same as in this Tech B snippet.

Give it a try,

Happy coding.

I don't think I am logging in properly.
Can anybody help me out?

This is my code:

import urllib, urllib2, cookielib

#cookie storage
cj = cookielib.CookieJar()
#create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
#Add useragent, sites don't like to interact programs.
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
opener.addheaders.append( ('Referer', 'http://www.hellboundhackers.org/index.php') )

#encode the login data. This will vary from site to site.
#View the sites source code
#Example###############################################
#<form id='loginform' method='post' action='index.php'>
#<div style="text-align: center;">
#Username<br />
#<input type='text' name='user_name' class='textbox' style='width:100px' /><br />
#Password<br />
#<input type='password' name='user_pass' class='textbox' style='width:100px' /><br />
#<input type='checkbox' name='remember_me' value='y' />Remember Me<br /><br />
#<input type='submit' name='login' value='Login' class='button' /><br />

login_data = urllib.urlencode({'XangaHeader$txtSigninUsername' : 'ZZZZZZZZZ',
                               'XangaHeader$txtSigninPassword' : 'ZZZZZZZZZ',
                               'signin' : 'Sign In'
                               })

resp = opener.open('http://holly-hahaha.xanga.com/?nextdate=10/21/2005', login_data)
#you are now logged in and can access "members only" content.
#when your all done be sure to close it

print resp.read()

resp.close()

And this is the part of the web page used for login:

<form id="SigninForm" class="Form1" method="post" action="http://holly-hahaha.xanga.com/XangaLock.aspx?user=holly_hahaha&ReturnUrl=http%3a%2f%2fholly-hahaha.xanga.com%2fhome.aspx%3fuser%3dholly_hahaha%26nextdate%3d10%2f21%2f2005">
<input name="IsPostBack" type="hidden" id="IsPostBack" />
<ul class="list details-only">
<li class="item item-1 item-odd">
<div class="details">
<h4 class="itemtitle"><label for="XangaHeader_txtSigninUsername">Username</label></h4>
<div class="itembody">
<input name="XangaHeader$txtSigninUsername" type="text" id="XangaHeader_txtSigninUsername" maxlength="100" onmouseover="this.className='over';" onmouseout="this.className='';" onfocus="this.className='over';" onblur="this.className='';" tabindex="1" />
</div>
</div>
</li>
<li class="item item-2 item-even">
<div class="details">
<h4 class="itemtitle"><label for="XangaHeader_txtSigninPassword">Password</label></h4>
<div class="itembody">
<input name="XangaHeader$txtSigninPassword" type="password" id="XangaHeader_txtSigninPassword" maxlength="16" onkeypress="return SigninOnEnter(event);" onmouseover="this.className='over';" onmouseout="this.className='';" onfocus="this.className='over';" onblur="this.className='';" tabindex="2" />
<a id="signin" href="javascript: SigninSubmit();" tabindex="3">Sign In</a>
</div>
</div>
</li>
</ul>

I know something is wrong, because what I get back in the Python shell is the page that requires me to log in.

I think your line 28 should have the login address page, and not the page you are trying to read later.
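
The login address is the form's action attribute in the HTML quoted above. As a rough illustration, here is a minimal sketch that pulls the action URL and the named input fields out of a form with the standard library. It is written for Python 3 (the thread's code is Python 2), the FormScanner class is a hypothetical helper, and SAMPLE is a trimmed copy of the quoted form with the ReturnUrl part of the action left out for brevity.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collect the action URL of the first form and its named <input> fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []  # list of (name, type) pairs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            self.fields.append((attrs["name"], attrs.get("type", "text")))


# Trimmed copy of the sign-in form from the thread (ReturnUrl omitted).
SAMPLE = """
<form id="SigninForm" class="Form1" method="post" action="http://holly-hahaha.xanga.com/XangaLock.aspx?user=holly_hahaha">
<input name="IsPostBack" type="hidden" id="IsPostBack" />
<input name="XangaHeader$txtSigninUsername" type="text" />
<input name="XangaHeader$txtSigninPassword" type="password" />
<a id="signin" href="javascript: SigninSubmit();">Sign In</a>
</form>
"""

scanner = FormScanner()
scanner.feed(SAMPLE)
print("POST to:", scanner.action)
for name, kind in scanner.fields:
    print(kind, name)
```

Note that the "Sign In" control is an <a> link rather than an <input>, so it contributes no form field, while the hidden IsPostBack input does.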

I don't know.
The fact is, if I am not logged in, it always shows the login page,
which is "http://holly-hahaha.xanga.com/XangaLock.aspx?user=holly_hahaha&ReturnUrl=http://holly-hahaha.xanga.com/home.aspx%3fuser%3dholly_hahaha%26nextdate%3d10/21/2005"

I have tried using this page to log in:

import urllib, urllib2, cookielib

#cookie storage
cj = cookielib.CookieJar()
#create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
#Add useragent, sites don't like to interact programs.
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
opener.addheaders.append( ('Referer', 'http://www.hellboundhackers.org/index.php') )

#encode the login data. This will vary from site to site.
#View the sites source code
#Example###############################################
#<form id='loginform' method='post' action='index.php'>
#<div style="text-align: center;">
#Username<br />
#<input type='text' name='user_name' class='textbox' style='width:100px' /><br />
#Password<br />
#<input type='password' name='user_pass' class='textbox' style='width:100px' /><br />
#<input type='checkbox' name='remember_me' value='y' />Remember Me<br /><br />
#<input type='submit' name='login' value='Login' class='button' /><br />

login_data = urllib.urlencode({'XangaHeader$txtSigninUsername' : 'ZZZZZZZZZ',
                               'XangaHeader$txtSigninPassword' : 'ZZZZZZZZZ',
                               'signin' : 'Sign In'
                               })

resp = opener.open('http://holly-hahaha.xanga.com/XangaLock.aspx?user=holly_hahaha&ReturnUrl=http://holly-hahaha.xanga.com/home.aspx%3fuser%3dholly_hahaha%26nextdate%3d10/21/2005', login_data)
#you are now logged in and can access "members only" content.
#when your all done be sure to close it

page = urllib2.urlopen("http://holly-hahaha.xanga.com/?nextdate=10/21/2005")
print page.read()
resp.close()

But the page source I get back is not what I expect.
I expected some Chinese text,
but there is not a single Chinese character.
And it is obviously the login page, because I see:

<head>
<title>Xanga - Signin Lock</title>
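
One detail in the attempt above is worth pointing out, whether or not it is the only cause: the final page is fetched with urllib2.urlopen, which uses the default opener and so does not send the cookies stored in cj. Only requests made through opener.open carry them. The sketch below shows the difference against a toy local server. It is written for Python 3 (urllib.request and http.cookiejar in place of urllib2 and cookielib), and the handler, paths and cookie value are made up for the demo; they say nothing about how Xanga actually works.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: /login sets a session cookie, other pages require it."""

    def do_GET(self):
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "session=abc123; Path=/")
            body = b"logged in"
        elif logged_in:
            body = b"private post"
        else:
            body = b"Signin Lock"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly

    # Opener that remembers cookies between requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        no_proxy, urllib.request.HTTPCookieProcessor(jar))
    opener.open(base + "/login").read()
    with_cookies = opener.open(base + "/page").read()

    # Plain opener without a cookie jar, like a bare urlopen() call.
    plain = urllib.request.build_opener(no_proxy)
    without_cookies = plain.open(base + "/page").read()

    server.shutdown()
    server.server_close()
    return with_cookies, without_cookies


if __name__ == "__main__":
    print(run_demo())
```

In the thread's Python 2 code the equivalent change would be to read the archive page with opener.open(...) instead of urllib2.urlopen(...).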

Problem solved.
I found out that no matter what technology or method the website uses to submit the username and password, what it sends is fixed.
We can use browser tools to inspect the request and headers that are sent, and so discover the names of the variables. For example, I used a Firefox add-on named Live HTTP Headers, which shows what is sent in each exchange between browser and server. In Xanga's case, when we press login, this is what is sent to the server:

http://hk.xanga.com/front.aspx

POST /front.aspx HTTP/1.1
Host: hk.xanga.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 GTB7.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Proxy-Connection: keep-alive
Referer: http://hk.xanga.com/
Cookie: __gads=ID=3b1be96f92f256c2:T=1279503979:S=ALNI_MZ-Hq7T3QNrg91hsHqHkJcnuhZ21w; FFSkp=1044,646,16; __qca=P0-1691570154-1279504135090; __utma=259717779.1902645459.1279504131.1279504131.1279504131.1; __utmb=259717779.2.10.1279504131; __utmc=259717779; __utmz=259717779.1279504135.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
Content-Type: application/x-www-form-urlencoded
Content-Length: 101
IsPostBack=true&XangaHeader%24txtSigninUsername=AAAAAAAA&XangaHeader%24txtSigninPassword=AAAAAAAA
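
The last line of the capture is the URL-encoded form body. A small sketch of decoding it in Python 3 to read off the field names (the body string is copied from the capture above, with the masked AAAAAAAA values):

```python
from urllib.parse import parse_qsl

# Request body copied from the captured POST.
body = ("IsPostBack=true"
        "&XangaHeader%24txtSigninUsername=AAAAAAAA"
        "&XangaHeader%24txtSigninPassword=AAAAAAAA")

fields = parse_qsl(body)  # list of (name, value) pairs, percent-decoded
for name, value in fields:
    print(name, "=", value)
```

This prints three fields, IsPostBack, XangaHeader$txtSigninUsername and XangaHeader$txtSigninPassword, with "%24" decoded back to "$".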


So we can send three pieces of information to the server through Python: "IsPostBack" = true, "XangaHeader$txtSigninUsername" = username, and "XangaHeader$txtSigninPassword" = password. (In the captured request the "$" appears as "%24" because the form data is URL-encoded.)
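
Putting that together, a minimal sketch of the login in Python 3 might look like the following. The URL and field names come from the captured request. The function names, the short User-Agent string and the page_url argument are assumptions for illustration, and the site may well behave differently today, so treat this as a starting point rather than a verified recipe.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://hk.xanga.com/front.aspx"  # from the captured request


def build_login_request(username, password):
    """Build the POST request carrying the three captured form fields."""
    fields = {
        "IsPostBack": "true",
        "XangaHeader$txtSigninUsername": username,
        "XangaHeader$txtSigninPassword": password,
    }
    # urlencode turns "$" into "%24", matching the captured body.
    data = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(LOGIN_URL, data=data)
    request.add_header("User-Agent", "Mozilla/5.0")
    request.add_header("Referer", "http://hk.xanga.com/")
    return request


def fetch_protected(username, password, page_url):
    """Log in, then fetch page_url through the same cookie-aware opener."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.open(build_login_request(username, password)).close()
    with opener.open(page_url) as resp:
        return resp.read()
```

A call would look like fetch_protected("my_id", "my_password", "http://gunbuster363.xanga.com/archives/2004/12/"), using the archive URL from the first post. Both requests go through the same opener so that the login cookies are sent with the second one.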
