Hello again,

I've been experimenting with PHP and using it to build a screen scraper, but being the noob I am, I've hit a problem: I came across a dynamic page that sends an XMLHttpRequest to the server to fetch new results.
The website is realtor.com. When I search for real estate in, say, Chicago, I use the URL http://www.realtor.com/realestateandhomes-search/60601 to get results.
However, the page displays only the first 10 results, and if I choose to display 50 results, it sends an XMLHttpRequest to http://www.realtor.com/search/resources.aspx (I found it using Firebug).
What I couldn't figure out, since I don't know much about XMLHttpRequests, is how it forms the POST request to get the necessary data, and how to extract the data that comes back.
I've searched the web for an answer but couldn't find anything that covers it.
Maybe someone here has an answer for me.
P.S. I know it's probably against realtor.com's terms, but I am only using this site as an example to get a handle on the concept.

Here are the request and response headers (request first, then response):

Host: www.realtor.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
Accept: text/javascript, application/javascript, */*
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://www.realtor.com/realestateandhomes-search/60601
Content-Length: 2134
Cookie: Move_UUID=ea981d5b5f024fe9bafc0aff8a3648bf; HSID=527c5d4b88_R_dc:10.160.4.250:355483606868:R; rsi_segs=C05504_10005|D08734_70056|D08734_70079|D08734_70102|C05504_10039|C05504_10040|C05504_10041; ASP.NET_SessionId=su1mauilqtmk1pvkagsgye55; MetaKey=_server_error%7Csrp; ParamCookie=[]; s_cc=true; s_sq=%5B%5BB%5D%5D; widgetClicked=oldSRP; views=srp=list; previousState=MD; SRP_ShownWinks=1; listingdetailmpr=http%3A%2F%2Fwww.realtor.com%2Frealestateandhomes-search%2F60601%23%2Fpagesize-50%2Fpg-1; rowselected=3; currentRowIndex=3; agentId=30248; sid=745ad950dac666b18395744db424829febf4a966; recAlertSearch=recAlertShown=false&sameSrch=false&saveLstCnt=0&sid=; RecentSearch=loc%3d46842%26typ%3d3%26mnp%3d%26mxp%3d%26bd%3d0%26bth%3d0%26status%3d1|loc%3d23641%26typ%3d3%26mnp%3d%26mxp%3d%26bd%3d0%26bth%3d0%26status%3d1|loc%3dSPRAGUE%2cNE%2c68438%26typ%3d3%26mnp%3d%26mxp%3d%26bd%3d0%26bth%3d0%26status%3d1|loc%3dMIDLAND%2cMI%2c48667%26typ%3d3%26mnp%3d%26mxp%3d%26bd%3d0%26bth%3d0%26status%3d1|loc%3dChicago%2cIL%2c60601%26typ%3d3%26mnp%3d%26mxp%3d%26bd%3d0%26bth%3d0%26status%3d1; criteria=fhcnt=3&loc=Chicago%2cIL%2c60601&usrloc=Chicago%2cIL%2c60601&typ=3&status=1
Pragma: no-cache
Cache-Control: no-cache

Date: Thu, 11 Nov 2010 14:40:27 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie: SAVEDITEMS=; domain=realtor.com; expires=Wed, 10-Nov-2010 14:40:27 GMT; path=/
Set-Cookie: ParamCookie=[]; path=/
Set-Cookie: criteria=pg=1&fhcnt=3&loc=Chicago%2cIL%2c60601&usrloc=Chicago%2cIL%2c60601&typ=3&status=1; domain=realtor.com; path=/
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
Content-Type: text/javascript; charset=utf-8
ntCoent-Length: 246355
Content-Encoding: gzip
Transfer-Encoding: chunked
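
A request like the one captured above can, in principle, be replayed from PHP with the cURL extension. What follows is only a rough sketch under assumptions: the function names (build_xhr_headers, post_xhr) are made up for illustration, and the POST fields in the example call (pagesize, pg) are hypothetical, because the request body itself is not shown in the capture. The real field names would have to be copied from Firebug's "Post" tab.

```php
<?php
// Sketch only: replay the XHR from PHP with cURL.
// Header values are taken from the Firebug capture; everything else is assumed.

/**
 * Build the request headers that mark the call as an XHR form post.
 */
function build_xhr_headers(string $referer): array
{
    return [
        'Accept: text/javascript, application/javascript, */*',
        'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With: XMLHttpRequest',
        'Referer: ' . $referer,
    ];
}

/**
 * POST url-encoded fields to the endpoint and return the raw response body.
 */
function post_xhr(string $url, array $fields, string $referer, string $cookieJar): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // url-encoded body
        CURLOPT_HTTPHEADER     => build_xhr_headers($referer),
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_ENCODING       => '',         // accept and decode gzip/deflate responses
        CURLOPT_COOKIEJAR      => $cookieJar, // store cookies the server sets
        CURLOPT_COOKIEFILE     => $cookieJar, // send stored cookies back
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Example call (commented out so nothing is sent by accident).
// The field names here are placeholders, not the site's real parameters.
// $js = post_xhr(
//     'http://www.realtor.com/search/resources.aspx',
//     ['pagesize' => 50, 'pg' => 1],
//     'http://www.realtor.com/realestateandhomes-search/60601',
//     sys_get_temp_dir() . '/cookies.txt'
// );
```

Since the session cookies come from the search page, it would probably make sense to fetch that page first with the same cookie jar and only then send the POST. The response is served as text/javascript, so how to pull the data out depends on what the body actually contains: if it turns out to be JSON, json_decode() may be enough; if it is an HTML fragment wrapped in script, it would need to be unwrapped and parsed.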

I am working on getting info from an ASPX HTTPS site that requires a login at the front end. I did some screen scraping in the past using the class_http library by Troy Wolf, and that worked very well. Unfortunately, ASPX sites rely on JavaScript and postbacks, and it becomes very complicated. I spent two or three days researching different tools and trying them, but I couldn't get anything to work.

In the end, I gave up on trying to do it with PHP and used AutoIt instead. AutoIt is a free, BASIC-like language with a wide range of library routines for automation within the Windows desktop environment. The results have been great. AutoIt has IE functions (and some less capable ones for Firefox and Opera) that work from outside the browser, so the browser still does its normal thing, dealing with the postbacks and so forth. In less time than I spent researching PHP tools, I had something up and running. My first project was to extract data from a site and update a database on my server. For the receiving side, I built a custom PHP module, so for each record transferred my AutoIt program talked to the ASPX server and then to mine. To automate it further, I set it up in the Win7 Task Scheduler so the job starts automatically every day.

AutoIt can compile a script to an executable (exe), so you can give a client a module without them needing AutoIt installed. I had an old version of Armadillo (now Software Passport) lying around, so I was also able to use that to protect the exe, license it, and provide time- or usage-limited trials. All in all, this turned out to be a pretty good solution. If I had found a workable PHP solution, it would have been nice to put it on the server and use cron to start it automatically, but this was a pretty good alternative. If you have programming experience (maybe a bit of BASIC), the AutoIt language is pretty easy. They have an active support forum and lots of examples. If you decide to go this route and run into problems, you can PM me (after giving it your best effort) and I may be able to give you a pointer or two.
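
For anyone who would rather keep trying in PHP: much of what makes postbacks awkward is that ASP.NET Web Forms pages carry hidden form fields (typically __VIEWSTATE and, on many pages, __EVENTVALIDATION) that have to be sent back with each POST. Below is a minimal sketch of collecting those fields from a fetched page. The function name extract_hidden_fields, the sample form, and the credential field names (txtUser, txtPass) are all hypothetical; the real names come from the page source. This is not a full login routine, and it does nothing about pages that need JavaScript to run.

```php
<?php
/**
 * Collect every named <input type="hidden"> from a page as name => value.
 * On Web Forms pages this usually includes __VIEWSTATE.
 */
function extract_hidden_fields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical login form, for illustration only.
$html = '<form method="post" action="login.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbC123" />
  <input type="text" name="txtUser" />
</form>';

// Start from the hidden fields, then add the visible ones the user would type.
$post = extract_hidden_fields($html);
$post['txtUser'] = 'me';      // hypothetical field name
$post['txtPass'] = 'secret';  // hypothetical field name
print_r($post);
```

The resulting array would then be posted back to the form's action URL with the same cookie jar used to fetch the page, and the hidden fields would need to be extracted again from every new response before the next postback.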
