SUMMARY
Hello all, I am currently working on a project that requires me to mine data from websites so I can work with it later. I have been using cURL to accomplish pretty much all my needs so far, but I have run into a complication that cURL is not capable of handling. If someone could point me in the right direction, that would be awesome. I am completely comfortable with the solution being in a different language if that's necessary.

WHAT I AM TRYING TO ACCOMPLISH
I am trying to mine data from a website that requires me to log in and then make clicks to display the relevant data. The site is essentially run on JavaScript, so unless I can emulate those clicks I cannot get the information I need to be displayed. If someone could suggest a potential route for mining data from a JavaScript-based site, that would be great.

Thanks in advance

All 4 Replies

Regardless of how the client-side code is handled, whether by old-fashioned forms or a complex JavaScript application, the data must be fetched via a request. The trick, then, is not to mimic the entire JavaScript application, but to figure out what requests the app sends and what parameters they need in order to return the expected data. Then use cURL (or something like it) to mimic those requests.
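To illustrate the idea, here is a minimal sketch in Python of the two raw requests you would reproduce instead of driving the UI: the login POST and the data request the page's JavaScript fires afterward. The endpoints, field names, and values below are hypothetical placeholders, and the requests are only constructed (not sent) so you can see their shape; the same URLs and headers could be passed to cURL instead.

```python
# Sketch: reproduce the app's own requests rather than its clicks.
# All endpoints and field names below are invented placeholders.
import urllib.parse
import urllib.request

# 1) The login POST the browser sends (visible in the network panel).
login_body = urllib.parse.urlencode({"username": "me", "password": "secret"})
login_req = urllib.request.Request(
    "https://example.com/login",          # placeholder login endpoint
    data=login_body.encode(),
    method="POST",
)

# 2) The data request the page's JavaScript fires after the clicks.
query = urllib.parse.urlencode({"sport": "nfl"})
data_req = urllib.request.Request(
    "https://example.com/api/lines?" + query,   # placeholder data endpoint
    headers={"X-Requested-With": "XMLHttpRequest"},  # how AJAX calls mark themselves
)

print(data_req.full_url)  # the exact URL you could also fetch with cURL
```

Sending these for real is then just `urllib.request.urlopen(login_req)` followed by the data request, reusing whatever session cookie the login response sets.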

A much simpler way would be to just have the site expose the data in a more mining-friendly way, in an XML or JSON formatted page. Surely if the owner of these sites does not mind you mining it, you can come to some sort of an arrangement to allow you to bypass these complications.
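For comparison, consuming such a mining-friendly feed is trivial. The payload below is an invented sample of what a JSON page of sports lines might look like; the field names are assumptions for illustration only.

```python
# Consuming a hypothetical JSON feed of sports lines.
# The payload and its field names are invented for illustration.
import json

payload = '{"lines": [{"game": "A vs B", "spread": -3.5}, {"game": "C vs D", "spread": 1.0}]}'
data = json.loads(payload)

for line in data["lines"]:
    print(line["game"], line["spread"])
```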

If the owner of the site is using complex and frequently updated JavaScript code specifically to defeat these kinds of data-mining attempts, you honestly don't have much of a chance of maintaining a scraper. At least not without a LOT of continuous work.

Atili, thank you very much for your response. I attempted your first suggestion to find the executed JS, but I came to a halt there because I am not a JS expert. The only way I knew to pass that JS through to the browser was via the 'javascript:' command. Am I on the right path here but too ignorant to know it?

Unfortunately, they do not have an XML feed or an API of sorts that would allow me to accomplish this easily. However, I have seen other websites complete the EXACT same task with a good degree of accuracy (not perfect, but good). If they are not doing this programmatically, it would have to be done by hand, and I seriously doubt that is the case.

Specifically, I am trying to mine data from a sportsbook website to get their updated sports lines.

Again, I sincerely appreciate your time. Thank you

If you read my first post again, note that I was NOT suggesting trying to execute the JavaScript. In fact, that is highly unlikely to work. What I was suggesting was to find the requests made by the browser during the execution of the JavaScript apps and mimic those in your code.

You don't need to look through the JavaScript code of the target site to do this, or execute it in any way. All you need to do is monitor the requests made by the browser during normal use of the site, and then reproduce those requests during your mining operation. A browser developer tool like FireBug, or the built-in developer tools of all the major browsers (except Firefox, which uses the FireBug addon), can show you all the network activity going on, including details about each AJAX request made by the JavaScript code. That should be all the info you need.
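A captured AJAX request can then be replayed by copying its URL, headers, and session cookie straight out of the developer tools' network panel. The sketch below, in Python, shows the shape of such a replayed request; the endpoint, cookie name, and values are placeholders standing in for whatever the network panel actually shows.

```python
# Replaying an XHR captured in the browser's network panel.
# Endpoint, cookie, and header values are hypothetical placeholders.
import urllib.parse
import urllib.request

captured_cookie = "sessionid=abc123"   # copied from the logged-in browser session
params = urllib.parse.urlencode({"league": "nba"})

req = urllib.request.Request(
    "https://example.com/ajax/odds?" + params,   # URL seen in the network panel
    headers={
        "User-Agent": "Mozilla/5.0",             # match the browser you captured
        "Cookie": captured_cookie,               # reuse the browser's session
        "X-Requested-With": "XMLHttpRequest",    # how the site's own JS calls it
    },
)

print(req.get_header("Cookie"))  # the session cookie rides along with the request
```

Because the server only sees the request, a well-copied replay is indistinguishable from the site's own JavaScript making the call.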

Granted, if the JavaScript code is doing something like encrypting the data before submitting it, you may have to reverse engineer a solution from the JavaScript code to use in your own code. Or just copy their JavaScript code and run it on Node.js or something on the server-side. That may be simpler.

I've tried quite a bit of this using different tools. In my experience, AutoIt is the best tool to use. It is a Windows-based, BASIC-style scripting language, so if you need a web-based solution then this won't work for you. It uses the COM interface to Internet Explorer, and it's capable of doing almost anything in terms of logins, navigation, form-filling, etc. Other solutions I tried didn't work with sites based on JavaScript. AutoIt sits outside the browser environment and acts like an automated user, so it does work with JavaScript-based sites. If you need the extracted data on a server, the last step can be an automated upload (you may have to add some code to the server to support it).
