
http://www.iso-ne.com/markets/hrly_data/selectHourlyLMP.do

I'm trying to parse a CSV generated from the above link.

From what I gather, the direct link is:
http://www.iso-ne.com/markets/hrly_data/selectHourlyLMP.do?locationType=HUB&node=4000&startDate=12%2F20%2F2011&endDate=12%2F20%2F2011

My problem is that it downloads about half the time, and half the time it doesn't, almost as if it gets cached on my system.

Is there a way to download and parse a CSV that is generated from a form? The one or two times it has pulled, the CSV seems to be written almost in a hidden form outside the HTML.

Any info on what I should look into to accomplish this would be appreciated.

TY!

3 Contributors / 9 Replies / 10 Views / 5 Years Discussion Span / Last Post by tgr0ss83

Is this your site?
Are you asking how this site does it?
.do files suggest JSP, not PHP.

Are you asking why this site fails sometimes? Or have you got a working example that fails sometimes?


I have an example that works about 10% of the time. Every time I follow the link, it seems that the file is cached.

The site in question is NEPOOL, New England's utility grid. The data I'm trying to obtain is publicly available; I'm pulling it into my own dynamic table.

The CSV seems to be generated in a hidden form, from what I've seen, because fgetcsv, when it worked, would return the CSV contents plus the page HTML.

My question is: is there a way with PHP to download the file from the dynamic link? I've tried adding a 3-5 second sleep, among other things. My feeling is that either I in fact do NOT have the link straight, or, since the file is dynamically generated, that it does not have enough time.

I guess my question is: should I be looking beyond the standard fgetcsv commands to get this working properly?

I've spent about the last 7 hours trying variations of the link and multiple ways of passing the variables through the URL. At the end of the day, my assumption is that it is just being cached on my system and downloading from there.

Plainly put, I'm stumped on how I should be approaching this, as the variables passed through the URL seem very straightforward.
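For reference, here's roughly what I've been trying, rewritten as a cURL sketch (untested against the live site; the HTML-skipping heuristic is just my guess based on the mixed CSV-plus-HTML output I described):

```php
<?php
// Sketch: fetch the dynamic URL with cURL instead of plain fgetcsv,
// then keep only the lines that look like CSV records, since the page
// seems to return CSV mixed with HTML.

function fetchBody(string $url): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // enable cookie handling (the site sets JSESSION_ID)
        CURLOPT_HTTPHEADER     => ['Cache-Control: no-cache', 'Pragma: no-cache'],
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? '' : $body;
}

// Keep only non-HTML lines that contain commas, and parse each one.
function extractCsvRows(string $body): array {
    $rows = [];
    foreach (preg_split('/\r\n|\r|\n/', $body) as $line) {
        $line = trim($line);
        if ($line !== '' && $line[0] !== '<' && strpos($line, ',') !== false) {
            $rows[] = str_getcsv($line);
        }
    }
    return $rows;
}
```

Calling extractCsvRows(fetchBody($url)) would then give an array of arrays, one per CSV record.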


Do they have an API, or have you contacted them for a link to their data? Or are you being sneaky and using their data without them knowing? Are they leaking bandwidth to you?


I guess you could call it being sneaky, but from talking to one of our suppliers, as long as we're not hitting them every 10 seconds they don't have a problem with it.

All the data I'm pulling is publicly available. In most cases these APIs and whatnot are only available to people who are Generators or REPs (retail electricity providers).

Pull it when the intervals fire.


I found a PDF on their site that shows how to get the full files, but the ones I was targeting are more specialized: not thousands of lines, but more like 10-50.

Anyway, I'll read through it and see if it offers a solution.


If you check the headers of that link, you will see that the first time you download the file you get:

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Date: Tue, 20 Dec 2011 23:15:55 GMT
Expires: Tue, 20 Dec 2011 23:16:55 GMT
Transfer-Encoding: chunked
Connection: Keep-Alive
Set-Cookie: JSESSION_ID_hrly_data=G1zpTxXLBnKKK3jqXh0vpkJzh8VtkVcq1jPrjDNKtrcJ5kwv8nSm!1260773349; domain=.iso-ne.com; path=/; HttpOnly
Set-Cookie: markets_hrly_data_lb=ROUTE.markets_hrly_data0; path=/markets/hrly_data;
Last-Modified: Tue, 20 Dec 2011 23:15:54 GMT
X-Powered-By: Servlet/2.5 JSP/2.1

Right after:

HTTP/1.1 200 OK # this is probably 304, my DNS returns 200
Content-Type: text/html; charset=ISO-8859-1
Date: Tue, 20 Dec 2011 23:16:03 GMT
Expires: Tue, 20 Dec 2011 23:16:55 GMT
Transfer-Encoding: chunked
ETag: W/"-1-1324422955000"
Connection: Keep-Alive
Set-Cookie: markets_hrly_data_lb=ROUTE.markets_hrly_data0; path=/markets/hrly_data;
Last-Modified: Tue, 20 Dec 2011 23:15:55 GMT
X-Powered-By: Servlet/2.5 JSP/2.1

What we see in both response headers is the expiration time, set to one minute after the first download: for each subsequent request within that minute you will almost always get the same data. It's not only the Expires date that indicates this, but also the ETag, which is used to compare the cached version with the server's response.
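Just to sketch acting on that from PHP (strtotime() handles the RFC 1123 date format shown in the headers above):

```php
<?php
// Minimal freshness check driven by the Expires header shown above.
function isExpired(string $expires, ?int $now = null): bool {
    $ts = strtotime($expires); // parses e.g. "Tue, 20 Dec 2011 23:16:55 GMT"
    return $ts !== false && $ts <= ($now ?? time());
}
```

If isExpired() returns false, a new request within that window will most likely hand you the cached copy again.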

You can try altering the ETag in your request (the request-side header is If-None-Match) and see what you get. I tried and got HTTP/1.0 303 See Other from OpenDNS, which is caching, so try it yourself (note the URL needs quoting in the shell because of the ampersands):

curl -I -H "If-None-Match: W/-1-1324425159000" "http://www.iso-ne.com/markets/hrly_data/selectHourlyLMP.do?locationType=HUB&node=4000&startDate=12%2F20%2F2011&endDate=12%2F20%2F2011"

Anyway, if you don't get any fresh result you can set your script to check the ETag: its value between -1- and 000 is the Unix timestamp of when the cached resource was created, which is always one minute before the Expires date. Hope this is useful, bye.
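For example, something like this (untested) could pull that timestamp out:

```php
<?php
// Extract the Unix timestamp embedded in an ETag like W/"-1-1324422955000":
// per the note above, the digits between "-1-" and the trailing "000" are
// the creation time of the cached resource.
function etagTimestamp(string $etag): ?int {
    if (preg_match('/-1-(\d+)000/', $etag, $m)) {
        return (int) $m[1];
    }
    return null;
}
```

Comparing that value against time() tells you how stale the cached copy is.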



I've read something about this. Just got home from the office; I'll look into both this and the PDF more thoroughly in the AM.

Thank you


I had a look at spoofing the form. It's a little more complicated, as there are a few intermediate submits needed to set up the subsequent dropdowns.
Even with a full URL, it seems you need to press the Download CSV button to get there.

I'm sure with a bit of ingenuity and painstaking scraping for each combination, you could get a spoofed form.
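Something along these lines, with the big caveat that the real POST field names would have to be scraped from the live form; the ones below are only guessed from the query string quoted earlier in the thread:

```php
<?php
// Sketch of spoofing the form with a POST. Field names are assumptions
// based on the GET query string, not the actual scraped form.
function buildFormFields(): array {
    return [
        'locationType' => 'HUB',
        'node'         => '4000',
        'startDate'    => '12/20/2011',
        'endDate'      => '12/20/2011',
    ];
}

function postForm(string $url, array $fields) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEFILE     => '', // keep JSESSION_ID across the intermediate submits
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
```

Each of the intermediate submits that populate the dropdowns would need its own postForm() call, with the session cookie carried over between them.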


"I'm sure with a bit of ingenuity and painstaking scraping for each combination, you could get a spoofed form."

I'm sure w/ a little Java love it can be done pretty easily. I'll find a way; it's pretty much my goal for the day :)
