There is this function:

function disguise_curl($url) 
{ 
	$curl = curl_init(); 

	// setup headers - used the same headers from Firefox version 2.0.0.6
	// below was split up because php.net said the line was too long. :/
	$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
	$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
	$header[] = "Cache-Control: max-age=0"; 
	$header[] = "Connection: keep-alive"; 
	$header[] = "Keep-Alive: 300"; 
	$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
	$header[] = "Accept-Language: en-us,en;q=0.5"; 
	$header[] = "Pragma: "; //browsers keep this blank. 

	curl_setopt($curl, CURLOPT_URL, $url); 
	curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'); 
	curl_setopt($curl, CURLOPT_HTTPHEADER, $header); 
	curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com'); 
	curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); 
	curl_setopt($curl, CURLOPT_AUTOREFERER, true); 
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
	curl_setopt($curl, CURLOPT_TIMEOUT, 10); 

	$html = curl_exec($curl); //execute the curl command 
	if (!$html) 
	{
		// note: the original used an undefined $ch handle here; it must be $curl
		echo "cURL error number: " . curl_errno($curl);
		echo "cURL error: " . curl_error($curl);
		curl_close($curl);
		exit;
	}
  
	curl_close($curl); //close the connection 

	return $html; //and finally, return $html 
}

...that several people seem to use to scrape content off a website (to state the obvious, you would do "echo disguise_curl($url)").

Is there any way to detect if someone is doing that to my site, and block access to them or show a page with a specific message?

I've experimented with some sites to see if they manage to block access this way, and found http://london.vivastreet.co.uk manages to do that. I haven't been able to figure out how, but maybe someone can.

A second query: Why would someone write a complicated function like that when file_get_contents($url) does the same thing? Is it to avoid suspicion?

Thank you very much for your time.

All 7 Replies

I just realized that the specific URL I am not able to access with that function is http://london.vivastreet.co.uk/cars+london. Since that link wasn't accessible, I assumed the entire site was blocked, and so posted the home page URL, which does appear to be accessible through this function. Any idea why this is happening? Is the "+" in that URL doing something, or is there some way to block specific URLs from cURL?

I can access this link via cURL without any problem.

About blocking cURL to prevent scraping: I think using cURL is just like using a browser to get to your site. If you put something up that can be browsed to, someone else can fetch it with cURL.
How is this different from fetching the page with any other web browser and saving it offline?
It could be a normal, valid user; you never know.
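
As a rough sketch (illustrative only, not something I'd rely on), you could reject requests that are missing headers every real browser sends. That catches a bare curl_exec() or file_get_contents() call with default settings, but the disguise_curl() function above fakes exactly these headers, so it would walk straight through:

// naive server-side check - purely illustrative
$ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$lang = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';

// plain cURL and file_get_contents() usually send no Accept-Language
// (and often no User-Agent at all) unless told to
if ($ua === '' || $lang === '') {
	header('HTTP/1.0 403 Forbidden');
	echo 'Automated requests are not allowed.';
	exit;
}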

Yes, I tried both links; curl_exec() works and the pages load.

I found the problem. I was passing the URL to the disguise_curl() function after running it through urldecode() first, so the "+" in the "cars+london" part was becoming a space, resulting in the error page I was seeing. So you are right, cURL works for this page too.
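
For anyone who hits the same thing, a quick illustration of what was going wrong (using the URL from above):

$url = 'http://london.vivastreet.co.uk/cars+london';

// urldecode() treats "+" as an encoded space, so the path changes:
echo urldecode($url);    // http://london.vivastreet.co.uk/cars london

// rawurldecode() only decodes %XX sequences and leaves a literal "+" alone:
echo rawurldecode($url); // http://london.vivastreet.co.uk/cars+london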

Re: the second query, I think you have more control over the headers with cURL than with file_get_contents() (or fopen()), so the server can't easily tell whether the request is really coming from Firefox 3.6.3 as it pretends to be.
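
To be fair, file_get_contents() can also send custom headers through a stream context; cURL just gives you more fine-grained options (cookies, gzip decoding, redirect control and so on). A rough sketch, reusing the headers from the function above and a placeholder URL:

$url = 'http://www.example.com/'; // placeholder

$context = stream_context_create(array(
	'http' => array(
		'method'  => 'GET',
		'header'  => "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3\r\n" .
		             "Referer: http://www.google.com\r\n" .
		             "Accept-Language: en-us,en;q=0.5\r\n",
		'timeout' => 10,
	),
));

$html = file_get_contents($url, false, $context);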

You would essentially have to use a captcha-like technique, but then I don't know how you would let search engines in while keeping the curlers out.

Thank you, amac44. Yes, using captchas looks impractical...
