There is this function:

function disguise_curl($url) 
{ 
	$curl = curl_init(); 

	// setup headers - used the same headers from Firefox version 2.0.0.6
	// below was split up because php.net said the line was too long. :/
	$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
	$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
	$header[] = "Cache-Control: max-age=0"; 
	$header[] = "Connection: keep-alive"; 
	$header[] = "Keep-Alive: 300"; 
	$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
	$header[] = "Accept-Language: en-us,en;q=0.5"; 
	$header[] = "Pragma: "; //browsers keep this blank. 

	curl_setopt($curl, CURLOPT_URL, $url); 
	curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'); 
	curl_setopt($curl, CURLOPT_HTTPHEADER, $header); 
	curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com'); 
	curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); 
	curl_setopt($curl, CURLOPT_AUTOREFERER, true); 
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
	curl_setopt($curl, CURLOPT_TIMEOUT, 10); 

	$html = curl_exec($curl); //execute the curl command 
	if (!$html) 
	{
		// note: the original used an undefined $ch handle here; it must be $curl
		echo "cURL error number: " . curl_errno($curl);
		echo "cURL error: " . curl_error($curl);
		curl_close($curl);
		exit;
	}
  
	curl_close($curl); //close the connection 

	return $html; //and finally, return $html 
}

...that several people seem to use to scrape content off a website (to state the obvious, you would do "echo disguise_curl($url)").

Is there any way to detect if someone is doing that to my site, and block access to them or show a page with a specific message?

I've experimented with some sites to see if they manage to block access this way, and found http://london.vivastreet.co.uk manages to do that. I haven't been able to figure out how, but maybe someone can.

A second query: Why would someone write a complicated function like that when file_get_contents($url) does the same thing? Is it to avoid suspicion?

Thank you very much for your time.

All 7 Replies

I just realized that the specific URL I am not able to access with that function is http://london.vivastreet.co.uk/cars+london. Since that link wasn't accessible, I assumed the entire site was blocked, and so posted the home page URL, which does appear to be accessible through this function. Any idea why this is happening? Is the "+" in that URL doing something, or is there some way to block specific URLs from cURL?

I can access this link via cURL without any problem.

About blocking cURL to prevent scraping: I think using cURL is just like using a browser to get to your site. If you put something up that can be browsed to, someone else can fetch it with cURL.
How is this different from fetching the page with any other web browser and saving it offline?
It could be a normal, valid user; you never know.
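
As a rough sketch (illustrative only, not something I'd rely on), you could reject requests that are missing headers every real browser sends. That catches a bare curl_exec() or file_get_contents() call with default settings, but the disguise_curl() function above fakes exactly these headers, so it would walk straight through:

// naive server-side check - purely illustrative
$ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$lang = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';

// plain cURL and file_get_contents() usually send no Accept-Language
// (and often no User-Agent at all) unless told to
if ($ua === '' || $lang === '') {
	header('HTTP/1.0 403 Forbidden');
	echo 'Automated requests are not allowed.';
	exit;
}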

Yes, I tried both links; curl_exec() works and the pages load.

I found the problem. I was passing the URL to the disguise_curl() function after running it through urldecode() first, so the "+" in the "cars+london" part was becoming a space, resulting in the error page I was seeing. So you are right, cURL works for this page too.
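
For anyone who hits the same thing, a quick illustration of what was going wrong (using the URL from above):

$url = 'http://london.vivastreet.co.uk/cars+london';

// urldecode() treats "+" as an encoded space, so the path changes:
echo urldecode($url);    // http://london.vivastreet.co.uk/cars london

// rawurldecode() only decodes %XX sequences and leaves a literal "+" alone:
echo rawurldecode($url); // http://london.vivastreet.co.uk/cars+london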

Re: the second query, I think you have more control over the headers with cURL than with file_get_contents() (or fopen()), so the server can't easily tell whether the request is really coming from Firefox 3.6.3 as it pretends to be.
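
To be fair, file_get_contents() can also send custom headers through a stream context; cURL just gives you more fine-grained options (cookies, gzip decoding, redirect control and so on). A rough sketch, reusing the headers from the function above and a placeholder URL:

$url = 'http://www.example.com/'; // placeholder

$context = stream_context_create(array(
	'http' => array(
		'method'  => 'GET',
		'header'  => "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3\r\n" .
		             "Referer: http://www.google.com\r\n" .
		             "Accept-Language: en-us,en;q=0.5\r\n",
		'timeout' => 10,
	),
));

$html = file_get_contents($url, false, $context);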

You would essentially have to use a captcha-like technique, but then I don't know how you would let search engines in while keeping the curlers out.

Thank you, amac44. Yes, using captchas looks impractical...
