Folks,

Why do I keep on getting this error when everytime I input a url in the url field in the following web proxy script:

"The specified URL could not be returned due to a status code of 400."

It says unless the Status Code is 200 then to show errors. I guess Status 200 means the web server managed to serve the page without any problems.

<?php
    error_reporting(0);

    session_start();

        //Settings Instructions: https://darkpolitics.wordpress.com/2009/12/29/create-your-own-web-proxy-server/

    // turn debug messages on when debugging your proxy
    //$DEBUG = true;
    $DEBUG = false;

    // set this to the location of the webproxy page if you know where its going to be otherwise this function will work it out.
    // for performance you should hardcode this to your webproxy location
    //$PROXYURL = "http://www.mysite.com/myproxy.php";
    $PROXYURL = get_current_location(); // works out current scripts location

    // urls from orig search will be $_POST but then future links we proxify will be $_GET
    $url = $_REQUEST["url"];
    $useragent = $_POST["useragent"]; // will only be a POST from search form

    ShowDebug("useragent posted from search form = $useragent");

    // set the user-agent we will surf with. We only set on initial search and then use a session to pass this var to any
    // other content passed through the proxy. Make sure you have session cookies enabled for your proxy page!
    if(!empty($useragent)){
        if($useragent=="us"){
            $surf_useragent  = $_SERVER["HTTP_USER_AGENT"]; // use current agent
        }else if($useragent=="ie"){
            $surf_useragent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";     // use IE 7
        }else{ // must be ff as we only have 2 choices!! Add as required
            $surf_useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)"; // use FF3
        }
        // set a session for future calls through the proxy
        $_SESSION["surf_useragent"] = $surf_useragent;
    }else{
        $surf_useragent = $_SESSION["surf_useragent"];
    }

    ShowDebug("surf with agent = $surf_useragent");

    $err = false;
    $msg = "";
    $content = "";
    $subpathurl ="";
    $pathurl = "";
    $siteurl = "";

    // this list contains domains that this proxy will allow obviously in your own proxy you can remove this!!
    $whitelist = "technicallypolitical.com,strictly-software.com,infowars.com,prisonplanet.com,hashemian.com";
    $cansearch = false;

    ShowDebug("url = $url");
    ShowDebug("useragent = $useragent");
    ShowDebug("PROXYURL = $PROXYURL");

    if(!empty($url)){

        ShowDebug("url = $url");

        // make sure its valid with a protocol at the start
        if($url == "http://"){
            $err = true;
            $msg = "Please specify a full URL to access e.g http://www.darkpolitricks.com";
        }else if(!preg_match("/https?:\/\//",$url)){
            $err = true;
            $msg = "Please specify the protocol within the URL e.g http://";

            ShowDebug("error = $msg");
        }else{

            ShowDebug("get content from remote url $url");

            if(!empty($whitelist)){
                // check whether url is allowed
                $allowed = explode(",",$whitelist);
                $count = count($allowed); 
                $lowurl = strtolower($url);

                ShowDebug("check whether $lowurl is in whitelist of $whitelist");

                foreach($allowed as $val){
                    ShowDebug("check whether ".$val." is in $url");

                    if( strripos($lowurl, $val) !== false){
                        ShowDebug("This url $url is on whitelist matching $val");
                        $cansearch = true;
                        break;
                    }

                }
            }else{
                $cansearch = true;
            }

            if(!$cansearch){
                $err = true;
                $msg = "The url is not allowed to be accessed from this web proxy server.";
            }else{

                // crawl item e.g URL, script, CSS, image
                $html = mycrawler_single($url,$surf_useragent);

                $content = $html["html"];
                $status = $html["status"];
                $headers = $html["header"];
                $content_type = $html["content_type"];
                $connect_error = $html["message"];

                ShowDebug("connect error = $connect_error");
                ShowDebug("status = $status");

                // a status code 200 means we got a successful request back if we didn't then we have an issue
                if($status!="200"){

                    // 404 = Page not found
                    if($status=="404"){
                        $err = true;
                        $msg = "The specified URL could not be located.";
                    }else if(!empty($connect_error)){
                        $err = true;
                        $msg = $connect_error;

                        ShowDebug("CONNECT ERROR = $connect_error; msg = $msg");
                    }else{
                        $err = true;
                        $msg = "The specified URL could not be returned due to a status code of $status.";
                    }

                }else{

                    // need to replace all links in our returned content with links to the proxy so that future clicks are proxified
                    $urlinfo = parse_url($url);

                    // get root url to exend any relative links e.g http://www.mysite.com
                    $siteurl = $urlinfo["scheme"]."://".$urlinfo["host"];
                    if(!empty($urlinfo["path"])){
                        $pathurl = $siteurl.$urlinfo["path"]; 

                        // make sure file is removed in case we need current sub directory
                        $pospath = strripos($pathurl, "/");

                        if($pospath!==false){

                            ShowDebug( "take up to / as pos $pospath in $pathurl<br />");

                            $subpathurl = substr($pathurl,0,$pospath);
                        }else{
                            $subpathurl = $pathurl."/";
                        }
                    }else{
                        $pathurl = $siteurl;
                        $subpathurl = $pathurl."/";
                    }

                    ShowDebug("SiteURL = $siteurl path = $pathurl");

                    // for text related content we scan for links so that we can change them all to go through our proxy
                    // for images and other non textual content we have no need to change the links
                    if(preg_match("/(text|html|xml|xhtml|css|javascript)/i", $content_type )){
                    //if(preg_match("/(text|html|xml|xhtml)/", $content_type )){

                        ShowDebug("parse links");

                        // make sure all links are rerouted through proxy
                        $content = reformat_links($content,$siteurl,$subpathurl);

                    }

                    // As all links/src values from the page we visit need to pass through the proxy as well we need to ensure
                    // to output the correct header for file. For example a PNG image needs to have the correct header e.g image/png

                    ShowDebug("output content-type: $content_type");

                    header( $content_type );

                    ShowDebug("output content = $content");

                    // output content to screen
                    echo $content;
                }
            }
        }
    }else{
        // default url to http://
        $url = "http://";
    }

    // Will return the current location of the script running. If the proxy page is moved around a lot then this
    // will work out where it is but for performance set the value at the top in $PROXYURL
    function get_current_location(){

        $url = "";

        if( $_SERVER["SERVER_PORT"]== 443){
            $protocol = "https://";
        }else{
            $protocol = "http://";
        }

        $url = $protocol . $_SERVER["SERVER_NAME"] . $_SERVER["SCRIPT_NAME"];

        return $url;
    }

    // retrieve link destinations and modify them so that when they are clicked the content is passed through the proxy
    // as well. I look for src/href tags. Currently this does not handle URLs defined like so href="../"
    function reformat_links($content,$siteurl,$subpathurl){ 
        // need to make all URLs go through our proxy! use ISAPI rewriting to make it nicer this is just a guide
        global $PROXYURL;

        $relurl = $PROXYURL . "?url=" .$siteurl; // for urls like url="/sub/page.htm"
        $cururl = $PROXYURL . "?url=" .$subpathurl; // for urls like url="page.htm"
        $absurl = $PROXYURL . "?url=";  // for urls like url="http://www.mysite.com/page.htm"

        ShowDebug("reformat rel urls = $relurl");
        ShowDebug("reformat cur urls = $cururl");
        ShowDebug("reformat abs urls = $absurl");

        $newcontent = $content;

        // get all links and reformat
        // as we don't want to do the same links multiple times which happens I use placeholders first and then
        // once every possible location has been marked I insert the link to the proxy

        // look for absolute urls e.g url="http://www.mysite.com/blah.asp"
        $newcontent = preg_replace("/((?:href|src)=['\"])(http.*?)(['\"])/i","$1##ABSURL##$2$3",$newcontent);

        // get links starting with / e.g url="/sub/page.htm"
        $newcontent = preg_replace("/((?:href|src)=['\"])(\/.*?)(['\"])/i","$1##RELURL##$2$3",$newcontent);

        // get links starting like url="page.htm"
        $newcontent = preg_replace("/((?:href|src)=['\"])([^#h\/][^#t][^t][^p].*?)(['\"])/i","$1##CURURL##$2$3",$newcontent);

        // now replace placeholders 
        $newcontent = str_replace("##RELURL##",$relurl,$newcontent);    

        $newcontent = str_replace("##CURURL##",$cururl,$newcontent);    

        $newcontent = str_replace("##ABSURL##",$absurl,$newcontent);                

        ShowDebug("return content");

        return $newcontent; 
    } 

    // code to load remote content such as HTML files, CSS, Images etc
    // To follow more than 3 redirects (e.g ISAPI rewrites then change $maxredirs=XX)
    function mycrawler_single($url, $useragent="",$timeout=10, $maxredirs=3) 
    {
        ShowDebug( "IN mycrawler_single Get URL content from $url $useragent maxredirs = $maxredirs");

        $urlinfo = parse_url($url);

        if (empty($urlinfo["scheme"])) {$urlinfo = parse_url("http://".$url);}                                                                  
        if (empty($urlinfo["path"])) {$urlinfo["path"]="/";}

        if (empty($urlinfo["port"]))
        {
                switch($urlinfo["scheme"])
                {
                    case "http":
                        $urlinfo["port"] = 80;
                        break;  
                    case "https":
                        $urlinfo["port"] = 443;
                        break;                
                }
        }

        // if no agent is supplied use default agent
        if (empty($useragent)) $useragent = $_SERVER["HTTP_USER_AGENT"];

        ShowDebug("useragent to use = $useragent");

        if (isset($urlinfo["query"]))
        {
            $request = "GET ".$urlinfo["path"]."?".$urlinfo["query"]." ";
        } else {   
            $request = "GET ".$urlinfo["path"]." ";
        }

        // form request
        $request .= "HTTP/1.0\r\n";
        $request .= "Host: ".$urlinfo["host"]."\r\n";
        $request .= "User-Agent: ".$useragent."\r\n";
        $request .= "Connection: close\r\n\r\n";

        ShowDebug( "request = ".$request);

        ShowDebug( "open ".$urlinfo["host"].":".$urlinfo["port"]);

        $fp = @fsockopen($urlinfo["host"], $urlinfo["port"], $errno, $errstr, $timeout);

        if (!$fp)
        {
            ShowDebug( "ERROR! (".$errno.")".$errstr);

            $urlinfo["header"] = "";
            $urlinfo["html"] = "Error: $errno $errstr"; 
            $urlinfo["status"] = 400; // bad request
            $urlinfo["content_type"] = "";
            $urlinfo["message"] = "The request could not be made. $errno $errstr";

            return $urlinfo;  
        }
        else
        {   
            ShowDebug($request);

            fwrite($fp, $request);

            while (!feof($fp)) 
            {
                if(isset($data)){
                    $data .= fgets($fp, 4096);                      
                }else{
                    $data = fgets($fp, 4096);
                    ShowDebug( "take status code from 9,4 in data = ".$data);

                    // status code should be here! if not its a bad request
                    $code = trim(substr($data,9,4));                    
                    ShowDebug( "Status Code = ".$code);                 
                }
            }

            ShowDebug( "Status Code = ".$code); 

            // if no status code default to 400 = Bad Request
            if(empty($code) || !is_numeric($code)){

                $code = 400;

                ShowDebug("default to bad request 400");
            }

            ShowDebug("status code = $code - response = $data");

            fclose($fp);   

            $tmp = explode("\r\n\r\n", $data, 2);

            // We will return an array with these parts header, html, status code and content-type
            $urlinfo["header"] = $tmp[0];
            $urlinfo["html"] = $tmp[1]; 
            $urlinfo["status"] = $code;
            $urlinfo["content_type"] = get_content_type($tmp[0]);
            $urlinfo["message"] = "";

            ShowDebug( "The Status Code = ".$urlinfo["status"]." from header: ".$urlinfo["header"]);

            // handle redirects
            ShowDebug( "do we need to redirect? pos of location in header = ". stripos($urlinfo["header"], "location:"). " maxredirs = $maxredirs");

            if ((stripos($urlinfo["header"], "location:")) && ($maxredirs > 0))
            {
                ShowDebug( "found location in header and we CAN REDIRECT");

                preg_match("/\r\nlocation:(.*)/i", $urlinfo["header"], $match);

                if ($match)
                {    
                    $redirect = trim($match[1]);

                    ShowDebug( "Redirecting to ".$redirect);
                    ShowDebug( "$maxredirs is currently $maxredirs");

                    $maxredirs--;                         

                    ShowDebug( "$maxredirs after count down is now $maxredirs");

                    ShowDebug( "DO A REDIRECT TO $redirect");

                    return mycrawler_single($redirect, $useragent, $timeout, $maxredirs);
                }
            }       

            ShowDebug( "RETURN FROM mycrawler_single");

            // return array of header/html
            return $urlinfo;          
        }        
    }

    // will check headers for the content-type. We need this so that images are displayed correctly
    function get_content_type($headers){
        $content_type = "";

        if(!empty($headers)){
            $headerarray = explode("\r\n", $headers);
            foreach($headerarray as $head){

                ShowDebug( "header item = ".$head);

                if(preg_match("/Content-Type: .+$/i",$head)){
                    $content_type = $head;
                    break;
                }               
            }
        }

        ShowDebug("return $content_type");

        return $content_type;
    }

    // Debug function if you want to show debug e.g for testing your proxy then turn $DEBUG = True at top of page
    // for performance all ShowDebug statements should be removed on production to reduce unneccessary function calls
    function ShowDebug($msg){
        global $DEBUG;
        if(!$DEBUG) return;
        if(!empty($msg)){
            echo htmlentities($msg)."<br />";
        }
    }

if(empty($url) || $url=="http://" || $err){
?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US"> <head> <title>Dark Politricks Web Proxy Example</title> <meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/> <meta name="keywords" content="DarkPolitricks, WebProxy, Proxy, Proxies, Proxi, Proxied, Forwarded-For" /> <meta name="description" content="An example of a web proxy, how you can make your own web proxy to bypass basic filtering" /> <!-- Put all these in an external stylesheet --> <style>
    body{background:lightblue;}
    p{font-weight:bold;}
    .error{color:red;}
    .msg{color:green;}
    #main{margin:auto;width:600px;}
    #search{margin:auto;width:600px;}
    label{font-weight:bold;font-face:Tahoma,Arial;}
    #url{width:300px;}
    #searchflds{border:1px solid black;}
    dt{float:left;}
    dd{float:left;}
    #domainlist{font-style:italic;color:navy;}
    #searchbutton{text-align:right;}
    #agent{clear:both;}
    .agent{margin-top:10px;}
    #ie{margin-left:-12px;}
</style> </head> <body> <div id="main"> <h1>Example of a WebProxy</h1> <?php
        if(!empty($msg)){
            if($err){
                echo "<p class='error'>$msg</p>";
            }else{
                echo "<p class='msg'>$msg</p>";
            }
        }
        ?> <p>This is an example page and can only be used to access the following domains:</p> <p id="domainlist">technicallypolitical.com, strictly-software.com, infowars.com, prisonplanet.com</p> <p>Please read the related article at <a href="http://www.darkpolitricks.com/2009/12/create-your-own-web-proxy-server" title="Create your own web proxy">www.darkpolitricks.com</a> to get more information as well as a link to download the code so that you can create your own web proxy.</p> <div id="search"> <form id="searchanon" name="searchanon" method="POST"> <fieldset id="searchflds"> <dl> <dt><label for="where">Where To</label></dt> <dd><input type="text" id="url" name="url" value="<?php echo $url ?>" maxlength="100" /> </dl> <dl id="agent"> <dt class="agent"><label for="useragent">User-Agent</label></dt> <dd class="agent"><input type="radio" name="useragent" id="ie" value="ie" <?php if($useragent=="ie"){ echo 'checked="true"'; } ?> /><label for="ie" title="Use IE 7 user-agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)">IE 7</label> <input type="radio" name="useragent" id="ff" value="ff" <?php if($useragent=="ff"){ echo 'checked="true"'; } ?> /><label for="ff" title="Use FireFox 3 user-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)">FireFox 3</label> <input type="radio" name="useragent" id="us" value="us" <?php if($useragent=="us"){ echo 'checked="true"'; } ?> /><label for="ff" title="Keep existing agent: <?php echo $_SERVER["HTTP_USER_AGENT"] ?>">Keep Existing User-Agent</label> </dd> </dl> </fieldset> <p id="searchbutton"><input type="submit" value="Go There" id="submitsearch" name="submitsearch" /> </form> </div> </div> </body> </html> <?php
}
?>

[/code]

And, how do I remove the restrictions so any website can be viewed apart from:

[code]

$whitelist = "technicallypolitical.com,strictly-software.com,infowars.com,prisonplanet.com,hashemian.com";

[/code]

I removed the above mentioned urls from the $whitelist and it worked as I was able to view google but then the 400 error started appearing.
How would you change the code and where ?

Thanks for the reply Droopsnoot!

As far as I understood, the websites in the $whitelist can only be viewed in that proxy and none other. No, I don't want to turn the websites listed in the $whitelist into $blacklist. I just want to remove the restrictions so any websites can be viewed.

As far as I understand, that 400 error is coming from the proxy server. I reckon it detects the webpage is not being presented by the web server and it checks what error the web server shows and then it shows us it's own custom error message. In this case:

The specified URL could not be returned due to a status code of 400.

The concerned lines are 112 to 127:

    // a status code 200 means we got a successful request back if we didn't then we have an issue
                if($status!="200"){

                    // 404 = Page not found
                    if($status=="404"){
                        $err = true;
                        $msg = "The specified URL could not be located.";
                    }else if(!empty($connect_error)){
                        $err = true;
                        $msg = $connect_error;

                        ShowDebug("CONNECT ERROR = $connect_error; msg = $msg");
                    }else{
                        $err = true;
                        $msg = "The specified URL could not be returned due to a status code of $status.";
                    }

Recommended Answers

All 3 Replies

Mod,

I need to edit my op as part of my post mssg got tangled with the code and it is seeming all messy. I want to edit my op now. How can I do it ?

Does anyone know of any good stable web proxy php scripts ?
I'm going to add my link tracker code so the proxified links get click-tracked. That's all. Tried MiniProxy and managed to add the tracker but it tracks not only the page the user is in but any links present on the page the user is at and tracking those extra links is not what I want and so I was looking into this proxy script instead which is now becoming more bothersome than the first.

You are welcome to sugges a simple php proxy script link.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.