Hi everyone, I'm working on a class project where we spider patents from the USPTO (the U.S. Patent and Trademark Office) and eventually map out our findings. Our teacher provided a script and expects us to hardwire directly to a modem, not a router. That is asking a lot, since I need to provide WiFi to many devices.

To work around the modem requirement, I first set up port forwarding.

The code has to use port 7890, so I assigned my computer a manual IP address and forwarded port 7890 to that address: 7890 --> 192.168.0.165

The first part of the code creates a SOCKS server.
I start it in the (OS X) Terminal using:

perl TaskDist.pl filename.txt

TaskDist.pl code is as follows:

use strict;
use IO::Socket;

my $sock = new IO::Socket::INET(
                   LocalHost => '192.168.0.165', #change to your pc ip as server ip
                   LocalPort => 7890,
                   Proto     => 'tcp',
                   Listen    => SOMAXCONN,
                   Reuse     => 1);
$sock or die "no socket :$!";
STDOUT->autoflush(1);
my($new_sock, $buf);

open(f, $ARGV[0]);
my @theids = <f>;
close(f);

my $theid;
foreach $theid (@theids){
	$new_sock = $sock->accept();
    my $buf = <$new_sock>;
	print ($new_sock $theid."\n");
	print $buf . "   " . $theid."\n";
    close $new_sock;
}

This part seems to work fine using my forwarded port. I believe my SOCKS server is set up at this point, but I don't know whether the next script in this series should use my internal or external IP.

I am having a hard time with the next part of the code: the input in the Terminal, and possibly the code itself. I run it while my SOCKS server is open in another Terminal.

use IO::Socket;
use HTML::TokeParser;
use LWP;
use URI::Escape;
use Sys::Hostname;
use strict;

my $host = $ARGV[0];
STDOUT->autoflush(1);
my $position=$ARGV[1];

my $browser = LWP::UserAgent->new();
$browser->agent("Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)");
$browser->proxy('http', $ARGV[1]);
my $response;


while (1){
	my $sock = new IO::Socket::INET(
	                  PeerAddr => $host,
	                  PeerPort => 7890,
	                  Proto    => 'tcp');
	$sock or die "no socket :$!";
	if (length($position)==0){
		$position=hostname();
	}
	print ($sock $position."\n");
	my $filename= <$sock>;
	close $sock;
	
	$filename =~ s/\n//;

	open(f, $filename);
	my @theids = <f>;
	close(f);

	my $theid;

	foreach $theid (@theids){
	#http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=4484233.PN.&OS=PN/4484233&RS=PN/4484233
	#http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220010020884%22.PGNR.&OS=DN/20010020884&RS=DN/20010020884
		$theid =~ s/\n//;
		my $pat_url;
		if (length($theid)<=8){
		$pat_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=" . $theid . ".PN.&OS=PN/" . $theid . "&RS=PN/" . $theid;		
		}#151.207.240.23
		else{
		$pat_url = "http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%22" . $theid . "%22.PGNR.&OS=DN/" . $theid . "&RS=DN/" . $theid;			
		}#151.207.241.118
	#	print $pat_url;
		my $patno = $theid;

		if (-e "..\\dw_pat\\$patno.html"){
			select(stdout);
			print "skip $patno\n";
			
			next;
		}
		
		select(stdout);
		#print "getting pat $patno: $pat_url\n\n";
		print "getting pat $filename:$patno:\n\n";
		
		do {
			$response = $browser->get($pat_url);
			if (!$response->is_success()){
				select(stdout);
				print $response->status_line, "\n\n";
			}
			sleep(rand(7)+1);
		}
		while (!$response->is_success());			
		
		my $pat_desc = $response->content();	
		
		open(fpat, "> ..\\dw_pat\\$patno.html");
		select(fpat);
		print $pat_desc;
		close(fpat);
	}
}
exit;

We are supposed to use a proxy server to run this task. I found a few, but I don't know whether I am supposed to use a SOCKS proxy or some other kind, or how I should enter the command in the Terminal. I also don't know whether I should use my local IP address again, as I did to get the SOCKS server working on my computer, or my external IP.

The instructions say to enter:

perl spider_idThread.pl SERVERIP PROXYIP

This is one of the commands I have tried, along with the resulting error. I have tried internal and external IPs, SOCKS and regular proxies, adding ":port#" to the end of each, and even the 7890 from above.

$ perl spider_idThread.pl 192.168.0.165 58.86.219.231
no socket :Connection refused at spider_idThread.pl line 23.

Line 23 is this from above:

$sock or die "no socket :$!";

To me, this looks like a problem of figuring out which IP addresses to use. Any ideas? I appreciate your help.

I am attaching the relevant parts of the code and instructions; maybe someone can help with them. At this point, I have downloaded the patent numbers related to the subjects I am researching and split them into documents of 300 numbers each, one number per row, following the naming scheme the code expects. I now need to download the patents using this part of the Perl script. It will not run on my computer, and I suspect that I do not have all the Perl libraries that are needed, even though I added LWP and CPAN after installing Xcode.

I have gotten past the step about setting up a SOCKS server, which is 2.2 in this instruction sheet. I have tried on Leopard and XP, on WiFi and wired switches, with port forwarding and without. I have tried both my local and external IP addresses in the script (TaskDist.pl for the SOCKS server) and both internal and external addresses in the input for spider_idThread.pl.

I have also tried several proxies of different types, using several different ports, in different countries.

I received an error 501 as the output. After searching around, I installed Xcode, LWP, and CPAN on my Mac OS X 10.5.6 install.

Is anyone else able to run this successfully? I have attached the instructions and code. I have put a lot of time and effort into this, and it is eating away at me; if anyone can help, I would be most grateful. The script needs to download about 46K patents from the USPTO, but if you can get even a small handful, enough to help me run the code on my machine, that would be amazing.

This is slightly outside my normal realm; I rarely (almost never) use Perl. However....

"perl -w ..." is your friend. It'll show you a lot of stuff that's normally 'hidden'. For example, it appears that filehandles need to be UPPERCASE.

The second script needs to use the same IP address the listener used to open its socket. By way of analogy, suppose you are talking to your instructor on your cell phone and to your S.O. via Skype. To whisper a sweet nothing into your S.O.'s ear, you would speak into your computer's microphone; it just wouldn't do to coo thusly into your cell phone's mic. :) Networking and sockets behave essentially the same way, except the connections and paths have been abstracted to a virtual level, away from the physical.
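To make that concrete, here is a minimal sketch (not your instructor's scripts) in which a listener and a client meet on the same address and port inside one process. The loopback address, the OS-chosen port, and the names "worker-1" and "ids_001.txt" are all assumptions made up for illustration; in your setup the address would be whatever TaskDist.pl binds to.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Assumption: loopback plus an OS-chosen port, so the sketch runs anywhere.
my $addr = '127.0.0.1';

my $listener = IO::Socket::INET->new(
    LocalAddr => $addr,
    LocalPort => 0,        # 0 = let the OS pick a free port
    Proto     => 'tcp',
    Listen    => 5,
) or die "no listener: $!";
my $port = $listener->sockport;

# The client must dial the same address and port the listener bound to.
my $client = IO::Socket::INET->new(
    PeerAddr => $addr,
    PeerPort => $port,
    Proto    => 'tcp',
) or die "no socket: $!";

print {$client} "worker-1\n";        # client announces itself (made-up name)

my $conn = $listener->accept or die "accept failed: $!";
chomp(my $who = <$conn>);            # server reads the announcement
print {$conn} "ids_001.txt\n";       # server hands out a made-up task file name
close $conn;

chomp(my $task = <$client>);         # client reads its task
close $client;
close $listener;

print "$who was handed $task\n";
```

If you change PeerAddr to an address the listener is not bound to, the client's new() should fail, which is the same "no socket" situation as in the question.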

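Use 'netstat -an | grep 7890' to verify that the socket is really open and awaiting connections; look for a line in the LISTEN state. (On Linux, '-l' limits the output to listening sockets, but OS X's netstat does not use '-l' that way.)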
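You can run the same kind of check from Perl by simply trying to connect. This is only a sketch: port_is_open is a hypothetical helper name, and the demo uses loopback with an OS-chosen port so it is self-contained.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: returns 1 if a TCP connect to $host:$port succeeds,
# 0 otherwise (for example on "Connection refused").
sub port_is_open {
    my ($host, $port) = @_;
    my $probe = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 3,
    );
    return 0 unless $probe;
    close $probe;
    return 1;
}

# Self-contained demo: open a listener, probe it, close it, probe again.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,        # 0 = let the OS pick a free port
    Proto     => 'tcp',
    Listen    => 5,
) or die "no listener: $!";
my $port = $listener->sockport;

my $while_listening = port_is_open('127.0.0.1', $port);
close $listener;
my $after_close = port_is_open('127.0.0.1', $port);

print "while listening: $while_listening, after close: $after_close\n";
```

For the setup in the question, the call would be port_is_open('192.168.0.165', 7890), run while TaskDist.pl is up in the other Terminal.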

Once I capitalized the FILEHANDLES, the first script ran and awaited connections, and the second script talked to the first.

So capitalize your f, stdout, fpat, etc., filehandles. Use the same IP address the 'server' uses for its listener socket. And use 'perl -w' to run each program until you've worked out the bugs. And there's no reason you can't put more debugging prints into each script so you can see what they're doing.
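As an alternative to capitalizing, Perl also lets you keep a filehandle in an ordinary "my" variable. Here is a minimal sketch of reading an ID file that way; this is a swapped-in technique, not what the scripts above do. The read_ids name and the temporary sample file are assumptions for illustration, and the two IDs are the ones from the URL comments in the second script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read one ID per line and stop loudly if the file
# cannot be opened (the scripts above never check open's return value).
sub read_ids {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    chomp(my @ids = <$fh>);          # strip the trailing newlines
    close $fh;
    return grep { length } @ids;     # drop empty lines
}

# Build a small sample file so the sketch is self-contained.
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
print {$tmp} "4484233\n20010020884\n\n";
close $tmp;

my @ids = read_ids($tmpname);
print scalar(@ids), " ids: @ids\n";
```

The "or die" on open is the useful part: with a wrong file name, the scripts above would quietly work with an empty list instead of telling you what went wrong.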
