Hello, I have 500,000+ .txt files, about 7+ GB of data in total.

I am using Python to load them into a SQLite database. I am creating two tables; the first holds the primary key and the hyperlink to the file.

For the other table I am using an entity extractor that was developed in Perl by a coworker.

To accomplish this I am using subprocess.Popen(). The problem I am running into is the load time: for every file, Perl starts, parses the file, and then exits, and the whole cycle repeats for the next file. I would like Perl to start once, stay open, and let me pass my reports to it ad hoc. Here is a snippet:

for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
   f = open(infile)
   reportString = f.read()
   f.close()

   numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl" , reportString], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
   r = numberExtractor.communicate()
   #print r

Using just Python or just Perl isn't really an option for me. The only other workaround I have is to run all the files through the Perl scripts (there are 3 of them) and the Python script, and then write another Python script that merges the results. But that would be time-consuming when it comes to data loading, and this needs to be a repeatable process for use by many people.

So my question is: is there a better way to call this Perl script? Ideally the script would be started once, stay open, and just accept input as I send it, basically an interactive Perl module. But I am open to anything.

All 3 Replies

It sounds like you would have to alter the Perl script to accept its input from some source other than the command line. You could use multiprocessing to run in parallel, or use one of the parallel Python packages, especially if you have multiple cores:
http://github.com/mattharrison/coreit
http://www.parallelpython.com/content/view/17/31/
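
As a rough sketch of the parallel suggestion above: each worker below launches one extractor process per report and collects its stdout. It uses ThreadPool from the multiprocessing package (threads rather than separate Python processes, since each worker mostly waits on an external process). EXTRACTOR_CMD is a placeholder that merely stands in for the Perl script, and the names extract and extract_all are hypothetical.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool

# Placeholder command standing in for the Perl extractor (an assumption):
# it reads a whole report on stdin and prints the report's length.
EXTRACTOR_CMD = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]


def extract(report):
    """Run one extractor process for one report and return its stdout."""
    proc = subprocess.Popen(
        EXTRACTOR_CMD,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text-mode pipes
    )
    # communicate() is fine here because each process handles exactly one report.
    out, _ = proc.communicate(report)
    return out.strip()


def extract_all(reports, workers=4):
    """Process reports concurrently; results come back in input order."""
    pool = ThreadPool(workers)
    try:
        return pool.map(extract, reports)
    finally:
        pool.close()
        pool.join()


results = extract_all(["abc", "hello world", ""])
print(results)
```

This still pays the interpreter start-up cost once per file; it only overlaps those costs across cores.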

So the Perl script has been modified: it now accepts input dynamically and only terminates when I explicitly tell it to.

But now I am having trouble reading the output. If I use communicate(), my subprocess is terminated by the next iteration of my loop and I get an I/O error. If I use readline() or read(), it locks up.

This version deadlocks my system, and I have to force-close Python to continue:

numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)

for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
   f = open(infile)
   reportString = f.read() 
   f.close()
   reportString = reportString.replace('\n',' ')
   reportString = reportString.replace('\r',' ')
   reportString = reportString +'\n'
   numberExtractor.stdin.write(reportString)
   x = numberExtractor.stdout.read()   
   print x

This version gives me my stdout one time, but on the second iteration I get ValueError: I/O operation on closed file:

numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)

for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
   f = open(infile)
   reportString = f.read() 
   f.close()
   reportString = reportString.replace('\n',' ')
   reportString = reportString.replace('\r',' ')
   reportString = reportString +'\n'
   numberExtractor.stdin.write(reportString)
   x = numberExtractor.communicate()   
   print x

If I just run it like this, it runs through all the code without errors, and the print line shows <open file '<fdopen>', mode 'rb' at 0x015dbf08> for each item in my folder:

numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)

for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
   f = open(infile)
   reportString = f.read() 
   f.close()
   reportString = reportString.replace('\n',' ')
   reportString = reportString.replace('\r',' ')
   reportString = reportString +'\n'
   numberExtractor.stdin.write(reportString)
   x = numberExtractor.stdout
   print x

Hopefully I am making a simple mistake, but is there some way I can just send a file to my Perl script (stdin), get the stdout, and then repeat, without having to reopen my subprocess for every file in my loop?

Connect stderr to a pipe as well, and flush the pipes after writing; see Doug Hellmann's "Interacting with Another Command", which I think covers what you want to do. Also take a look at pexpect, which should suffice from what you have explained and is simple to use.
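
A minimal sketch of the flush-and-read-a-line approach, assuming the extractor answers every input line with exactly one output line and flushes its own stdout after each reply (in Perl, typically by setting $| = 1). The child here is a small Python stand-in rather than the real Perl script, and the names start_worker, ask, and stop_worker are hypothetical. Two points matter: read() waits for end-of-file, which never arrives while the child stays open, so the loop uses readline(); and communicate() closes the pipes, so it is not called inside the loop at all.

```python
import subprocess
import sys

# Stand-in for the Perl extractor (an assumption for illustration): a child that
# reads one line, writes exactly one line back, and flushes after every reply.
CHILD_CODE = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line.strip().upper() + '\\n')\n"
    "    sys.stdout.flush()\n"
)


def start_worker():
    """Start one long-lived child with text-mode pipes."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )


def ask(worker, report):
    """Send one report as a single line and read back the single reply line."""
    one_line = report.replace("\r", " ").replace("\n", " ")
    worker.stdin.write(one_line + "\n")
    worker.stdin.flush()  # push the request out of Python's buffer
    # readline() returns at the newline; read() would block until end-of-file.
    return worker.stdout.readline().rstrip("\n")


def stop_worker(worker):
    """Close stdin so the child sees end-of-file and exits, then wait for it."""
    worker.stdin.close()
    exit_code = worker.wait()
    worker.stdout.close()
    return exit_code


worker = start_worker()
replies = [ask(worker, text) for text in ["sn 123", "two\nlines"]]
exit_code = stop_worker(worker)
print(replies)  # ['SN 123', 'TWO LINES']
```

If stderr is also connected to a pipe, it has to be drained as well; otherwise the child can block once that pipe's buffer fills up.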
