Hello there!
I wrote this code to remove certain sequences, but it takes more than an hour to run on a supercomputer. Could you help me simplify it so that it runs faster?
Big thanks!
/Robert

#!/usr/local/bin/perl -w

open (INPUT, "phastCons_200_chr6.txt") and print "phastCons is open\n";
open (OUT, ">phastCons_noex_chr6.txt") and print "phastCons_noex is open\n";

for $line_ph (<INPUT>) {
$overlap = 0;
@fields_ph = split (/\s+/, $line_ph);
open (INPUT2, "refFlat.txt") and print "refFlat.txt is open";
for $line (<INPUT2>) {
if ($overlap == 1) {
next;
}
@fields = split (/\s+/, $line);
if ($fields_ph[1] eq $fields[2] && $fields_ph[2] < $fields[5] && $fields_ph[3] > $fields[4]) {
@ex_start = split (/,/, $fields[9]);
@ex_end = split (/,/, $fields[10]);
for $i (0 .. $fields[8]-1) {
if ($fields_ph[2] < $ex_end[$i] && $fields_ph[3] > $ex_start[$i]) {
$overlap = 1;
}
}
}
}
close (INPUT2);
if ($overlap == 0) {
print OUT "$line_ph";
}
}
close (OUT);

Hi,

From the code below, I can see that you are comparing two files and, based on some comparison, either skipping a line or writing it to a file. I don't understand why you open the second file on EVERY iteration: open (INPUT2, "refFlat.txt"). You are only reading from it anyway.
Instead, you can read the contents of both files into two separate arrays, then iterate over the arrays, compare, and do whatever you want.

My guess is that opening and closing the SAME file on every iteration may be a bit time-consuming, because when you call open, the underlying operating system checks the file's existence, permissions and more. That is redundant in your case, because you are reading the same file on every iteration.
Note: This may not be the case with supercomputers.

...
open (INPUT2, "refFlat.txt") and print "refFlat.txt is open"; # WHY?
for $line (<INPUT2>) {
if ($overlap == 1) {
next;
}
@fields = split (/\s+/, $line);
if ($fields_ph[1] eq $fields[2] && $fields_ph[2] < $fields[5] && $fields_ph[3] > $fields[4]) {
@ex_start = split (/,/, $fields[9]);
@ex_end = split (/,/, $fields[10]);
for $i (0 .. $fields[8]-1) {
if ($fields_ph[2] < $ex_end[$i] && $fields_ph[3] > $ex_start[$i]) {
$overlap = 1;
}
}
}
}
close (INPUT2);
...
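
The suggestion to read the second file only once could be sketched roughly as below. This is a minimal sketch, not the poster's code: the sub names parse_refflat_line and load_genes and the sample line are made up for illustration, and the column positions are simply the indices used in the script above.

```perl
use strict;
use warnings;

# Parse one refFlat-style line into a hash. Column positions follow the
# indices used in the original script: 2 = chromosome, 4/5 = transcript
# start/end, 9/10 = comma-separated exon starts/ends.
sub parse_refflat_line {
    my ($line) = @_;
    my @f = split /\s+/, $line;
    return {
        chrom    => $f[2],
        tx_start => $f[4],
        tx_end   => $f[5],
        ex_start => [ split /,/, $f[9] ],
        ex_end   => [ split /,/, $f[10] ],
    };
}

# Read every line from an already opened filehandle exactly once and
# return a reference to an array of parsed records.
sub load_genes {
    my ($fh) = @_;
    my @genes;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        push @genes, parse_refflat_line($line);
    }
    return \@genes;
}

# Demo with an in-memory filehandle and made-up data. In the real script
# the handle would come from opening refFlat.txt once, before the loop
# over the phastCons file.
my $sample = "GENE1\tNM_1\tchr6\t+\t100\t500\t100\t500\t2\t100,300,\t200,500,\n";
open(my $sample_fh, '<', \$sample) or die "Can't open in-memory sample: $!";
my $genes = load_genes($sample_fh);
close $sample_fh;
printf "%d gene(s) loaded, first has %d exon(s)\n",
    scalar @$genes, scalar @{ $genes->[0]{ex_start} };
```

Parsing each line once also means the split calls are no longer repeated for every line of the first file, which is probably a larger saving than avoiding the repeated open.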

I have a few tips for you:
- Use the strict and warnings pragmas, so that you can debug.
- Are you trying to overwrite the file? You have opened it in WRITE mode and write to it on every iteration. Instead, you can open the file in append mode.
- Make sure you close the files you open.
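
To make the difference between write mode and append mode concrete, here is a small sketch using strict, warnings, lexical filehandles and three-argument open. The helpers write_lines and read_lines are hypothetical names, and the demo writes to a scratch file rather than to the real output. Note that in the script above OUT is opened once, before the loop, so '>' truncates the file only once per run; with '>>' the output of repeated runs would accumulate in the same file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the given lines to $path using the given open mode:
# '>' truncates the file first, '>>' appends to whatever is there.
sub write_lines {
    my ($mode, $path, @lines) = @_;
    open(my $out, $mode, $path) or die "Can't open $path: $!";
    print {$out} @lines;
    close($out) or die "Can't close $path: $!";
    return;
}

# Return all lines of $path as a list.
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Can't open $path: $!";
    my @lines = <$in>;
    close $in;
    return @lines;
}

my (undef, $path) = tempfile(UNLINK => 1);   # scratch file for the demo
write_lines('>',  $path, "first run\n");
write_lines('>',  $path, "second run\n");    # replaces "first run"
write_lines('>>', $path, "appended\n");      # added after "second run"
print read_lines($path);
```

The demo prints "second run" followed by "appended": the second '>' open discarded the first line, while '>>' kept what was already there.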

hope this helps.
kath.

The code can be written more efficiently. I do not have the time right now to look at it in detail but I will check back later today.

Dear Kath,
Thank you so much for your quick reply. I will do what you suggested. I am now testing append mode on the supercomputer. Once again, big thanks for your help and for the hints.
With kind regards,
/Robert

Thank you, Kevin. I will wait for your suggestion. Big thanks!
/Robert

Can we see some of the data the script is processing? Does it print just one line of data to the output file?

After looking at the code I need some explanations. You are using a binary flag, $overlap, to control some behavior of the script.

You open the first file and read in the first line and $overlap is false (0).

You open the second file and check if $overlap is true (1) and if it is you go to the next iteration (the next line) of the second file.

Now if $overlap is true (1) all the code does is loop through all the lines of the second file without ever doing anything else. If this is a big file that will waste lots of time.

So now the script reads all the lines of the second file and gets back to the second line of the first file. $overlap is once again set to false.

Then it does the same thing again like I described above. It loops through the lines of the second file until $overlap is set to true. Then it loops through all the rest of the lines in the second file again. Then it gets back to the third line of the first file.

That is most likely what is causing the script's runtime to be so long.
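
One way to avoid reading the rest of the second file after an overlap has been found is to leave the loop with last instead of skipping line by line with next. The sketch below assumes the second file's lines are already in an array; the sub name region_hits_exon and the sample data are invented for illustration, and the comparisons are copied from the original script.

```perl
use strict;
use warnings;

# Check one region against refFlat-style lines and stop at the first
# overlapping exon. Returns (found, number_of_lines_examined).
sub region_hits_exon {
    my ($chrom, $start, $end, $ref_lines) = @_;
    my ($found, $examined) = (0, 0);
    LINE: for my $line (@$ref_lines) {
        $examined++;
        my @f = split /\s+/, $line;
        # Same transcript-level test as in the original script.
        next LINE unless $chrom eq $f[2] && $start < $f[5] && $end > $f[4];
        my @ex_start = split /,/, $f[9];
        my @ex_end   = split /,/, $f[10];
        for my $i (0 .. $f[8] - 1) {
            if ($start < $ex_end[$i] && $end > $ex_start[$i]) {
                $found = 1;
                last LINE;    # leave both loops, ignore the remaining lines
            }
        }
    }
    return ($found, $examined);
}

# Made-up sample data: only the first line has to be examined.
my @ref_lines = (
    "GENE1\tNM_1\tchr6\t+\t100\t500\t100\t500\t2\t100,300,\t200,500,",
    "GENE2\tNM_2\tchr6\t+\t1000\t2000\t1000\t2000\t1\t1000,\t2000,",
    "GENE3\tNM_3\tchr7\t-\t100\t500\t100\t500\t1\t100,\t500,",
);
my ($found, $examined) = region_hits_exon('chr6', 150, 180, \@ref_lines);
print "overlap=$found after $examined of ", scalar @ref_lines, " lines\n";
```

With this sample the demo prints "overlap=1 after 1 of 3 lines". Regions that overlap no exon still cost a full pass, so the early exit only helps for the lines that are going to be removed.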

Do you really need to check all the lines of the second file against all the lines of the first file?

How big are the files? Can I see some of the data? Are you reading this thread anymore?

Well, I had the code pasted into my Perl IDE, so, making some assumptions, I adjusted the code. It is of course untested and I am not sure it does what you want, but I guess you can try it:

#!/usr/local/bin/perl
use strict;
use warnings;

open (INPUT, "phastCons_200_chr6.txt") or die "Can't open phastCons: $!\n";
open (INPUT2, "refFlat.txt") or die "Can't open refFlat.txt: $!";
open (OUT, ">>phastCons_noex_chr6.txt") or die "Can't open phastCons_noex: $!\n";
print "phastCons and phastCons_noex and refFlat.txt are all open\n";

OUTER: while (my $line_ph = <INPUT>) {
    my @fields_ph = split (/\s+/, $line_ph);
    seek INPUT2, 0, 0; # rewind INPUT2 for every line of INPUT - see note below
    INNER: while (my $line = <INPUT2>) {
        my @fields = split (/\s+/, $line);
        if ($fields_ph[1] eq $fields[2] && $fields_ph[2] < $fields[5] && $fields_ph[3] > $fields[4]) {
            my @ex_start = split (/,/, $fields[9]);
            my @ex_end = split (/,/, $fields[10]);
            for my $i (0 .. $fields[8]-1) {
                if ($fields_ph[2] < $ex_end[$i] && $fields_ph[3] > $ex_start[$i]) {
                    next OUTER; # overlaps an exon: skip this line and go to the next line in INPUT
                }
            }
        }
    }
    print OUT $line_ph; # no exon overlap found, so keep the line (as the original script does)
}
close INPUT;
close OUT;
close INPUT2;

Note: rewinding INPUT2 with seek may not work on some systems.
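
A different approach, not proposed in the thread, would be to load the exons once into a hash keyed by chromosome, so that each phastCons line is compared only against exons on its own chromosome. The following is a rough sketch under stated assumptions: index_exons and filter_regions are hypothetical names, the sample lines are invented, and the transcript-level test is dropped on the assumption that every exon lies inside its transcript.

```perl
use strict;
use warnings;

# Build { chromosome => [ [exon_start, exon_end], ... ] } from
# refFlat-style lines (column indices as in the scripts above).
sub index_exons {
    my (@ref_lines) = @_;
    my %exons;
    for my $line (@ref_lines) {
        my @f = split /\s+/, $line;
        my @ex_start = split /,/, $f[9];
        my @ex_end   = split /,/, $f[10];
        push @{ $exons{ $f[2] } },
            map { [ $ex_start[$_], $ex_end[$_] ] } 0 .. $f[8] - 1;
    }
    return \%exons;
}

# Keep only the phastCons-style lines that overlap no exon.
sub filter_regions {
    my ($exons, @ph_lines) = @_;
    my @kept;
    REGION: for my $line_ph (@ph_lines) {
        my @f = split /\s+/, $line_ph;
        for my $exon (@{ $exons->{ $f[1] } || [] }) {
            next REGION if $f[2] < $exon->[1] && $f[3] > $exon->[0];
        }
        push @kept, $line_ph;
    }
    return @kept;
}

# Made-up sample data standing in for refFlat.txt and the phastCons file.
my $exons = index_exons(
    "GENE1\tNM_1\tchr6\t+\t100\t500\t100\t500\t2\t100,300,\t200,500,",
);
print filter_regions(
    $exons,
    "1\tchr6\t150\t180\n",    # overlaps exon 100-200: removed
    "2\tchr6\t220\t280\n",    # falls in the intron: kept
    "3\tchr7\t150\t180\n",    # other chromosome: kept
);
```

For very large inputs, sorting the exons per chromosome and using a binary search or an interval tree would reduce the work further; that is beyond this sketch.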
