Dear All,

Can anyone help me with a script that compares the regions in one file against the regions in a second file?
For example, file 1 looks like this:

chr5    88013976    88018740

file 2 :

chr5    88013975    88018742    ENST00000340208.5
chr5    88024310    88024445    ENST00000340208.5
chr5    88025035    88025164    ENST00000340208.5

I want the output as

chr5    88013975    88018742    ENST00000340208.5

as the coordinates from file 1 fall within the coordinates of that particular region in file 2.

Here is what I tried. It gives me results, but my file is pretty big, so I need a much faster solution.
Any help is really appreciated.

#!/usr/bin/perl
$/=undef;
open(INA,$ARGV[0]);
$file1=<INA>;
open(INB,$ARGV[1]);
$file2=<INB>;
@file1=split(/\n/,$file1);
@file2=split(/\n/,$file2);
#while ($sample1=<FILE1>) 
foreach $file1(@file1)
{   
    chomp $sample1;
    #$genename1=$file1;
    #print "$genename1\n";
    @temp1 =split('\n', $file1);
        @temp1 =split('\t', $file1);
    #$genename=$temp1[0];
    $chr1=$temp1[0];
    $start1=$temp1[1];
        $end1=$temp1[2];
    @temp1=join("\t",@temp1);
    #print "$chr1\t$start1\t$end1\n";
    #while($sample2=<FILE2>)
foreach $file2(@file2)
{   
    chomp $sample2;
    #$genename1=$file1;
    #print "$genename1\n";
    @temp2 =split('\n', $file2);
        @temp2 =split('\t', $file2);
    #$genename=$temp1[0];
    $chr2=$temp2[0];
    $start2=$temp2[1];
        $end2=$temp2[2];
    #@temp2=join("\t",@temp2);
    #print "$start2\n";
    #while($sample2=<FILE2>)
if($chr1 eq $chr2 && $start1>=$start2 && $end1<=$end2 )
        {
                        print"@temp2\n";

        }
}
}

Hi Anna123,

First off, you should know that Perl can be written much more cleanly than the code in your post.
Moreover, one should ALWAYS start a Perl program with these two pragmas:

use warnings; and use strict;

Secondly, though one can hack Perl syntax together, it is a lot better to write it in a modern, cleaner way.

That said, what you need to do is open your files, read the first one into a HASH OF ARRAYS (HoA) keyed by chromosome, then open and read the second file, match the chromosome on each line against the keys of your HoA, and test whether the start from the first file falls between the start and end on that line.

Something like this can help:

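use warnings;
use strict;

# Hash of arrays: chromosome => [ start, end, ... ] taken from the first file
my $doc = {};

open my $fh, '<', $ARGV[0] or die "can't open file: $!";
while (<$fh>) {
    my @data = split ' ', $_;    # split on whitespace, ignoring leading blanks
    next unless @data >= 3;      # skip blank or incomplete lines
    push @{ $doc->{ $data[0] } }, $data[1], $data[2];
}

close $fh or die $!;

open $fh, '<', $ARGV[1] or die "can't open file: $!";
while (<$fh>) {
    my @data = split ' ', $_;
    next unless @data >= 3;
    if ( exists $doc->{ $data[0] } ) {
        # start of the first region stored for this chromosome
        my $begin = $doc->{ $data[0] }[0];
        # print the line when that start lies strictly inside its interval
        print $_ if ( $begin > $data[1] and $begin < $data[2] );
    }
}

close $fh or die $!;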

The above script should work on the sample data you gave in your original post.
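
One caveat: the script above only looks at the start of the first region stored for each chromosome. If your first file holds many regions per chromosome, something along the following lines may be closer to what you want. It is only a sketch: the names load_regions and containing_lines are made up for illustration, and it assumes whitespace-separated columns and the containment test from your own code (start1 >= start2 and end1 <= end2). The demo reads the sample data from in-memory strings so it runs without any files.

```perl
use strict;
use warnings;

# Load every region from the first file into a hash of arrays:
# chromosome => [ [start, end], [start, end], ... ]
sub load_regions {
    my ($fh) = @_;
    my %regions;
    while ( my $line = <$fh> ) {
        my ( $chr, $start, $end ) = split ' ', $line;
        next unless defined $end;    # skip blank or short lines
        push @{ $regions{$chr} }, [ $start, $end ];
    }
    return \%regions;
}

# Return the lines of the second file whose interval fully contains
# at least one region on the same chromosome.
sub containing_lines {
    my ( $regions, $fh ) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        my ( $chr, $start, $end ) = split ' ', $line;
        next unless defined $end && exists $regions->{$chr};
        for my $r ( @{ $regions->{$chr} } ) {
            if ( $r->[0] >= $start && $r->[1] <= $end ) {
                push @hits, $line;
                last;    # report each line only once
            }
        }
    }
    return @hits;
}

# Demo on the sample data from the question, read from in-memory strings.
my $file1 = "chr5\t88013976\t88018740\n";
my $file2 = join '',
    "chr5\t88013975\t88018742\tENST00000340208.5\n",
    "chr5\t88024310\t88024445\tENST00000340208.5\n",
    "chr5\t88025035\t88025164\tENST00000340208.5\n";

open my $in1, '<', \$file1 or die "can't open in-memory file: $!";
open my $in2, '<', \$file2 or die "can't open in-memory file: $!";
my @hits = containing_lines( load_regions($in1), $in2 );
print @hits;
```

Because the regions are grouped by chromosome, each line of the second file is compared only with regions on its own chromosome rather than with every line of the first file, which is where the nested loops in the original post lose most of their time.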

Note, however, that I would rather use a subroutine (called a function in some other programming languages) to implement DRY - Don't Repeat Yourself!
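
As a rough sketch of that idea, the open/read/split/close steps that appear twice in the script above could be moved into one subroutine that takes a callback. The names for_each_record and matching_lines are hypothetical, chosen only for this example; the matching test itself is unchanged from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file, split each line on whitespace, and pass
# the fields (as an array reference) plus the raw line to a callback.
sub for_each_record {
    my ( $file, $callback ) = @_;
    open my $fh, '<', $file or die "can't open file: $!";
    while ( my $line = <$fh> ) {
        my @data = split ' ', $line;
        next unless @data >= 3;    # skip blank or incomplete lines
        $callback->( \@data, $line );
    }
    close $fh or die $!;
    return;
}

# Same test as the script above, with the file handling written only once.
sub matching_lines {
    my ( $file1, $file2 ) = @_;

    # First pass: chromosome => [ start, end, ... ]
    my $doc = {};
    for_each_record( $file1, sub {
        my ($data) = @_;
        push @{ $doc->{ $data->[0] } }, $data->[1], $data->[2];
    } );

    # Second pass: keep lines whose interval strictly contains the stored start
    my @out;
    for_each_record( $file2, sub {
        my ( $data, $line ) = @_;
        return unless exists $doc->{ $data->[0] };
        my $begin = $doc->{ $data->[0] }[0];
        push @out, $line if $begin > $data->[1] and $begin < $data->[2];
    } );
    return @out;
}

# Only run when two file names are given on the command line.
if ( @ARGV == 2 ) {
    my @lines = matching_lines(@ARGV);
    print @lines;
}
```

Saved under a name of your choosing, say compare.pl, it would be run as: perl compare.pl file1 file2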
