Dear All,

I have two files. The first file looks like this:

ENSG00000000003.10  0

and the second one looks like this:

ENSG00000000460.12  24chr1  HAVANA  11869   14412   .   +   .   +   .   gene_id "ENSG00000223972.4";    transcript_id   "ENSG00000223972.4";    gene_type   "pseudogene";   gene_status "KNOWN";    gene_name   "DDX11L1";  transcript_type "pseudogene";   transcript_status   "KNOWN";

As you can see, I need to compare the first column of file 1 with the 11th column of file 2. If they match, I need output like this:

ENSG00000000003.10  0   gene_type   protein_coding
ENSG00000000005.5   0   gene_type   protein_coding

I have been trying this in Perl, but my script takes many hours because the files are huge. It would be great if someone could suggest a script that uses Perl hashes.
Thank you all in advance!

Hello Anna123,
As much as I would like to help, I really don't see a correlation between the file samples you have posted. Or am I missing something?
Could you please describe the problem in more detail, or give a larger or better data sample, and say what you are trying to match?

That said, my advice would be to open the two files, read the first file into a hash, then go over the second file one line at a time and match the corresponding records. Use a regex if need be to get the desired line or lines.

Maybe if you give more details and data, you will appreciate a description in Perl code better. It would also be good to see your own effort. Let's see if we can refactor your Perl code with arrays.
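
To illustrate the "use a regex" suggestion, here is a minimal sketch. It assumes the second file is a GTF whose attribute text looks like the sample in the question, that is, a run of `key "value";` pairs. The helper name `parse_attributes` is made up for this illustration and is not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a GTF attribute string into a hash of key => value.
# Assumes attributes are written as: key "value"; key "value"; ...
sub parse_attributes {
    my ($attr_string) = @_;
    my %attr;
    # Each /g match picks up the next key/value pair in the string.
    while ($attr_string =~ /(\w+)\s+"([^"]*)"\s*;/g) {
        $attr{$1} = $2;
    }
    return %attr;
}

# Sample attribute text copied from the question.
my $sample = 'gene_id "ENSG00000223972.4";    transcript_id   "ENSG00000223972.4";    gene_type   "pseudogene";   gene_status "KNOWN";';
my %attr = parse_attributes($sample);

# Prints the gene ID, the literal label "gene_type", and the gene type, tab-separated.
print "$attr{gene_id}\tgene_type\t$attr{gene_type}\n";
```

Looking fields up by name this way avoids depending on fixed column positions, which only hold if every line carries the same attributes in the same order.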

Hello 2teez,

Thank you for the reply and your valuable time. I have altered the GTF file using Unix awk, and it now looks like this:

chr1 HAVANA 11869 14412 . + . + . gene_id ENSG00000223972.4 transcript_id ENSG00000223972.4 gene_type pseudogene gene_status KNOWN gene_name DDX11L1 transcript_type pseudogene transcript_status KNOWN
chr1 HAVANA 11869 14409 . + . + . gene_id ENSG00000223972.4 transcript_id ENST00000456328.2 gene_type pseudogene gene_status KNOWN gene_name DDX11L1 transcript_type processed_transcript transcript_status KNOWN

As you can see, file 1 has two columns. The task is to look for a match between column 1 of file 1 and column 11 of file 2 (i.e. the GTF file above). If there is a match, I need output like this:
ENSG00000000003.10 0 gene_type protein_coding
ENSG00000000005.5 0 gene_type protein_coding

That is, the first two columns come from file 1 and the last two from file 2 (i.e. columns 14 and 15 of the GTF file above).
My Perl script does give this output, but it takes hours to produce the complete output. The program looks like this:

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl <this script> file1 file2
# file1: tab-separated, ID in column 1.
# file2: tab-separated file derived from the GTF, gene ID in column 11.
my ($path1, $path2) = @ARGV;

# Read both files fully into memory, one array element per line.
open(my $in_a, '<', $path1) or die "Cannot open $path1: $!";
chomp(my @file1 = <$in_a>);
close($in_a);

open(my $in_b, '<', $path2) or die "Cannot open $path2: $!";
chomp(my @file2 = <$in_b>);
close($in_b);

# Every line of file1 is compared against every line of file2,
# so the run time grows with (lines in file1) x (lines in file2).
foreach my $line1 (@file1)
{
    my ($peakid, @temp1) = split(/\t/, $line1);
    next unless defined $peakid;      # skip blank lines
    my $rest1 = join("\t", @temp1);

    foreach my $line2 (@file2)
    {
        my @temp2     = split(/\t/, $line2);
        my $peakid2   = $temp2[10];   # column 11: gene ID
        my $gene_type = $temp2[13];   # column 14: the label "gene_type"
        my $classi    = $temp2[14];   # column 15: the gene type value
        next unless defined $classi;  # skip short or malformed lines

        if ($peakid eq $peakid2)
        {
            print "$peakid2\t$rest1\t$gene_type\t$classi\n";
        }
    }
}
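
For comparison, here is a hash-based sketch of the same join, along the lines of the advice earlier in the thread. It assumes both files are tab-separated (as the script above does), that the gene ID is column 11 and the gene-type label and value are columns 14 and 15 of the second file, and that the first file (ID plus count) fits in memory. The sub names `load_ids` and `join_files` are invented for illustration. Each file is read only once, so the work grows with the sum of the line counts rather than their product.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read file 1 into a hash: ID (column 1) => rest of the line.
sub load_ids {
    my ($path) = @_;
    my %rest_for;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;                    # strip LF or CRLF line ending
        my ($id, @rest) = split /\t/, $line;
        next unless defined $id && length $id;   # skip blank lines
        $rest_for{$id} = join("\t", @rest);
    }
    close($in);
    return \%rest_for;
}

# Stream file 2 line by line and print a row for each ID seen in file 1.
sub join_files {
    my ($rest_for, $path, $out) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;
        my @col = split /\t/, $line;
        next if @col < 15;                       # skip short or malformed lines
        my ($id, $label, $value) = @col[10, 13, 14];
        next unless exists $rest_for->{$id};     # constant-time hash lookup
        print {$out} join("\t", $id, $rest_for->{$id}, $label, $value), "\n";
    }
    close($in);
}

# Run only when two file names are given on the command line.
if (@ARGV == 2) {
    my $rest_for = load_ids($ARGV[0]);
    join_files($rest_for, $ARGV[1], \*STDOUT);
}
```

Saved under a name such as `join_gene_type.pl` (a made-up name), it could be run as `perl join_gene_type.pl file1 file2 > output`. Two differences from the nested-loop version are worth knowing: rows come out in the order of the second file rather than the first, and if an ID appears more than once in the first file, only its last line is kept.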