Dear All,

I have two files file1 looks like this,the test file

ENSG00000000003.10  0
ENSG00000000005.5   0
ENSG00000000419.8   392
ENSG00000000457.8   24

and the GTF file like this,

chr1    HAVANA  11869   14412   .   +   .   +   .   gene_id ENSG00000223972.4   transcript_id   ENSG00000223972.4   gene_type   pseudogene  gene_status KNOWN   gene_name   DDX11L1 transcript_type pseudogene  transcript_status   KNOWN
chr1    HAVANA  11869   14409   .   +   .   +   .   gene_id ENSG00000223972.4   transcript_id   ENST00000456328.2   gene_type   pseudogene  gene_status KNOWN   gene_name   DDX11L1 transcript_type processed_transcript    transcript_status   KNOWN
chr1    HAVANA  11869   12227   .   +   .   +   .   gene_id ENSG00000223972.4   transcript_id   ENST00000456328.2   gene_type   pseudogene  gene_status KNOWN   gene_name   DDX11L1 transcript_type processed_transcript    transcript_status   KNOWN
chr1    HAVANA  12613   12721   .   +   .   +   .   gene_id ENSG00000223972.4   transcript_id   ENST00000456328.2   gene_type   pseudogene  gene_status KNOWN   gene_name   DDX11L1 transcript_type processed_transcript    transcript_status   KNOWN
chr1    HAVANA  13221   14409   .   +   .   +   .   gene_id ENSG00000223972.4   transcript_id   ENST00000456328.2   gene_type   pseudogene  gene_status KNOWN   gene_name   DDX11L1 transcript_type processed_transcript    transcript_status   KNOWN

My requiremnt is to comapre the column number one of file one to the 11 th column of file 2 and print out if matching with 13th and 14th column from the file2.

The output should look like this,

ENSG00000000003.10  0   gene_type   protein_coding
ENSG00000000005.5   0   gene_type   protein_coding
ENSG00000000419.8   392 gene_type   protein_coding
ENSG00000000457.8   248 gene_type   protein_coding

Please help me a solution using Perl Hash I have a script In Perl arrays but since my file is huge its taking long time..It would be great if I could have something in PERL HASHES.

Thank You ALL;

Recommended Answers

All 11 Replies

hello Anna123,
As much as I would like to help, I really don't see a correclation between the sample of the files you have posted. Or am I missing something?
Please, could you described or give more or better data sample and tell what you are trying to match.

That said, what I would have advised is that open the two files, get the first file into an hash data type, then go over the second file, a step at a time and match correclating dataset. Use regex if need be to get desired line or lines.

Maybe if you give more details and dataset you will apprecaite the description in perl codes better. And it will also be good to see your effort too. Let see if we can refactor your perl code with array.

Hello 2teez,

Thank you for the reply and your valuable time. I have altered the gtf file using unix awk and it will look this,

chr1 HAVANA 11869 14412 . + . + . gene_id ENSG00000223972.4 transcript_id ENSG00000223972.4 gene_type pseudogene gene_status KNOWN gene_name DDX11L1 transcript_type pseudogene transcript_status KNOWN
chr1 HAVANA 11869 14409 . + . + . gene_id ENSG00000223972.4 transcript_id ENST00000456328.2 gene_type pseudogene gene_status KNOWN gene_name DDX11L1 transcript_type processed_transcript transcript_status KNOWN

and as you could see the file1 has two columns the question is to look for the match of col1 in file1 to col11 in file2(ie the gtf file the above one).If you find a match then I need to get the output as ,

ENSG00000000003.10  0   gene_type   protein_coding

ENSG00000000005.5   0   gene_type   protein_coding

That is first two column from the file1 and last two column from file2 (ie column number 14 and column number 15 from GTF file te above one).
My perl does give this output but its takes hours to get me complete output, my PERL program looks like this,

#!usr/bin/perl
$/=undef;
open(INA,$ARGV[0]);
$file1=<INA>;
open(INB,$ARGV[1]);
$file2=<INB>;
@file1=split(/\n/,$file1);
@file2=split(/\n/,$file2);
#while ($sample1=<FILE1>) 
foreach $file1(@file1)
{
    #chomp $sample1;
    #print "$peakid\n";
    ($peakid,@temp1)=split('\t', $file1);
    #$peakid=$temp1[0];
    #$region=$temp1[3];
    #$classi=$temp1[2];
    #$cor=$temp1[0];
    @temp1=join("\t",@temp1);
        #print"$peakid\n";
    #while($sample2=<FILE2>)

    foreach $file2(@file2)
    {
        #chomp $sample2;
        @temp2 =split('\t', $file2);
        #@gtf=split(\;\,$temp2);
        $peakid2=$temp2[10];
                $gene_type=$temp2[13];
        $classi=$temp2[14];
        $region=$temp2[9];
                #print"$peakid2\n";
        @temp2=join("\t",@temp2);
        if( $peakid eq  $peakid2)
        {
            print "$peakid2\t@temp1\t$gene_type\t$classi\n";
            #print"$peakid" 



        }#else{print "$genename2\n";}
    }
}

Hi anni,
Sorry this is coming a bit late. But I still don't get what you are trying to compare in the two files. What are the values in the first file you want to compare with in the second file to get your output. You mentioned col1 in file1 and col 11 in file2 which I still don't see anything comparable except for ENSG and ENST in both files and the numbers of zeros that follows don't even compare.

I did try to use your code to see what I can get if that could be a pointer, but yet I have no output using the code you presented though there is a better way to write what you are writting in perl.

I try to re-write want you intended to see separate the columns so that you will see that am talking about.

Below is the code that separate the columns for boths files and display them. A re-write from the code you presented.

#!usr/bin/perl 
use warnings;
use strict;

my $file1 = read_file( $ARGV[0] );
my $file2 = read_file( $ARGV[1] );

#print out all the files
print map {join ($/ => @{$_}), "\n===\n\n\n"} $file1, $file2;


sub read_file {
    open my $fh, '<', shift(@_) or die "can't open file: $!";

    my @file = do { local $/; split /\s+/ => <$fh> };

    return \@file;
}

OUTPUT

ENSG00000000003.10
0
ENSG00000000005.5
0
ENSG00000000419.8
392
ENSG00000000457.8
24
===


chr1
HAVANA
11869
14412
.
+
.
+
.
gene_id
ENSG00000223972.4
transcript_id
ENSG00000223972.4
gene_type
pseudogene
gene_status
KNOWN
gene_name
DDX11L1
transcript_type
pseudogene
transcript_status
KNOWN
chr1
HAVANA
11869
14409
.
+
.
+
.
gene_id
ENSG00000223972.4
transcript_id
ENST00000456328.2
gene_type
pseudogene
gene_status
KNOWN
gene_name
DDX11L1
transcript_type
processed_transcript
transcript_status
KNOWN
chr1
HAVANA
11869
12227
.
+
.
+
.
gene_id
ENSG00000223972.4
transcript_id
ENST00000456328.2
gene_type
pseudogene
gene_status
KNOWN
gene_name
DDX11L1
transcript_type
processed_transcript
transcript_status
KNOWN
chr1
HAVANA
12613
12721
.
+
.
+
.
gene_id
ENSG00000223972.4
transcript_id
ENST00000456328.2
gene_type
pseudogene
gene_status
KNOWN
gene_name
DDX11L1
transcript_type
processed_transcript
transcript_status
KNOWN
chr1
HAVANA
13221
14409
.
+
.
+
.
gene_id
ENSG00000223972.4
transcript_id
ENST00000456328.2
gene_type
pseudogene
gene_status
KNOWN
gene_name
DDX11L1
transcript_type
processed_transcript
transcript_status
KNOWN
===

Maybe if you can attach your file sample, one can see clearly what you wanted to compare and output.

Hi ,

Thank you for the time and effort. I am located in Germany thats why the delay in replies.I am tryingto annotate the first column in file 1 with second file.And I am a beginner in PERL, so the script looks so lentghy

That is the first column of the file 1 is GENEID(ENSG)So as you can see in the file 2 the the column number 11(starts with ENSG) is also geneid. And the column number 13 and 14 in file2 is there genetype.

So the question is to find if there is any matching geneid present in file2 from file1. for example in first row of file1 is

ENSG00000000003.10 0

And first row of file2 is,

chr1 HAVANA 11869 14412 . + . + . gene_id ENSG00000000003.10 transcript_id ENSG00000223972.4 gene_type pseudogene gene_status KNOWN gene_name DDX11L1 transcript_type pseudogene transcript_status KNOWN
So as you can see the first column of file1 matche sto 10th column in file2 and hence I want the ouput as,

ENSG00000000003.10 0 gene_type pseudogene

That is the corresponding gene_type is added as a extra column to file1 in output.

Please let me know if you want more information. And it would be extremely great if you could help me.The thing is my script give me this output but its taking long time.

Tahnk you.!!

Hi anni,
Using the modified dataset you sent. You will observed that there is a correclation between the two files you presented.

Using hash then, is simple to come-by like I mentioned in my first post.

However, before showing you that, let's take a look at the code you posted. Though you tired since you just started using perl. And of course, in perl there is more than one way to do it, however several of those are not efficient and effective.

Use both warnings and strict It will help keep you from working several hours looking for a typo! In your script.

Secondly, when you open a file in perl you want to check the return value whether it fails or not. Then you should be using 3-argument open function. Don't use bareword filehandle, instead use a lexical scoped filehandles.

You could want to check perlintro from the command line interface on your system. like this: perldoc perlintro.

With all that said, the following should give you what you want.
Open both files, but the values in the first file in an hash data type, then step through the second one then print all the values you wanted if a match is located. That simple. Like thus:

#!usr/bin/perl 
use warnings;
use strict;

# make sure the user gives two files from the CLI
# else get an error message
die "You must provide two files" unless @ARGV == 2;

my %gen_id;

# open the first file, split and save in an hash
open my $fh, '<', $ARGV[0] or die "can't open file: $!";
while (<$fh>) {
    my ( $id, $value ) = split;
    $gen_id{$id} = $value;
}
close $fh or die "can't close file: $!";

#open the second file, then step through it and find
#match values to display
open $fh, '<', $ARGV[1] or die "can't open file: $!";
while (<$fh>) {

    # get only values in indexes 11,13 & 14,
    # starting from 0
    foreach my $file ( [split] ) {
        print
          join( "\t" => $file->[10], $gen_id{ $file->[10] }, @$file[ 13, 14 ] ),
          $/
          if exists $gen_id{ $file->[10] };

    }
}
close $fh or die "can't close file: $!";

Yes, it does work, but why repeat ourselves?
since the open function does the same thing we can write it once and use it as many times as possible. You can replace the foreach loop with map, shorten the whole scripts with several lines.

Hope this helps.

Hello,

Thank you soo much for the hash script this works faster and give sme output but the thing is it is giving me a lot of duplicates,say for example the output looks like this,

ENSG00000223972.4   0   gene_type   pseudogene
ENSG00000223972.4   0   gene_type   pseudogene
ENSG00000223972.4   0   gene_type   pseudogene

That is its printing the ENSG* as much as time it is repeating in GTF file. 59751 lines to annotate and after runing the program I have a output file with 2608585 lines tht means I have a lot of duplicated lines may I know How could I remove it in the perl program I know I could use awk piped in perl command line but if there is a efficinet way out please let me know

Hi anni,

How would you make the change you want, to get the output you so desired? Please give it a trial, then I will help if you are not getting it.

Hi anni,
I want to believe you are not making an head with get the desire output.

What you need do is use an hash, since hash can't have duplicated keys. Change the last for loop to use an hash instead of printing matched string like thus:

#open the second file, then step through it and find
#match values to display

my %get_data;

open $fh, '<', $ARGV[1] or die "can't open file: $!";
while (<$fh>) {

    # get only values in indexes 11,13 & 14,
    # starting from 0
    foreach my $file ( [split] ) {
        if ( exists $gen_id{ $file->[10] } ) {

            $get_data{"$file->[10]\t$gen_id{ $file->[10] }"} =
              join "\t" => @$file[ 13, 14 ];
        }

    }
}
close $fh or die "can't close file: $!";

print $_, "\t", $get_data{$_}, $/ for sort keys %get_data;

Get what you wanted into the hash and use another for loop to print out what you wanted. Please note the usage of %get_data.

Hope this helps.

Hi ,

thank you so much.!
I ahve tried not in perl program its jsut by piping awk!x[$0]++ to the perl command line. And it helped, but I didnt change on the script. Thank so much for the assistance I will try with the new code.

Okay try it out and let's know what you come up with, but it should work I believe. However, note that my last post is a modified part of the post before it.

If it all works as you suppose, please mark this post solved.

Thanks.

Thank you so much and I marked as solved.
Perl annotation.pl file1.txt file_2.gtf | awk !x[$0]++ >output
Will also remove duplicates

Thanks a lot for your efficent and great support.!!

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.