File Parsing: comparing columns within a file

Question

Anna123 0 Light Poster

9 Years Ago

Dear All,

It would be great if someone could help me rearrange the following file based on the similarity between the columns.
My file looks like this:

CD79B-GH1   ID* 3/2
CD79B-GH1   ID  3/3
CIRBP-C19orf24  RM  2/6
CIRBP-C19orf24  RM* 4/4
DCAKD-NCL   RL  1/2
HMGB2-EHMT2 RL* 5/3
KANSL1-ARL17A   IM  5/13
KANSL1-ARL17B   IM* 4/13
LSM7-PARP6  RE  2/2
RASSF4-ZNF22    RE* 2/1
DCAKD-NCL   RL  1/2
HMGB2-EHMT2 RL* 5/3
KANSL1-ARL17A   IM  5/13
KANSL1-ARL17B   IM* 4/13
LSM7-PARP6  RE  2/2
RASSF4-ZNF22    RE* 2/1

I want to get this file into a format where the second column forms a header row without duplicates. My output should therefore look like this:

ID ID* RM RM*  RL RL* RE RE* IM IM*


CD79B-GH1  3/2 0   0  0   0  0   0   0   0  0
CD79B-GH1    0  3/3 0 0u 0 00 0 0
CIRBP-C19orf240 0 2/6 0
-C19orf24 0 0 0 4/4

Fill in every new column like this, based on the respective values, and put a zero wherever a row has no value.

Thanks a lot in advance for help.

perl

3 Contributors
15 Replies
228 Views
4 Days Discussion Span
Latest Post 9 Years Ago by Anna123

All 15 Replies

Sky Diploma 571 Practically a Posting Shark

9 Years Ago

Thanks a lot in advance for help.

We can surely help you out. Put in some effort and let us know where you're having trouble. Just posting the question and expecting an answer is not what this forum is all about.

Edited 9 Years Ago by Sky Diploma

2teez 43 Posting Whiz

9 Years Ago

Hi anni,
Using the code you posted, I can't reproduce your output, because the dataset you posted has no tab characters in it, at least as shown on the forum.

However, using an algorithm similar to yours, I could produce what I suppose you wanted, given that your desired output is not displayed well. You probably should have put it in a code tag.

In your code, you were splitting on a "\t" character that does not appear in your dataset as shown, so your hash was actually empty.
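
To make the point about the separator concrete, here is a minimal sketch, assuming a line whose columns are separated by runs of spaces as in the dataset shown on the forum. The variable names are only for illustration. Splitting on a tab returns the whole line as one field, so a loop that starts at index 1 never runs, while a whitespace split finds all three columns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line as it appears on the forum: columns separated by runs of
# spaces, not by tabs (an assumption about the real file).
my $line = "CD79B-GH1   ID* 3/2\n";

# Splitting on a literal tab finds no separator, so the whole line
# (newline included) comes back as a single field.
my @by_tab = split /\t/, $line;

# split with no arguments splits $_ on any run of whitespace and drops
# leading whitespace and the trailing newline.
my @by_space = do { local $_ = $line; split };

print scalar(@by_tab),   " field(s) when splitting on tab\n";
print scalar(@by_space), " field(s) when splitting on whitespace\n";
```

If the real file is tab-delimited, both forms give three fields, which is why a whitespace split is the more forgiving choice here.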

Below is how I did it.

#!/usr/bin/perl -l
use warnings;
use strict;

my %data;

while (<DATA>) {
    my @row = split;
    $data{ $row[1] } = [ @row[ 0, 2 ] ];
}

my @heading;
print "Data name\t", join( "\t", ( @heading = sort keys %data ) );

for my $title ( 0 .. $#heading ) {
    print $data{ $heading[$title] }[0], "\t", "0\t" x $title,
      $data{ $heading[$title] }[1], "\t0" x ( $#heading - $title );
}

__DATA__
CD79B-GH1   ID* 3/2
CD79B-GH1   ID  3/3
CIRBP-C19orf24  RM  2/6
CIRBP-C19orf24  RM* 4/4
DCAKD-NCL   RL  1/2
HMGB2-EHMT2 RL* 5/3
KANSL1-ARL17A   IM  5/13
KANSL1-ARL17B   IM* 4/13
LSM7-PARP6  RE  2/2
RASSF4-ZNF22    RE* 2/1
DCAKD-NCL   RL  1/2
HMGB2-EHMT2 RL* 5/3
KANSL1-ARL17A   IM  5/13
KANSL1-ARL17B   IM* 4/13
LSM7-PARP6  RE  2/2
RASSF4-ZNF22    RE* 2/1

OUTPUT

Data name   ID  ID* IM  IM* RE  RE* RL  RL* RM  RM*
CD79B-GH1   3/3 0   0   0   0   0   0   0   0   0
CD79B-GH1   0   3/2 0   0   0   0   0   0   0   0
KANSL1-ARL17A   0   0   5/13    0   0   0   0   0   0   0
KANSL1-ARL17B   0   0   0   4/13    0   0   0   0   0   0
LSM7-PARP6  0   0   0   0   2/2 0   0   0   0   0
RASSF4-ZNF22    0   0   0   0   0   2/1 0   0   0   0
DCAKD-NCL   0   0   0   0   0   0   1/2 0   0   0
HMGB2-EHMT2 0   0   0   0   0   0   0   5/3 0   0
CIRBP-C19orf24  0   0   0   0   0   0   0   0   2/6 0
CIRBP-C19orf24  0   0   0   0   0   0   0   0   0   4/4

Hope this helps.

Edited 9 Years Ago by 2teez

2teez 43 Posting Whiz

9 Years Ago

Hi anni,

Actually, my data file is quite big, so I have opened it via a filehandle

That is what you should do, but you are not doing it the right way.
Don't use a bareword as the filehandle, as you are doing with the open function.
Instead, use a lexical filehandle and the three-argument form of open, and check the return value, like this:

open my $fh, '<', $filename or die "can't open file: $!";

You can get the name of the file from the command line, like this:

my $filename = $ARGV[0];

or you can use the command-line argument or the file name directly in the open call.
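
Putting those two lines together, a minimal sketch might look like the following. The helper name read_rows is hypothetical and not part of the original script. It opens the file with a lexical filehandle and three-argument open, checks the return value, and returns one array reference of fields per line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_rows is a hypothetical helper name used only for this sketch.
sub read_rows {
    my ($filename) = @_;

    # Lexical filehandle, three-argument open, checked return value.
    open my $fh, '<', $filename or die "can't open '$filename': $!";
    my @rows;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @rows, [ split ' ', $line ];    # split on any whitespace
    }
    close $fh or die "can't close '$filename': $!";
    return @rows;
}

# Only read a file when a name was actually given on the command line.
if (@ARGV) {
    my @rows = read_rows( $ARGV[0] );
    print scalar(@rows), " line(s) read from $ARGV[0]\n";
}
```

It would be run as perl script.pl file.txt, where both names are placeholders.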

and hence by this script(the same you provided).

Yes, what you use is similar, but not quite the same.

Check the shebang line, the very first line: you are using -w and then also use warnings;. You should use one or the other. use warnings is lexically scoped and can be switched off for a block with no warnings, while -w turns warnings on globally.
Check how I use my shebang line. I use -l, which enables automatic line-ending processing, among other things. One of the things it does is append "\n" to every print, so I had no reason to write print "bla..bla..bla, \n";
Please also check the perlrun documentation by running perldoc perlrun from your CLI. It will show you how to use those switches.
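
To see what -l does to print without changing the shebang line, here is a small sketch. The helper name render is hypothetical. It sets the output record separator $\ by hand, which is the variable that -l assigns, and captures the printed text in a string so the difference is visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# render is a hypothetical helper: it prints each item to an in-memory
# filehandle using the given output record separator and returns the text.
sub render {
    my ( $ors, @items ) = @_;
    my $out = '';
    open my $fh, '>', \$out or die "can't open in-memory handle: $!";
    {
        # -l on the shebang line sets $\ to the current value of $/
        # (a newline by default); local limits the change to this block.
        local $\ = $ors;
        print {$fh} $_ for @items;
    }
    close $fh or die "can't close in-memory handle: $!";
    return $out;
}

print render( undef, 'a', 'b' ), "\n";    # no separator: items run together
print render( "\n",  'a', 'b' );          # like -l: one item per line
```

This also suggests why a script run without -l, and without "\n" in its print statements, produces all of its rows on one line.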

Of course, my script is only a guide, and the output is as you expected, I suppose.

Hope this helps.

Edited 9 Years Ago by 2teez

2teez 43 Posting Whiz

9 Years Ago

hi anni,

I found one more thing in the output: it is not printing all the rows from the file. My file has 166 lines, so the output should contain that many lines. It is omitting certain lines,

Unfortunately, I don't have your file. Using just the dataset you provided, the sample code worked fine, so I suppose you should adopt it.

If you wish, you could attach your file; maybe the sample you gave is not in line with the real dataset.

and I have the following error
Use of uninitialized value in print at test.pl line 12, <DATA> line 12.
Use of uninitialized value in print at test.pl line 12, <DATA> line 12

From the error message you posted, it is obvious you are still using the bareword DATA as your filehandle, as shown in my original post, and you are not opening a filehandle with the open function as I suggested.

Secondly, I suspect your dataset has blank lines in it, like this:

CIRBP-C19orf24  RM* 4/4

DCAKD-NCL   RL  1/2
HMGB2-EHMT2 RL* 5/3

which you didn't show in the test dataset you gave. To get rid of them, you could include this line, probably as the first line, in your while loop:

next if /^\s+$/;

That moves on to the next line if the current line is blank, which should solve the error you are getting.
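
As a rough sketch of the same idea, the helper below (parse_rows is a hypothetical name) skips blank lines and also skips lines with fewer than three columns. A blank line leaves @row empty, so its elements are undefined and can lead to "uninitialized value" warnings later on. The extra column check is an addition of this sketch, not something the original script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_rows is a hypothetical helper: it takes the lines of the file and
# returns [name, type, value] triples, skipping lines that cannot be used.
sub parse_rows {
    my @lines = @_;
    my @rows;
    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip empty and whitespace-only lines
        my @row = split ' ', $line;
        next if @row < 3;             # skip lines with missing columns
        push @rows, [ @row[ 0 .. 2 ] ];
    }
    return @rows;
}

my @rows = parse_rows(
    "CIRBP-C19orf24  RM* 4/4\n",
    "\n",
    "DCAKD-NCL   RL  1/2\n",
    "HMGB2-EHMT2 RL* 5/3\n",
);
print scalar(@rows), " usable row(s)\n";
```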

The script I am using now is,

Please write the Perl script properly. Use the open function as I showed in my previous post, and understand why. Please ask if you don't understand it, or check perldoc -f open from your CLI.

Hope this helps.

Edited 9 Years Ago by 2teez


Anna123 0 Light Poster · Answer 1 · 2014-05-03T06:14:54+00:00

Thank you so much for the reply. I am new to Perl. Here is what I tried:

#!/usr/bin/perl -w

my %data; 
my @names; 
while (<>) {
    chomp;
    my @list=split(/\t/); ## Collect the elements of this line
    for (my $i=1; $i<=$#list; $i++) {

        if ($.==1) {
            $names[$i]=$list[$i];
        }
        ## If it is not the 1st line, collect the data
        else {
            push @{$data{$names[$i]}}, $list[$i];
        }
    }
}
foreach (@names){
    local $"="\t"; ## print tab separated lists
    print "$_\t@{$data{$_}}\n";
}

but this takes the second and third columns as rows and prints them out. I need to get the output as:

ID ID* RM RM* RL RL* RE RE* IM IM*
CD79B-GH1 3/2 0 0 0 0 0 0 0 0 0
CD79B-GH1 0 3/3 0 0u 0 00 0 0
CIRBP-C19orf240 0 2/6 0
-C19orf24 0 0 0 4/4

It would be a great help if you could help me from here.

Anna123 0 Light Poster · Answer 2 · 2014-05-04T06:16:24+00:00

Dear Teez,
Thank you so much for your time and reply.
Actually, my data file is quite big, so I have opened it via a filehandle, and hence I am using this script (the same one you provided).

#!/usr/bin/perl -w

use warnings;
use strict;
my %data;
open(DATA, "<file.txt") or die "Couldn't open file file.txt, $!";
while (<DATA>) {
    my @row = split;
    $data{ $row[1] } = [ @row[ 0, 2 ] ];
}

my @heading;
print "Data name\t", join( "\t", ( @heading = sort keys %data ) );

for my $title ( 0 .. $#heading ) {
   print $data{ $heading[$title] }[0], "\t", "0\t" x $title,
     $data{ $heading[$title] }[1], "\t0" x ( $#heading - $title );
}


I get the following output, which does not use the actual column 2 for the header and does not match the other two columns.

Data name ID ID* IM* RE_tr RE_tr* RL RM RM_B RM_B*KANSL1-ARL17B 4/13 0 0 0 0 00 0 0ZNF606-C19orf18 0 2/6 0 0 0 0 0 0 0TRIM52-GNB2L1 0 0 1/6 00 0 0 0 0SMG5-PAQR6 0 0 0 2/11 0 0 0 0 0ZEB2-ARHGAP15 0 0 00 1/1 0 0 0 0HSPA4-hsa-mir-6723 0 0 0 0 0 13/3 0 0 0RETN-ATAD2 00 0 0 0 0 5/8 0 0TMEM56-RWDD3 0 0 0 0 0 0 0 2/1 0TYMP-SCO2 0 0 0 0 0 0 0 0 3/16

Anna123 0 Light Poster · Answer 3 · 2014-05-04T08:40:20+00:00

Yes Thank you it helped..:)
Thank you so much..!!

Anna123 0 Light Poster · Answer 4 · 2014-05-04T14:45:31+00:00

Hi Teez,

I found one more thing in the output: it is not printing all the rows from the file. My file has 166 lines, so the output should contain that many lines.
It is omitting certain lines,
and I get the following error

Use of uninitialized value in print at test.pl line 12, <DATA> line 12.
Use of uninitialized value in print at test.pl line 12, <DATA> line 12

The script I am using now is,

#!/usr/bin/perl -l
use warnings;
use strict;
my $filename = $ARGV[0];
my %data;
while (<$filename>) {
my @row = split;
$data{ $row[1] } = [ @row[ 0, 2 ] ];
}
my @heading;
print "Data name\t", join( "\t", ( @heading = sort keys %data ) );
for my $title ( 0 .. $#heading ) {
print $data{ $heading[$title] }[0], "\t", "0\t" x $title,
$data{ $heading[$title] }[1], "\t0" x ( $#heading - $title );
}

Anna123 0 Light Poster · Answer 5 · 2014-05-04T16:56:48+00:00

Dear Teez,
It would be extremely great if you could help me sort this out. I am kind of stuck here.
My file is in tab-delimited format, with three columns and no blank lines in between.
I have tried the open function like this:

my $filename = $ARGV[0];
while($filename)
It did not open or read the file as I wanted, so I used the filehandle.
These are a few lines from the file; I have a total of 166 lines.

9B-GH1 ID* 3/2
9B-GH1 ID* 3/3
P-C19orf24 ID* 2/6
CIRBP-C19orf24 ID* 4/4
DKD-NCL ID* 1/2
HMGB2-EHMT2 ID* 5/3
KANSL1-ARL17A ID 5/13
KANSL1-ARL17B ID 4/13
L7-PARP6 ID* 2/2
RASSF4-ZNF22 ID* 2/1
UQCRQ-LEAP2 ID* 2/2
Z06-C19orf18 ID* 2/6
C19orf59-TRAPPC5 RM 1/3
HBA1-HBB RM 3/6
HBA1-hsa-mir-6723 RM 3/3
HBA1-MMP9 RM 3/3
HBA2-HBB RM 5/3
HBA2-hsa-mir-6723 RM 3/11
HBA2-hsa-mir-6723 RM 3/2
HBA2-hsa-mir-6723 RM 3/7
HBB-HBA1 RM 12/2
HBB-HBA1 RM 5/2
HBB-HBA1 RM 6/3
HBG1-HBB RM 5/2
RETN-ATAD2 RM 5/8
HSPA4-hsa-mir-6723 RL 13/3
CARKD-ING1 RM_B* 1/6
FAM117A-SLC35B1 RM_B 3/9
HBB-HBA2 RM_B 11/2
HBB-HBA2 RM_B 2/13
HBB-HBA2 RM_B 2/4
HBB-HBA2 RM_B 5/2
HBB-HBA2 RM_B 5/3
TMEM56-RWDD3 RM_B 2/1
TYMP-SCO2 RM_B* 1/1
TYMP-SCO2 RM_B* 3/16
AKAP8L-AKAP8 RE_tr* 2/3
AKAP8L-AKAP8 RE_tr* 4/2
BAIAP2L2-SLC16A8 RE_tr* 5/22
BAIAP2L2-SLC16A8 RE_tr* 7/26
BPTF-LRRC37A2 RE_tr 1/3
BPTF-LRRC37A3 RE_tr 1/3
C15orf57-CBX3 RE_tr 2/3
CHCHD10-VPREB3 RE_tr* 1/1
CHCHD10-VPREB3 RE_tr* 2/4
CTBS-GNG5 RE_tr* 2/23
CTBS-GNG5 RE_tr* 4/20
DHRS1-RABGGTA RE_tr* 12/41
DHRS1-RABGGTA RE_tr* 3/3
DHRS1-RABGGTA RE_tr* 3/5
DHRS1-RABGGTA RE_tr* 7/25
EIF3D-KAT6B RE_tr* 2/2
ADSL-SGSM3 IM* 3/1
ADSL-SGSM3 IM* 3/16
ADSL-SGSM3 IM* 6/17
C7orf50-KLF2 IM* 1/3
DEF6-PPARD IM* 1/6
DEF6-PPARD IM* 2/2
ERGIC1-RPL26L1 IM* 1/4
PLCG2-7SKxxx111xxxxx IM* 2/2
POLR2J-ALKBH4 IM* 2/1
PRKAA1-TTC33 IM* 2/9
SDF4-NADK IM* 1/1

Anna123 0 Light Poster · Answer 6 · 2014-05-04T18:53:22+00:00

Hi again,
I have been using the following script, which prints a few lines from the file as output, but not all of them:

    #!/usr/bin/perl -l
    use warnings;
    use strict;
    my $filename = $ARGV[0];
    open my $fh, '<', $filename or die "can't open file: $!";
    my %data;
    while (<$fh>) {
    my @row = split("\t");
    $data{ $row[1] } = [ @row[ 0, 2] ];
    }
    my @heading;
    print "Data name\t", join( "\t", ( @heading = sort keys %data ) );
    foreach my $title ( 0 .. $#heading ) {
    print $data{ $heading[$title] }[0], "\t", "0\t" x $title,
    $data{ $heading[$title] }[1], "\t0" x ( $#heading - $title );
    }

It prints only certain lines from the input file: it starts from the seventh row of column 1, then the 15th row, and so on. The output looks like this:

Data name ID ID* IM* RE_tr RE_tr* RL RM RM_B RM_B*
KANSL1-ARL17B 4/13
0 0 0 0 0 0 0 0
ZNF606-C19orf18 0 2/6
0 0 0 0 0 0 0
TRIM52-GNB2L1 0 0 1/6
0 0 0 0 0 0
SMG5-PAQR6 0 0 0 2/11
0 0 0 0 0
ZEB2-ARHGAP15 0 0 0 0 1/1
0 0 0 0
HSPA4-hsa-mir-6723 0 0 0 0 0 13/3
0 0 0
RETN-ATAD2 0 0 0 0 0 0 5/8
0 0
TMEM56-RWDD3 0 0 0 0 0 0 0 2/1
0
TYMP-SCO2 0 0 0 0 0 0 0 0 3/16

Thanks a lot for looking into this.!!

2teez 43 Posting Whiz · Answer 7 · 2014-05-04T22:44:49+00:00

Hi anni,

Need I say, you gave the wrong test data, so the algorithm and suggestions I gave did not match what you really intended to do.

Several data names in your real file recur with different values, yet in the test data you gave, a recurring data name always had the same value, and in your first post you said: "I wanted to get this file in this format where the second column will form a row without duplicates".

Secondly, though the code used here is basically the same as what we have been using, we changed how the data is collected for later output.
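
To illustrate why the way the data is collected matters, here is a minimal sketch using three rows taken from your latest dataset. The hash names %last_only and %all are only for illustration. Plain assignment keeps just the last row seen for each heading, which is why most of the 166 lines went missing, while push onto a nested array keeps every value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three rows from the posted dataset that share the same second column.
my @rows = (
    [ 'HBB-HBA1', 'RM', '12/2' ],
    [ 'HBB-HBA1', 'RM', '5/2' ],
    [ 'HBG1-HBB', 'RM', '5/2' ],
);

my ( %last_only, %all );
for my $row (@rows) {
    my ( $name, $type, $value ) = @$row;

    # Plain assignment: each new row replaces the previous one for $type.
    $last_only{$type} = [ $name, $value ];

    # push onto a nested array: every value is kept, grouped by type and name.
    push @{ $all{$type}{$name} }, $value;
}

print "assignment keeps: @{ $last_only{RM} }\n";
for my $name ( sort keys %{ $all{RM} } ) {
    print "push keeps: $name => @{ $all{RM}{$name} }\n";
}
```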

Using the most recent dataset you gave, I have this output:

Data name        ID ID* IM* RE_tr   RE_tr*  RL  RM  RM_B    RM_B*
KANSL1-ARL17A       5/13    0   0   0   0   0   0   0   0
KANSL1-ARL17B        4/13   0   0   0   0   0   0   0   0
P-C19orf24       0  2/6 0   0   0   0   0   0   0
L7-PARP6         0  2/2 0   0   0   0   0   0   0
9B-GH1       0  3/2 0   0   0   0   0   0   0
9B-GH1       0  3/3 0   0   0   0   0   0   0
DKD-NCL      0  1/2 0   0   0   0   0   0   0
UQCRQ-LEAP2      0  2/2 0   0   0   0   0   0   0
RASSF4-ZNF22         0  2/1 0   0   0   0   0   0   0
CIRBP-C19orf24       0  4/4 0   0   0   0   0   0   0
HMGB2-EHMT2      0  5/3 0   0   0   0   0   0   0
Z06-C19orf18         0  2/6 0   0   0   0   0   0   0
SDF4-NADK        0  0   1/1 0   0   0   0   0   0
POLR2J-ALKBH4        0  0   2/1 0   0   0   0   0   0
ERGIC1-RPL26L1       0  0   1/4 0   0   0   0   0   0
DEF6-PPARD       0  0   1/6 0   0   0   0   0   0
DEF6-PPARD       0  0   2/2 0   0   0   0   0   0
PRKAA1-TTC33         0  0   2/9 0   0   0   0   0   0
PLCG2-7SKxxx111xxxxx         0  0   2/2 0   0   0   0   0   0
ADSL-SGSM3       0  0   3/1 0   0   0   0   0   0
ADSL-SGSM3       0  0   3/16    0   0   0   0   0   0
ADSL-SGSM3       0  0   6/17    0   0   0   0   0   0
C7orf50-KLF2         0  0   1/3 0   0   0   0   0   0
C15orf57-CBX3        0  0   0   2/3 0   0   0   0   0
BPTF-LRRC37A3        0  0   0   1/3 0   0   0   0   0
BPTF-LRRC37A2        0  0   0   1/3 0   0   0   0   0
CTBS-GNG5        0  0   0   0   2/23    0   0   0   0
CTBS-GNG5        0  0   0   0   4/20    0   0   0   0
BAIAP2L2-SLC16A8         0  0   0   0   5/22    0   0   0   0
BAIAP2L2-SLC16A8         0  0   0   0   7/26    0   0   0   0
DHRS1-RABGGTA        0  0   0   0   12/41   0   0   0   0
DHRS1-RABGGTA        0  0   0   0   3/3 0   0   0   0
DHRS1-RABGGTA        0  0   0   0   3/5 0   0   0   0
DHRS1-RABGGTA        0  0   0   0   7/25    0   0   0   0
CHCHD10-VPREB3       0  0   0   0   1/1 0   0   0   0
CHCHD10-VPREB3       0  0   0   0   2/4 0   0   0   0
EIF3D-KAT6B      0  0   0   0   2/2 0   0   0   0
AKAP8L-AKAP8         0  0   0   0   2/3 0   0   0   0
AKAP8L-AKAP8         0  0   0   0   4/2 0   0   0   0
HSPA4-hsa-mir-6723       0  0   0   0   0   13/3    0   0   0
RETN-ATAD2       0  0   0   0   0   0   5/8 0   0
C19orf59-TRAPPC5         0  0   0   0   0   0   1/3 0   0
HBA1-hsa-mir-6723        0  0   0   0   0   0   3/3 0   0
HBA2-hsa-mir-6723        0  0   0   0   0   0   3/11    0   0
HBA2-hsa-mir-6723        0  0   0   0   0   0   3/2 0   0
HBA2-hsa-mir-6723        0  0   0   0   0   0   3/7 0   0
HBA1-MMP9        0  0   0   0   0   0   3/3 0   0
HBA2-HBB         0  0   0   0   0   0   5/3 0   0
HBG1-HBB         0  0   0   0   0   0   5/2 0   0
HBB-HBA1         0  0   0   0   0   0   12/2    0   0
HBB-HBA1         0  0   0   0   0   0   5/2 0   0
HBB-HBA1         0  0   0   0   0   0   6/3 0   0
HBA1-HBB         0  0   0   0   0   0   3/6 0   0
HBB-HBA2         0  0   0   0   0   0   0   11/2    0
HBB-HBA2         0  0   0   0   0   0   0   2/13    0
HBB-HBA2         0  0   0   0   0   0   0   2/4 0
HBB-HBA2         0  0   0   0   0   0   0   5/2 0
HBB-HBA2         0  0   0   0   0   0   0   5/3 0
TMEM56-RWDD3         0  0   0   0   0   0   0   2/1 0
FAM117A-SLC35B1      0  0   0   0   0   0   0   3/9 0
CARKD-ING1       0  0   0   0   0   0   0   0   1/6
TYMP-SCO2        0  0   0   0   0   0   0   0   1/1
TYMP-SCO2        0  0   0   0   0   0   0   0   3/16

Which, I suppose, is the output you are looking for.

This is what the code looks like:

#!/usr/bin/perl
use warnings;
use strict;

my %data;

open my $fh, '<', $ARGV[0] or die "can't open file: $!";
while (<$fh>) {
    my @row = split;
    push @{ $data{ $row[1] }{ $row[0] } } => $row[2];
}
close $fh or die "can't close file: $!";

my @heading;
print "Data name\t\t ", join( "\t", ( @heading = sort keys %data ) ), $/;

for my $title ( 0 .. $#heading ) {
    for my $value ( keys %{ $data{ $heading[$title] } } ) {
        print map {
            $value, "\t\t ", "0\t" x $title, $_, "\t0" x ( $#heading - $title ),
              $/
        } @{ $data{ $heading[$title] }{$value} };
    }
}

Please note that several things have changed significantly.
You should also read perldsc using the perldoc command from your CLI. It will help you understand data structures in Perl.

Anna123 0 Light Poster · Answer 8 · 2014-05-05T07:49:56+00:00

Hi Teez,

I am sorry, somehow when I copied it here the data came out looking like that. And of course the second script, for larger datasets, is working. I am using Ubuntu, and I can't find the package to read perldoc and its description; maybe I will look at online tutorials.
Thank you so much for your effective help and valuable time.
A million thanks..:)

Anna123 0 Light Poster · Answer 9 · 2014-05-05T13:44:08+00:00

Hi Teez,

There is one more thing in the output to be looked into. As you can see, many rows from the file repeat. It prints all of them as separate rows, but I want each in one single row in the output.

For example, the data name DHRS1-RABGGTA in column 1 repeats more than once, and it is printed in 4 rows.

DHRS1-RABGGTA 0 0 12/41 0 0 0 0 0 0 0
DHRS1-RABGGTA 0 0 0 7/25 0 0 0 0 0 0
DHRS1-RABGGTA 0 0 0 3/3 0 0 0 0 0 0
DHRS1-RABGGTA 0 0 0 0 0 3/5 0 0 0 0

Instead of this, I want it in a single row like this:

DHRS1-RABGGTA 0 0 12/41 7/25 3/3 0 0 3/5 0 0 0

And the same for all repeats of this kind.
Thank you

2teez 43 Posting Whiz · Answer 10 · 2014-05-05T21:41:02+00:00

Hi anni,

You have all you need. Play around with the last script you have. You should be able to get that done as an exercise.

It is really not difficult.
Hint: Check the usage of the map function.

Read up on the connection between a for loop and map.
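
As one possible way to approach the exercise, and not the only one, here is a sketch that groups the values by data name first and then uses map to build the cells of each row. The helper name one_row_per_name is hypothetical. Joining several values under the same heading with a comma is an assumption of this sketch, because the thread does not say how they should be combined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# one_row_per_name is a hypothetical helper. It takes the column headings
# and a hash of the form $by_name{name}{type} = [values], and returns one
# tab-separated line per name.
sub one_row_per_name {
    my ( $heading, $by_name ) = @_;
    my @lines;
    for my $name ( sort keys %$by_name ) {

        # map builds the list of cells in one expression; the equivalent
        # for loop would push one cell per heading onto an array.
        my @cells = map {
            my $values = $by_name->{$name}{$_};
            $values ? join( ',', @$values ) : 0;    # assumption: comma-join
        } @$heading;
        push @lines, join( "\t", $name, @cells );
    }
    return @lines;
}

# A few rows from the posted dataset, grouped by name and then by type.
my %by_name;
for my $row (
    [ 'DHRS1-RABGGTA', 'RE_tr*', '12/41' ],
    [ 'DHRS1-RABGGTA', 'RE_tr*', '3/3' ],
    [ 'EIF3D-KAT6B',   'RE_tr*', '2/2' ],
    [ 'KANSL1-ARL17A', 'ID',     '5/13' ],
  )
{
    my ( $name, $type, $value ) = @$row;
    push @{ $by_name{$name}{$type} }, $value;
}

my @heading = ( 'ID', 'RE_tr*' );
print join( "\t", 'Data name', @heading ), "\n";
print "$_\n" for one_row_per_name( \@heading, \%by_name );
```

The only structural change from the last script is that the outer hash key is the data name instead of the heading, so each name produces exactly one line.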

Anna123 0 Light Poster · Answer 11 · 2014-05-07T14:40:15+00:00

Hi Teez,
Thanks for the support and help.
Yes, it helped me..!!!
