Hi,
I have 3-column data (below) from an Excel file. How can I write Perl code to sort all rows by the first number of the numeric interval, and by whether the word "complement" appears before the interval, as shown in the expected output below?

Data:

81228..81500    Gene 22 
81500..81856    Gene 17 
complement(82256..84292)    Gene 75 
84307..85275    Gene 23  2.7.4.8
complement(97435..98067)    Gene 91 
complement(85325..86527)    Gene 34 3.5.1.32
86756..87025    Gene 36 
complement(92373..93341)    Gene 45 
88076..90292    Gene 34 
complement(90289..92415)    Gene 89  3.6.1.-
93492..94931    Gene 92  2.2.1.1
complement(97087..97347)    Gene 12  2.7.7.6
complement(94928..97060)    Gene 58  2.5.6.3, 3.1.8.2
79951..81162    Gene 3   1.7.1.2
complement(87022..87837)    Gene 77
10..1008    Gene 500    
100059..100178  Gene 501    
100470..104660  Gene 502     6.3.5.5
100715..100861  Gene 503    
101721..103604  Gene 504    
101782..103698  Gene 505     1.1.1.49
1018..1137  Gene 506    
10230..11636    Gene 507     6.3.1.2
102328..104049  Gene 508    
10321..12651    Gene 509     2.3.1.16, 2.3.1.9
103256..104290  Gene 510    
103607..104647  Gene 511    
103658..104662  Gene 512     4.1.3.16, 4.1.2.14
103732..106095  Gene 513    
104045..106027  Gene 514    
104057..104305  Gene 515    
10416..14087    Gene 516    
104237..105262  Gene 517     3.1.11.2 

Expected output:

10..1008    Gene 500
1018..1137  Gene 506
10321..12651    Gene 509     2.3.1.16, 2.3.1.9 
10230..11636    Gene 507     6.3.1.2
102328..104049  Gene 508    
100059..100178  Gene 501    
100470..104660  Gene 502     6.3.5.5
100715..100861  Gene 503    
101721..103604  Gene 504    
101782..103698  Gene 505     1.1.1.49
103256..104290  Gene 510    
103607..104647  Gene 511    
103658..104662  Gene 512     4.1.3.16, 4.1.2.14
103732..106095  Gene 513    
104045..106027  Gene 514    
104057..104305  Gene 515    
10416..14087    Gene 516    
104237..105262  Gene 517     3.1.11.2
79951..81162    Gene 3   1.7.1.2
81228..81500    Gene 22 
81500..81856    Gene 17 
84307..85275    Gene 23  2.7.4.8
86756..87025    Gene 36 
88076..90292    Gene 34 
93492..94931    Gene 92  2.2.1.1
complement(82256..84292)    Gene 75 
complement(85325..86527)    Gene 34 3.5.1.32
complement(87022..87837)    Gene 77 
complement(90289..92415)    Gene 89  3.6.1.-
complement(92373..93341)    Gene 45 
complement(94928..97060)    Gene 58  2.5.6.3, 3.1.8.2
complement(97087..97347)    Gene 12  2.7.7.6
complement(97435..98067)    Gene 91    

Thank you

What do you know about perl so far?

First, try to read each line and parse it into 3 variables. Then print them out.

Next, instead of printing them out, modify the program to save it in an array (or some data structure). Print out the array.

Next, modify the program to sort the array before printing it out.
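The first two steps could be sketched as follows. This is only an illustration, assuming the columns are tab-separated (as the split in the later attempt suggests); the helper name parse_line and the hash keys are made up for this example, and the sample lines are taken from the data above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-separated line into its three columns.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    # A limit of 3 keeps any further tabs inside the third column.
    my ($location, $gene, $id) = split /\t/, $line, 3;
    # Trim padding; a missing third column becomes an empty string.
    for ($location, $gene, $id) {
        $_ = '' unless defined $_;
        s/^\s+|\s+$//g;
    }
    return { location => $location, gene => $gene, id => $id };
}

# Save the parsed rows in an array instead of printing them right away.
my @rows = map { parse_line($_) } (
    "81228..81500\tGene 22\t\n",
    "84307..85275\tGene 23\t2.7.4.8\n",
    "complement(87022..87837)\tGene 77\n",
);

# Print the stored rows back out, one per line.
print join("\t", @{$_}{qw(location gene id)}), "\n" for @rows;
```

Keeping each row as a hash reference in an array means the sorting step only has to reorder the array; the three columns stay together.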

Thanks for your suggestions. Here's my attempt, which did not work quite as required:

#!C:\Perl\bin\perl.exe

$data = 'input.txt';

open(DATA,"<$data") or die "can't open input $data $!";

while(<DATA>){
    ($location,$GeneName, $ID) = split('\t', $_);
    chomp($ID);
    #$location{$GeneName} = $location;
    #$ID{$GeneName} = $ID;
    $GeneName{$location} = $GeneName;
    $ID{$location} = $ID;
    #$location{$GeneName,$ID} = $location;

}
close(DATA);

open(OUT, ">outFile.csv") or die "can't open output outFile.csv $!";
foreach $location (sort keys %GeneName){
    print OUT "$location\t,$GeneName{$location}\t,$ID{$location}\n";
}

close(OUT);

Output:

"10..1008   "   "Gene 500   "       
"100059..100178 "   "Gene 501   "       
"100470..104660 "   "Gene 502   "   6.3.5.5 
"100715..100861 "   "Gene 503   "       
"101721..103604 "   "Gene 504   "       
"101782..103698 "   "Gene 505   "   1.1.1.49    
"1018..1137 "   "Gene 506   "       
"10230..11636   "   "Gene 507   "   6.3.1.2 
"102328..104049 "   "Gene 508   "       
"10321..12651   "   "Gene 509   "   2.3.1.16     2.3.1.9
"103256..104290 "   "Gene 510   "       
"103607..104647 "   "Gene 511   "       
"103658..104662 "   "Gene 512   "   4.1.3.16     4.1.2.14
"103732..106095 "   "Gene 513   "       
"104045..106027 "   "Gene 514   "       
"104057..104305 "   "Gene 515   "       
"10416..14087   "   "Gene 516   "       
"104237..105262 "   "Gene 517   "   3.1.11.2    
"79951..81162   "   "Gene 3 "   1.7.1.2 
"81228..81500   "   "Gene 22    "       
"81500..81856   "   "Gene 17    "       
"84307..85275   "   "Gene 23    "   2.7.4.8 
"86756..87025   "   "Gene 36    "       
"88076..90292   "   "Gene 34    "       
"93492..94931   "   "Gene 92    "   2.2.1.1 
"complement(82256..84292)   "   "Gene 75    "       
"complement(85325..86527)   "   "Gene 34    "   3.5.1.32    
"complement(87022..87837)   "   "Gene 77    "       
"complement(90289..92415)   "   "Gene 89    "   3.6.1.- 
"complement(92373..93341)   "   "Gene 45    "       
"complement(94928..97060)   "   "Gene 58    "   2.5.6.3  3.1.8.2
"complement(97087..97347)   "   "Gene 12    "   2.7.7.6 
"complement(97435..98067)   "   "Gene 91    "       

The rows starting with "1" are not well sorted, and the 3rd column is split in two in the output whenever it contains two numbers. Please ignore the quotation marks; they are a carry-over from the Excel file output. How can I improve this code? Thanks

The rows starting with "1" are not well sorted

This is because you're sorting them as strings, not numbers. As strings, "100059..100178" comes before "1018..1137" because the comparison goes character by character. I'm guessing that you want to sort them by (1) whether or not "complement" exists, (2) the first number, and (3) the second number. Look into the examples on sort in the perldocs.
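A minimal sketch of that three-level sort, assuming the guessed ordering is what's wanted (the expected output in the question is not strictly numeric within the rows starting with "1", so treat this as one interpretation). The helper names location_key and sort_locations are hypothetical, and the sample lines come from the posted data.

```perl
use strict;
use warnings;

# Hypothetical helper: build a sort key [complement flag, start, end]
# from a line whose first column is the location.
sub location_key {
    my ($line) = @_;
    my $is_complement = $line =~ /^complement\(/ ? 1 : 0;
    my ($start, $end) = $line =~ /(\d+)\.\.(\d+)/
        or die "Unrecognised location: $line\n";
    return [ $is_complement, $start, $end ];
}

# Sort lines by (1) complement flag, (2) first number, (3) second number.
sub sort_locations {
    my @lines = @_;
    # Schwartzian transform: compute each key once, sort, then drop the keys.
    return map  { $_->[1] }
           sort { $a->[0][0] <=> $b->[0][0]
               or $a->[0][1] <=> $b->[0][1]
               or $a->[0][2] <=> $b->[0][2] }
           map  { [ location_key($_), $_ ] } @lines;
}

my @sorted = sort_locations(
    "complement(85325..86527)\tGene 34\t3.5.1.32",
    "100059..100178\tGene 501",
    "1018..1137\tGene 506",
    "complement(82256..84292)\tGene 75",
    "10..1008\tGene 500",
);

print "$_\n" for @sorted;
```

The <=> operator compares numerically, so 1018 sorts before 100059; the default sort (or cmp) would compare the same values as text.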
