Hi, I have a file like this, separated by tabs:

  Code  Info              source          start     end    

GB01672 rpsblast_to_CDD     protein_match       
GB01672 rpsblast_to_CDD     match_part  296     988
GB01673 rpsblast_to_CDD     protein_match       
GB01673 rpsblast_to_CDD     match_part  3803    4147
GB01673 rpsblast_to_CDD     match_part  1314    1907
GB01673 rpsblast_to_CDD     match_part  3516    3932
GB01673 rpsblast_to_CDD     match_part  3335    3463
GB01674 rpsblast_to_CDD     protein_match       
GB01674 rpsblast_to_CDD     match_part  3724    406
GB01674 rpsblast_to_CDD     match_part  1314    1907
GB01674 rpsblast_to_CDD     match_part  3335    385

In every protein_match row, start and end are empty, so I need to fill them in for each code. The start should be the lowest start among the match_part rows (which sit just below the protein_match row for each code), and the end should be the highest end among those match_part rows. Keep in mind that each code can have several match_parts.

For example, the output for the code GB01673 should be:

 GB01673 rpsblast_to_CDD    protein_match      **1314   4147**      
GB01673 rpsblast_to_CDD     match_part  3803    4147
GB01673 rpsblast_to_CDD     match_part  1314    1907
GB01673 rpsblast_to_CDD     match_part  3516    3932
GB01673 rpsblast_to_CDD     match_part  3335    3463

I would really appreciate it if someone could help me!

Thanks

Last Post by d5e5
#!/usr/bin/perl
use strict;
use warnings;

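# Slurp the tab-separated records from the DATA section below.
my @array = <DATA>;

# Walk the rows bottom-up so every match_part is seen before the
# protein_match row that sits above it in the file.
my ($s_min, $e_max);
foreach (reverse @array){
    chomp;
    next if m/^\s/;    # the indented header row is left as it is
    my ($code, $info, $source, $start, $end) = split;
    if ($source eq 'match_part'){
        # Track the smallest start and largest end seen for this code.
        $s_min = $start if !defined $s_min or $start < $s_min;
        $e_max = $end   if !defined $e_max or $end   > $e_max;
    }
    else{
        # protein_match row: fill in the range, then reset for the next code.
        ($start, $end) = ($s_min, $e_max);
        ($s_min, $e_max) = (undef, undef);
    }
    # Keep the Info column so the output has the same columns as the input.
    $_ = join "\t", $code, $info, $source, $start, $end;
}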

foreach (@array){
    print "$_\n";
}
__DATA__
  Code  Info              source          start     end    
GB01672 rpsblast_to_CDD     protein_match       
GB01672 rpsblast_to_CDD     match_part  296     988
GB01673 rpsblast_to_CDD     protein_match       
GB01673 rpsblast_to_CDD     match_part  3803    4147
GB01673 rpsblast_to_CDD     match_part  1314    1907
GB01673 rpsblast_to_CDD     match_part  3516    3932
GB01673 rpsblast_to_CDD     match_part  3335    3463
GB01674 rpsblast_to_CDD     protein_match       
GB01674 rpsblast_to_CDD     match_part  3724    406
GB01674 rpsblast_to_CDD     match_part  1314    1907
GB01674 rpsblast_to_CDD     match_part  3335    385
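
If the match_part rows are not guaranteed to sit directly below their protein_match row, a two-pass variant keyed on the code may be more robust. This is only a sketch: fill_protein_match is a made-up name, and it assumes whitespace-separated columns in the order Code, Info, source, start, end as in the sample above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes the raw lines and
# returns them with each protein_match row filled in from the match_part
# rows that share its code. Assumes the column order Code, Info, source,
# start, end.
sub fill_protein_match {
    my @lines = @_;    # work on a copy so the caller's data is untouched
    my (%min_start, %max_end);

    # Pass 1: record the smallest start and the largest end per code.
    for my $line (@lines) {
        my ($code, $info, $source, $start, $end) = split ' ', $line;
        next unless defined $source && $source eq 'match_part';
        next unless defined $end;
        $min_start{$code} = $start
            if !exists $min_start{$code} || $start < $min_start{$code};
        $max_end{$code} = $end
            if !exists $max_end{$code} || $end > $max_end{$code};
    }

    # Pass 2: rewrite only the protein_match rows; other rows pass through.
    my @out;
    for my $line (@lines) {
        my ($code, $info, $source) = split ' ', $line;
        if (defined $source && $source eq 'protein_match'
            && exists $min_start{$code}) {
            $line = join "\t", $code, $info, $source,
                               $min_start{$code}, $max_end{$code};
        }
        else {
            $line =~ s/\s+\z//;    # drop the newline and trailing whitespace
        }
        push @out, $line;
    }
    return @out;
}

# Usage with the GB01673 rows from the question.
my @rows = (
    "GB01673\trpsblast_to_CDD\tprotein_match\t\t\n",
    "GB01673\trpsblast_to_CDD\tmatch_part\t3803\t4147\n",
    "GB01673\trpsblast_to_CDD\tmatch_part\t1314\t1907\n",
    "GB01673\trpsblast_to_CDD\tmatch_part\t3516\t3932\n",
    "GB01673\trpsblast_to_CDD\tmatch_part\t3335\t3463\n",
);
print "$_\n" for fill_protein_match(@rows);
```

With those five rows, the protein_match line comes out with start 1314 and end 4147, which matches the output requested in the question. The hash lookup costs a second pass over the data, but it no longer matters where the match_part rows appear relative to their protein_match row.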