I am having two file (File A and File B) of sequence. In File A, thousand of sequences are in fasta format (For example:>seq1
GGTGTTTGCGTGGTTTCATGAAGGTTTTACTaTCCCGAGGAGCCTCATTAAATTGGCAAGA
CTCATTAAATTGGCAAGAGGTGTTTGCGTGGTTTTGGCAAGAGGTGTTTGCGGGTTTTAAA

seq2
CTCATTAAATTGGCAAGAGGTGTTTGCGTGGTTTTGGCAAGAGGTGTTTGCGGGTTTTAAA
TTTGCGTGGTTTCATGAAGGTTTTACTaTCCCGAGGAGCCTCATTAAATTGGCAAAAAAAA

In File B, Two differnt things are avilable first is the sequence number i.e. >seq1 and its position number 10-120 similarly others pattern are like >seq1, 8-84 , >seq2 11-45 are avilable.

I want to first search sequence file of File B i.e. >seq in File A if it is matched than extract the specific position of sequence i.e. 10-120 and get output in other file with sequence like >seq1 TTGCGTGGTTTCATGAAGGTTTTACTaTCCCGAGGAGCCTCATTAAATTGGCAAGACTCATTAAATTGGCAAGAGGTGTTTGCGTGGTTTTGG.

I will really thankful for kind of help.

Recommended Answers

All 2 Replies

Hi Jassipox,

There are several ways of solving this problem. I would have really love it, if you have shown what you have tried so far on your own. However, I will show a solution here, that really solved the problem you posted. And I can only wish you take it as a guide to knowing and working with Perl the more.
Here we go:

#!/usr/bin/perl
use warnings;
use strict;

my $seq_name_n_range;    # variable to take in names & range of seq.

open my $fh,  '<', 'seq_name_pos.txt' or die "can't open file: $!";
open my $fh2, '<', 'seq1.txt'         or die "can't open file: $!";
<$fh>;                   ## remove the header
while (<$fh>) {
    chomp;
    next if /^$/;
    my ( $name, $intial_position, $final_position ) = split;
    push @{$seq_name_n_range},
      { name => $name, position => [ $intial_position, $final_position ], };

}
{
    local $/ = ">";      ## change the input record seperator value

    while (<$fh2>) {
        chomp;
        next if /^$/;
        my ( $seq_name, $data ) = split /\n/, $_;

        while ( my ( $key, $value ) = each @{$seq_name_n_range} ) {

            if ( $value->{'name'} eq $seq_name ) {

                output_string(    # call subroutine output_string
                    {             # pass anonymous hash
                        data     => $data,
                        position => $value->{'position'},
                        name     => $value->{'name'},
                    }
                );
            }
        }
    }
    close $fh2 or die "can't close file: $!";

}
close $fh or die "can't close file: $!";

sub output_string {
    my ($input_has_ref) = @_;

    printf "Name of Sequence: %s  Range:%s  -  %s\n", $input_has_ref->{'name'},
      $input_has_ref->{'position'}[0], $input_has_ref->{'position'}[1];

    push my @data_array, split //, $input_has_ref->{'data'};

    print @data_array[ ( $input_has_ref->{'position'}[0] )
      .. ( $input_has_ref->{'position'}[1] ) ], "\n\n";
}

My Output can be seen the file attached to this mail. Please check it.
The only issue I have with this code is that all the counting starts FROM 0 NOT 1. i.e All the array start from 0. If that is what you want fine, you could use this without any change. But if you want you counting to start from 1, you might have to change this line

print @data_array[ ( $input_has_ref->{'position'}[0] )
      .. ( $input_has_ref->{'position'}[1] ) ], "\n\n";

a bit. And that you have to figure out. It really simple.
I can only hope this helps.
cheers,
~ 2t

You may also want to look at the Bio::DB::Fasta module. I haven't tried it but according to the docs it has the following method to extract specific substrings from fasta files:

$raw_seq = $db->seq($id [,$start, $stop])
Return the raw sequence (a string) given an ID and optionally a start and stop position in the sequence.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.