How to compare names and find the amino acid from different sequences

Question

biojet 0 Junior Poster in Training

13 Years Ago

Hi all,

I am trying to write a script that finds the 3 amino acids at a given position in different sequences and puts the results together in one data file.
Input 1 (data1.txt):

Name            posi
 14067_contig01	 18
 14067_contig05	 8 
 14067_contig03	 26
.......................

Input 2 (data2.txt):

>gi|354512101|gb|AGQQ01000001.1| Corynebacterium glutamicum ATCC 14067 Contig01, whole genome shotgun sequence
TTAGCCAGGAAACGCTTCGCTGCCGCGACGT
>gi|354512096|gb|AGQQ01000002.1| Corynebacterium glutamicum ATCC 14067 Contig02, whole genome shotgun sequence
TTAGCCAGGAAACGCTTCGCTGCCGCGACGTTGCGCTTCGGAGAGAGGTAAAAGTCCAGG
TTGAGGGTGGGGCGCAGCAGGAGCATCGCAATCGCGA
>gi|354511893|gb|AGQQ01000003.1| Corynebacterium glutamicum ATCC 14067 Contig03, whole genome shotgun sequence
GTTTGTATCCGTATTACTGCGGATCGTATCGAAGAGGGCGTCGGGAAGGTCAAACGCGCC
>gi|354511890|gb|AGQQ01000004.1| Corynebacterium glutamicum ATCC 14067 Contig04, whole genome shotgun sequence
GGTGTGTAAATTAATTCCAGTCAGCGCGACCAACAACGCC
>gi|354511864|gb|AGQQ01000005.1| Corynebacterium glutamicum ATCC 14067 Contig05, whole genome shotgun sequence
CGTGCCTTGCCCTTTCCGCAATGAGTGCATCGCCTCCATCCTTTCAACGTCCGATATGCA

The output I hope to get:

Name               posi          amino acid 
 14067_contig01	 18               CGC
 14067_contig05	 8                TGC
 14067_contig03	 26               GTA
............................................

Could you please show me how to solve this problem?

perl

d5e5 109 Master Poster

13 Years Ago

#!/usr/bin/perl
use strict; 
use warnings; 

my %aas; #Hash to store amino acids

read_amino_acids('data2.txt');

read_positions('data1.txt');

sub read_amino_acids{
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Failed to open $filename: $!";
    my ($name, $key);
    while (<$fh>){
        s/\s+$//; #Remove end-of-line characters
        my @flds = split /\|/;
        if (@flds > 1){
            $name = $flds[4];
            $name =~ m/(\d+)\s(\w+\d\d)/;
            $key = lc("$1_$2");
            undef $aas{$key} unless exists $aas{$key};
        }
        else{
            $aas{$key} .= $_;
        }
    }
}

sub read_positions{
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Failed to open $filename: $!";
    
    print "Name               posi          amino acid\n";

    while (<$fh>){
        s/\s+$//; #Remove end-of-line characters
        my ($name, $pos) = split;
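        next unless defined $name and $name =~ m/^\d+_/; #Skip blank, header and filler lines
        next unless defined $pos and defined $aas{$name}; #Skip names with no sequence in data2.txt
        my $tuple = substr $aas{$name}, $pos - 1, 3; #Positions are 1-based, substr is 0-based
        printf "%s%7d%16s\n", ($name,$pos,$tuple);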
    }
}
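
The position arithmetic in read_positions can be checked on its own. This is only a minimal sketch, not part of the script above: triplet_at is a hypothetical helper name, the sequence is the Contig01 line from data2.txt, and it assumes the positions in data1.txt are 1-based, which is what the expected output suggests.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the 3 characters starting at a 1-based position.
# substr counts from 0, so the position is shifted down by one.
sub triplet_at {
    my ($seq, $pos) = @_;
    return substr($seq, $pos - 1, 3);
}

my $contig01 = 'TTAGCCAGGAAACGCTTCGCTGCCGCGACGT'; # Contig01 from data2.txt
print triplet_at($contig01, 18), "\n";           # prints CGC
```

Keeping the offset in one small sub makes it easy to change if the positions ever turn out to be 0-based.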

biojet 0 Junior Poster in Training · Answer 1 · 2012-02-22T15:17:13+00:00

Wow, it is wonderful. Thanks d5e5!
Could you explain what these two statements mean and why we should use them?
$name =~ m/(\d+)\s(\w+\d\d)/;
$key = lc("$1_$2");
All I know is:
\d+ : matches a digit [0-9]
\w+ : matches an alphabetic character
\s : matches a whitespace character
Thank you so much.

d5e5 109 Master Poster · Answer 2 · 2012-02-22T22:36:57+00:00

$name =~ m/(\d+)\s(\w+\d\d)/; means that when the text in $name contains a string of one or more digits, followed by a whitespace character, followed by one or more word characters, followed by exactly two digits, we want the first string of digits saved in $1 and the word characters plus the two digits after them saved in $2.

Because we don't want to use all the text in $name, we capture the desired text by means of parentheses in the regex pattern. When a match occurs, the value corresponding to the pattern in the first parentheses is captured into the special Perl variable $1 and the value corresponding to the pattern in the second parentheses is captured into $2. We want the key in the hash to be the same as the names we will read from data1.txt, so we put $1 and $2 together in a string with an underscore _ between them. See Extracting Matches.
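
To see the captures in isolation, here is a small sketch run against the first header line of data2.txt. capture_parts, $strain and $contig are hypothetical names used only for this illustration; the split and the pattern are the same as in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the two captured pieces of a header line,
# or an empty list when the line has no 5th field or the pattern fails.
sub capture_parts {
    my ($header) = @_;
    my $name = (split /\|/, $header)[4];  # text after the last '|'
    return unless defined $name;
    return $name =~ m/(\d+)\s(\w+\d\d)/;  # in list context this returns ($1, $2)
}

my $header = '>gi|354512101|gb|AGQQ01000001.1| Corynebacterium glutamicum ATCC 14067 Contig01, whole genome shotgun sequence';
my ($strain, $contig) = capture_parts($header);
print "$strain\n";            # prints 14067
print "$contig\n";            # prints Contig01
print "${strain}_$contig\n";  # prints 14067_Contig01
```

Returning the match result in list context avoids reading $1 and $2 after a failed match, when they may still hold values from an earlier line.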

\d matches one digit.
\d+ matches one or more digits (see Matching Repetitions).
"\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts" (see Using Character Classes).

$key = lc("$1_$2"); means $key will contain the value of $1, followed by _, followed by the value of $2, all converted to lower case.
We have to make the contents of $key lower-case ('14067_contig01' not '14067_Contig01') because the names in data1.txt are lower-case and hash keys are case-sensitive; that is what lc() does.
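
The effect of lc() on the hash lookup can be tried with a short sketch. make_key is a hypothetical helper name, and the hash here holds just one sequence from data2.txt for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a lower-case key in the same form as the names in data1.txt.
sub make_key {
    my ($strain, $contig) = @_;
    return lc("${strain}_$contig");
}

my %aas = ( make_key('14067', 'Contig01') => 'TTAGCCAGGAAACGCTTCGCTGCCGCGACGT' );

# Hash keys are case-sensitive, so only the lower-case form is found.
my $lower = exists $aas{'14067_contig01'} ? 'found' : 'missing';
my $mixed = exists $aas{'14067_Contig01'} ? 'found' : 'missing';
print "$lower\n";  # prints found
print "$mixed\n";  # prints missing
```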

biojet 0 Junior Poster in Training · Answer 3 · 2012-02-23T15:13:51+00:00

Thanks a lot d5e5!
