When parsing an EMBL record (attached), do I follow the same steps as when I parse a GenBank record? I have to print out the ID, KW, OC, and SQ fields once I have parsed the record. I have code that parses a GenBank record and would like to follow the same route if possible.

#!/usr/bin/perl
# Extract the annotation and sequence sections from the first
#  record of a GenBank library

use strict;
use warnings;
use BeginPerlBioinfo; 

# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'sequence.gb';
my $save_input_separator = $/;

# Open GenBank library file
unless (open(GBFILE, '<', $filename)) {
    print "Cannot open GenBank file \"$filename\": $!\n\n";
    exit;
}

# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";

$record = <GBFILE>;

# reset input separator 
$/ = $save_input_separator;

# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# Print the two pieces, which should give us the same as the
#  original GenBank file, minus the // at the end
print $annotation, $dna;

exit;

I really don't know the bioinformatics subject matter involved here. I tried changing the regex and adding a chomp statement because including the newline \n in my regex caused it to fail on my computer for some reason. Here is what I changed:

# Now separate the annotation from the sequence data
#($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);#GenBank layout
($annotation, $dna) = ($record =~ /^(.*SQ\s*)(.*)\/\//s); # Trying to match EMBL layout
chomp($annotation, $dna);
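
One thing to watch with the pattern above: `.*SQ` is greedy and not tied to the start of a line, so it matches up to the last "SQ" anywhere in the record. A minimal sketch of a stricter split follows, assuming the SQ line and the terminating // each begin a line. The record here is a made-up, shortened example, not the attached file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, heavily shortened EMBL-style record used only for illustration.
my $record = <<'END';
ID   X00001; SV 1; linear; mRNA; STD; PLN; 12 BP.
KW   example.
SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;
     acgtacgtac gt                                                      12
//
END

# With /m, ^ matches at the start of any line, so "SQ" and "//" are only
# recognized as line codes, not as text inside another line.
# With /s, . also matches newlines, so each group can span several lines.
my ($annotation, $sequence) =
    $record =~ m{\A(.*?^SQ[^\n]*\n)(.*?)^//}ms;

die "Record does not look like an EMBL entry\n" unless defined $sequence;

print "Annotation:\n$annotation";
print "Sequence:\n$sequence";
```

Here the SQ header line stays with the annotation and only the sequence lines end up in $sequence.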

Okay, I modified the script with that change and got the entire record. To get the specific data I want, should I specify those fields when I declare my variables? Thanks

Hi, since I need to print out the $ID, $SQ, $KW, and $OC within the file, should I declare them as variables and then print them out? Thanks

Why declare four scalar variables just to store the four literal values you want to look for at the beginning of the lines? Also, I don't know why you want to follow the same route as illustrated in the script you posted. That script reads one record and splits it into two multi-line strings, $annotation and $dna, which it then prints. Why do that if what you really want is to print the lines from the file that begin with ID, SQ, KW, or OC?

Why not take the following approach?

  1. Read the file one line at a time
  2. Test each line to see if it begins with ID, SQ, KW, or OC
  3. Decide whether or not to print the line based on the result of the test

Maybe I don't understand what you mean by 'parsing' the file but it seems to me that a simple script like the following does what you say you want to do:

#!/usr/bin/perl
#embl01.pl
use strict;
use warnings;

my $filename = '/home/david/Programming/data/EMBL_records.txt';

# Three-argument open, so the filename is never interpreted as a mode
open my $fh, '<', $filename or die "Could not open $filename: $!";

while (<$fh>) {
    chomp;
    # Does the line start with ID, SQ, KW, or OC?
    if (m/^(ID|SQ|KW|OC)/) {
        print $_, "\n";
    }
}

This gives the following output:

ID   M91373; SV 1; linear; mRNA; STD; PLN; 1131 BP.
KW   peroxidase.
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   fabids; Cucurbitales; Cucurbitaceae; Cucumis.
SQ   Sequence 1131 BP; 314 A; 276 C; 229 G; 312 T; 0 other;
ID   M57705; SV 1; linear; mRNA; STD; ROD; 237 BP.
KW   thyroid peroxidase.
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea;
OC   Muridae; Murinae; Rattus.
SQ   Sequence 237 BP; 56 A; 87 C; 45 G; 49 T; 0 other;
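
If the fields need to end up in variables or go through a subroutine rather than straight to the screen, something like the following might work. It is only a sketch: parse_embl_fields is a hypothetical helper written for this post (it is not part of BeginPerlBioinfo), and the two records are made-up, shortened examples read from a string instead of a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the text of the requested line codes from one EMBL-style record.
# Returns a hash reference: line code => array reference of line contents.
# parse_embl_fields is a hypothetical helper, not part of BeginPerlBioinfo.
sub parse_embl_fields {
    my ($record, @codes) = @_;
    my %wanted = map { $_ => 1 } @codes;
    my %fields;
    for my $line (split /\n/, $record) {
        # A line code is two capital letters, then whitespace, then content.
        next unless $line =~ /^([A-Z]{2})\s+(.*)$/;
        push @{ $fields{$1} }, $2 if $wanted{$1};
    }
    return \%fields;
}

# Two hypothetical, shortened records for illustration only.
my $data = <<'END';
ID   X00001; SV 1; linear; mRNA; STD; PLN; 8 BP.
KW   example one.
OC   Eukaryota; Viridiplantae;
OC   Streptophyta.
SQ   Sequence 8 BP; 2 A; 2 C; 2 G; 2 T; 0 other;
     acgtacgt                                                            8
//
ID   X00002; SV 1; linear; mRNA; STD; ROD; 4 BP.
KW   example two.
OC   Eukaryota; Metazoa.
SQ   Sequence 4 BP; 1 A; 1 C; 1 G; 1 T; 0 other;
     acgt                                                                4
//
END

# Read the in-memory text one record at a time by treating "//\n" as the
# record terminator, the same trick the GenBank script uses.
open my $fh, '<', \$data or die "Could not open in-memory data: $!";
my @records;
{
    local $/ = "//\n";
    while (my $record = <$fh>) {
        push @records, parse_embl_fields($record, qw(ID KW OC SQ));
    }
}
close $fh;

for my $fields (@records) {
    for my $code (qw(ID KW OC SQ)) {
        # Join continuation lines (such as several OC lines) into one string.
        print "$code: ", join(' ', @{ $fields->{$code} || [] }), "\n";
    }
    print "\n";
}
```

Keeping each field as a list of lines means a multi-line field such as OC can be joined or printed line by line, whichever the assignment asks for.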

Thank you for your response. I guess I should not have tried to follow the same procedure as the one I posted. I simply needed to retrieve the four pieces of information and use a subroutine. I think I was confusing the procedures because I wanted to use a subroutine from BeginPerlBioinfo.
