Hi there...

I am working one a perl script program and hope someone can help me , I'll give you a quick description of my project:

My program is given a fasta file, a signal description and a deviation (a number) as input om the command line.
A fasta file look like this :

>U00659.CDS.1  product:"insulin GGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
AAAGACCAGACGGAGATGATGGTAAAGAGAGGTATTGTAGA
>X13559.CDS.1  product:"preproinsulin " DNA org:"Oncorhynchus keta" (CDS extraction)
ATGGCCTTCTGGCTCCAAGCTGCATCTCTGCTGGTGTTGCTGGCGCTCTCCCCCGGGGTA
GATGCTGCAGCTGCCCAGCACCTGTGTGGCTCTCACCTGGTGGACGCCCTCTATCTGGTG
TGTGGAGAGAAAGGATT
>J02989.CDS.1  note:"preproinsulin " DNA org:"Aotus trivirgatus" (CDS extraction)
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTC

The first line of a FASTA file is a header and begins with >, thise line should be ignored
the main thing is the sequence (ATCGCGCTATA)hoe i want to match..

A Signal description file is a text file that look like this:

# Shine-Delgarno
T    7
T    8
G    6
A    5
C    5
A    5
# intervening unimportant bases
*    15-21
# Pribnow box
T    8
A    8
T    6
A    6
AT    5
T    8

1) one or more allowed letters at this position and a penalty
for having a mismatch at that position.

2) the star character denoting unimportant characters in the sequence and an interval where these
unimportant characters are allowed.

3) the hash character meaning this line is a comment, and should be ignored by the program.

Okay now to the main thing, the output should list all matches in each fasta entry, clearly stating the location of the match.

The deviation is an important factor. If the deviation is set to 0, then it should search for the signal
is reduced to a regular expression. If the deviation is set to 16 in the above example,
then mismatches with the combined penalty of 16 or less are allowed.

I have try this so far but i cant figure out how to used tha deviation number and set the patternmatch I am pretty lost:

#!/usr/bin/perl -w


use strict;
#############
#  Step 1   #
#############
#The program is given fasta file, a signal description file and a deviaton number as input on the command line comments if there are erros:
#Erros: be sured that deviation is a number



sub usage {
my ($msg) = @_;
print "$msg\n\n" if defined $msg;
print "Usage: project.pl <fastafile.fsa> <signaldescriptionfile.txt> <deviation>\n";
exit;
}
if (scalar @ARGV !=3){
&usage("Wrong number of arguments");
}


my ($fastafile, $signaldescription, $deviation) = @ARGV ;


if ($deviation =~ m/^\d+$/){ #correct input
print "Thanks!\n";
}else{
&usage ("I want a number please!");
}


################
#    Step 2    #
################
# working with signal description:
#read the file and insure to put penalty and character in two seperate arrays,
#the # should be ignored
#the * unimportant sequence and should be ignored at position 15 -21 (have figure that yet):


open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
my $comment ='';
while (defined (my $line = <IN>)) {
chomp ($line);
if ($line =~ m/^#/) {
if ($comment ne ''){
my ($character, $penalty) = split (' ',$line);
push  @character, $character;
push  @penalty, $penalty;
}
}
}



close IN;



############
#  Step 3  #
############
#work with fasta  file:
# Use regular Expresions to look at the fasta file and ignore the first line:



# $fragment: the pattern to search for
# $fraglen:  the length of $fragment
# $buffer:   a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA


my($fragment, $fraglen, $buffer, $position) = (@karaktere, '', 0);


my ($headline, $line, $dna) = ('', '', '');


open(IN, '>', $fastafilename) or die "Could not read file ($fastafile)\n";


# The first line of a FASTA file is a header and begins with '>'


while (defined ($line = <IN>)) {
if ($line =~ m/^>/) {
if ($headline ne '') {   #after the sequence is readed  i wanna look for the match


#write data to file (the matches):
chomp $headline;
print OUT "$headline\n";
for (my $i = 0; $i < length($reversecomplementdna); $i += 60) {
print OUT substr($reversecomplementdna, $i, 60), "\n";
}
# Get ready for next turn in the loop
$dna = '';+
}
$headline = $line;
}
else {
# Read the DNA
chomp $line;
$dna .= $line;
}
}
#########################

Thanks alot for your time, i really apriciet your time and if you can help me....

thanxxxx
MojoS

Recommended Answers

All 27 Replies

Is this school work?

Yes its one of my project that I am working on at my uni, and i will appreciate if someone can help me out by giving me some comments and ideas on how to work it out, because Iam pretty lost...

Thanxx

First thing you have to do is fix the errors:

Global symbol "@karaktere" requires explicit package name at script line 68.
Global symbol "$fastafilename" requires explicit package name at script line 72.
Global symbol "$reversecomplementdna" requires explicit package name at script line 83.
Global symbol "$reversecomplementdna" requires explicit package name at script line 84.
syntax error at script line 88, near "}"

How come you are not getting help from a teacher or fellow student?

Thanks for looking throught it; unfortunately my professor is not always available and it would take some time before I could go on with this without any advice(because I'm stuck). As I'm doing this project on my own I don't have any fellowstudents to aks or exchange views with.
But thanks anyway

OK, well, fix those errors otherwise your code will not even compile.

okay I have fix it thanks!

Okay but can you advice me how to work with pattern match, if i want to ignore some character at a specific position.....

signaldescription:
# Shine-Delgarno
T 7
T 8
G 6
A 5
C 5
A 5
* 15-21 # intervening unimportant bases
# Pribnow box
T 8
A 8
T 6
A 6
AT 5
T 8


open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
my $comment ='';
while (defined (my $line = <IN>)) {
chomp ($line);
if ($line =~ m/^#/) {
if ($comment ne ''){
my ($character, $penalty) = split (' ',$line);
push @character, $character;
push @penalty, $penalty;
}
}
}


close IN;


Here I have tried to save the penalty that contain the characters in an array and the penalties that contain number in another array but I cant figure out how to perlscript my program to skip a certain position when it meet the character *...

thanks..

If have the time I might try and help. Your question, or questions, are really more than asking for general help and will take some time to try and assist you with.

what is $comment being used for?

my $comment ='';

further on you have:

if ($comment ne ''){

but $comment is blank ('') so that condition is always false so the expressions that follow are never evaluated.

Okay I see, but thanks alot for you help I appriciet your time really....
The hash character # is indicating that this line is a comment, and should be ignored by the program. thats why I have my $comment ='';....
about if ($comment ne ''){ isn't that the why to ignore someting after you have readed it !!!!

I am confused by what you said.

($comment ne '') is not a way to ignore something after you read it, but I am not sure what you mean by "after you read[ed] it".

The past tense of "read" is spelled "read" in english, not readed, the pronunciation is different though.

If you wanted to ignore lines that start with "#" in a loop:

next if (/^#/);

hmm okay lat me explained it in another way...I have this data file containing some characters, what I want is to use my txt file (so called signaldescription (what I wanna match with)) to finde where the matches is (position) and by that I have to read both files... isnt that right
you're probably more confused now :(

You will have to read both files. But lets look at his part of the code you posted:

open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
my $comment ='';
while (defined (my $line = <IN>)) {
   chomp ($line);
   if ($line =~ m/^#/) {
      if ($comment ne ''){
         my ($character, $penalty) = split (' ',$line);
         push @character, $character;
         push @penalty, $penalty;
      }
   }
}

$comment is defined as '' (an empty string). Then in the code you check to see if $comment is not equal to an empty string, but of course it is, so the next three lines (the expressions) never are evaluated. If the intention is to skip comments

open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
while (defined (my $line = <IN>)) {
   next if $line =~ m/^#/;
   chomp ($line);
   my ($character, $penalty) = split (' ',$line);
   push @character, $character;
   push @penalty, $penalty;
}

HMM for example the signaldescription file is given, the data file is given and a deviation nr...

signaldescription file:
#ONKLE PHIL
A 3
T 4
G 3
* 4-6 #charcters at pos 4 to 6 should be skipt
#ONKLE TOM
T 5
G 7
C 9

and the data file look like that:
ATGTGTAGTTGCATG

the output should then be:
the program found ONKLE PHIL (ATG) at position 1 and 13 and ONKLE TOM (TGC) at position 9... hmm now the numbers beside the letters are penalty whenever its not founding an A at position one its get a penalty of 3 and the secound position if its not a T it gets a 4 if my diviation nr is put on 5 it will ignore all the characters we have look for untill now because 7 is bigger than 5, after that we get to the third letter etc...!!!!!!
hmm hope its not a headache for u to read...
thanks

yeah okay I see, a stupid question : is my comment then saved if I wanna used it later!!!

yeah okay I see, a stupid question : is my comment then saved if I wanna used it later!!!

No. You would have to save lines that are comments to a variable or an array/list.

the output should then be:
the program found ONKLE PHIL (ATG) at position 1 and 13 and ONKLE TOM (TGC) at position 9... hmm now the numbers beside the letters are penalty whenever its not founding an A at position one its get a penalty of 3 and the secound position if its not a T it gets a 4 if my diviation nr is put on 5 it will ignore all the characters we have look for untill now because 7 is bigger than 5, after that we get to the third letter etc...!!!!!!

This is all quite confusing to me. Are you aware there are many modules already written to work on this kind of data, and there is also bioperl. I think it's www.bioperl.org

thanx for the web ill chek it out..

But thanks anyways for your help and your time... ill see how far ill get with it...

finding the position of the match should be easy enough, but I don't unerstand the penalty or what it does.

The penalty number beside the letter in my text file is the number of "punish" for a mismatch...
now the deviation is a number you choose to set on the command line, the deviation is depended on the penalty...
if for exsample the diviation is set to 5 (penalty sum of mismatch) the mismatches with the combined penalty of 5 or less are allowed...

hmm maybe you will understand it better with an example:

my data where I wanna find matches: TTATGTAGA

my deviation is set to 4

my text file who tell us what we wanna look for and what the penalty of each character are set to:

#onkle phil
A 2
T 1
C 3
* 5-7 (ignore the letters at position 5-7)
#onkle tom
T 3
T 4
A 1

hmm okay so first we are looking for my dear onkle phil:
we look at the data that is given TTATGTAGA, onkel phil (ATC) start with an A, the data we are given start with a T so when there is a mismatch we punish it with 2, then onkle phil have a T, that match because at position 2 (in pelr position 1) and at position 2 at the given data there is also a T, so we dont have a penalty (0) then to the third position we have a C comparing with an A , mismatch here with penalty of 3.. now we have exceed the deviation (3+2 (sum of penalties) is greater than 4 (the diviation nr)) so in this case we have to ignore those characters and start over from position 2 (in perl position 1) and the program have to do this untill it found a macth with penalty that is less or equal to deviation............

hmm is that clear or am I too bad to explaine :$

I think I understand your explanation. Have you made any progress?

I think I understand your explanation. Have you made any progress?

No I have made much progress, but I am working on figuring something out!!!
The deadline is tomorrow so I am pretty frustrated how much I am able to hand in!

What I have done so far is :

#!/usr/bin/perl -w

#use warnings;
use strict;
#use carp;

#############
#  Step 1   #
#############
#The program is given fasta file, a signal description file and a deviaton number as input on the command line comments if there are erros:
#Be sured that deviation is a number, and the number of arguments is 3.


sub usage {
   my ($msg) = @_;
   print "$msg\n\n" if defined $msg;
   print "Usage: project.pl <fastafile.fsa> <signaldescriptionfile.txt> <deviation>\n";
   exit;
}
if (scalar @ARGV !=3){
      &usage("Wrong number of arguments");
}    

my ($fastafile, $signaldescription, $deviation) = @ARGV ;

if ($deviation =~ m/^\d+$/){ #correct input
   print "Thanks!\n";
   }else{
   &usage ("I want a number please!");
}

################
#    Step 2    #
################

# working with signal description:
#read the file and insure to put penalty and character into two seperate arrays,
#the # should be ignored
#the * unimportant sequence and should be ignored at the given  position 

open(IN,'<',$signaldescription ) or die "Could not find file ($signaldescription) \n";
my @character = ();
my @penalty = ();
while (defined (my $line = <IN>)) { 
   chomp ($line);
   if ($line =~ m/^#/) {
      chomp ($line);
   #save the unimportant interval for later on
   my $unimportant_characters = $1 if $line = ~ m/^\*\s+(.+)/;
      my ($character, $penalty) = split (' ',$line); 
      push  @character, $character;
      push  @penalty, $penalty;
      }
   }  

   

close IN;


############
#  Step 3  #
############
#work with fasta  file:
# Use regular Expresions to look at the fasta file and ignore the first line:


# $fragment: the pattern to search for
# $buffer:   a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA
# $penalty : the penalty for every mismatch character
# $fraglen : the lengt of $fragment
my($fragment, $buffer, $penalty, $position, $fraglen) = (@character,'',@penalty,0,scalar(@character));

my ($headline, $line, $dna) = ('', '', '');

open(IN, '<', $fastafile) or die "Could not read file ($fastafile)\n";

# The first line of a FASTA file is a header and begins with '>' that have to be ignored.

while (defined ($line = <IN>)) {
   if ($line =~ m/^>/) {
      if ($headline ne '') {   #after the sequence is read  i wanna look for the match
      
      #write data to file (the matches):
      chomp $headline;
      $buffer .=$headline;
      
         # Get ready for next turn in the loop
         $dna = '';
      }
      $headline = $line;
   }else {
      # Read the DNA
      chomp $line;
      $dna .= $line;
    
    }while ($buffer =~ /$fragment/gi){
         print $headline,"\n";
     print "Found $fragment at position", $position + $-[0] + 1,"\n";
         }
   
   $position = $position + length ($buffer) - $fraglen + 1;
   $buffer = substr ($buffer, length ($buffer) - $fraglen + 1, $fraglen - 1);
}





#just to check what there are in those:
print $headline, "\n";
print my $unimportant_characters, "\n";
print @character, "\n";
print my @panalty, "\n";

but to be honest with you I am so confused about my own script :(

this line is messing up your entire script (as it is anyway) :

my $unimportant_characters = $1 if $line = ~ m/^\*\s+(.+)/;

if you did some basic debugging by printing variables to the screen you would see the above line is returning a bizarre value to $line. The culprit is the gap between "= ~". There should be no gap: "=~". What you did was create an assignment to $line "=" which changed the value of $line to 4294967295. So the next line in your script is being evaluated like this for every line of the file:

my (4294967295, ) = split (' ',4294967295)

so @characters is populated with the value 4294967295 for each line of the file. What you end up with at the end of that block is:

@character = (4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295 4294967295)
@penalty = ()

instead of:

@character = (T T G A C A * T A T A AT T)
@penalty = (7 8 6 5 5 5 15-21 8 8 6 6 5 8)

I did not check the rest of your script. Correct the above syntax error "= ~" and retry your script.

Hey Kevin,

I cant thank you enough for your time and help...
I had correct the syntax above but it still not given me
@character = (T T G A C A * T A T A AT T)
@penalty = (7 8 6 5 5 5 15-21 8 8 6 6 5 8)

open(IN,'<',$signaldescription ) or die "Could not find file ($signaldescription) \n";
my @character = ();
my @penalty = ();
while (defined (my $line = <IN>)) { 
   chomp ($line);
   if ($line =~ m/^#/) {
      chomp ($line);
   #save the unimportant interval for later on
   my $unimportant_characters = $1 if $line =~ m/^\*\s+(.+)/;
      my ($character, $penalty) = split (' ',$line); 
      push  @character, $character;
      push  @penalty, $penalty;
      }
   }  
print @character;
print @penalty;


close IN;

I got instead
##shinePribnow

hmmm those character I wanna get rid of

while (defined (my $line = <IN>)) { 
   chomp ($line);
  [B] next if ($line =~ m/^#/);[/B]
   #save the unimportant interval for later on
   my $unimportant_characters = $1 if $line =~ m/^\*\s+(.+)/;
   my ($character, $penalty) = split (/\s+/,$line); 
   push  @character, $character;
   push  @penalty, $penalty;
}
Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.