954,525 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?
Have something to say? Contribute New Article Reply to this Article

have some problems with pattern match hope you can help!!

Hi there...

I am working one a perl script program and hope someone can help me , I'll give you a quick description of my project:

My program is given a fasta file, a signal description and a deviation (a number) as input om the command line.
A fasta file look like this :
>U00659.CDS.1 product:"insulin GGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
AAAGACCAGACGGAGATGATGGTAAAGAGAGGTATTGTAGA
>X13559.CDS.1 product:"preproinsulin " DNA org:"Oncorhynchus keta" (CDS extraction)
ATGGCCTTCTGGCTCCAAGCTGCATCTCTGCTGGTGTTGCTGGCGCTCTCCCCCGGGGTA
GATGCTGCAGCTGCCCAGCACCTGTGTGGCTCTCACCTGGTGGACGCCCTCTATCTGGTG
TGTGGAGAGAAAGGATT
>J02989.CDS.1 note:"preproinsulin " DNA org:"Aotus trivirgatus" (CDS extraction)
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTC
The first line of a FASTA file is a header and begins with '>', thise line should be ignored
the main thing is the sequence (ATCGCGCTATA)hoe i want to match..

A Signal description file is a text file that look like this:

# Shine-Delgarno
T 7
T 8
G 6
A 5
C 5
A 5
# intervening unimportant bases
* 15-21
# Pribnow box
T 8
A 8
T 6
A 6
AT 5
T 8

1) one or more allowed letters at this position and a penalty
for having a mismatch at that position.

2) the star character * denoting unimportant characters in the sequence and an interval where these
unimportant characters are allowed.

3) the hash character # meaning this line is a comment, and should be ignored by the program.


Okay now to the main thing, the output should list all matches in each fasta entry, clearly stating the location of the match.

The deviation is an important factor. If the deviation is set to 0, then it should search for the signal
is reduced to a regular expression. If the deviation is set to 16 in the above example,
then mismatches with the combined penalty of 16 or less are allowed.

I have try this so far but i cant figure out how to used tha deviation number and set the patternmatch I am pretty lost:


#!/usr/bin/perl -w

use strict;
#############
# Step 1 #
#############
#The program is given fasta file, a signal description file and a deviaton number as input on the command line comments if there are erros:
#Erros: be sured that deviation is a number


sub usage {
my ($msg) = @_;
print "$msg\n\n" if defined $msg;
print "Usage: project.pl \n";
exit;
}
if (scalar @ARGV !=3){
&usage("Wrong number of arguments");
}

my ($fastafile, $signaldescription, $deviation) = @ARGV ;

if ($deviation =~ m/^\d+$/){ #correct input
print "Thanks!\n";
}else{
&usage ("I want a number please!");
}

################
# Step 2 #
################
# working with signal description:
#read the file and insure to put penalty and character in two seperate arrays,
#the # should be ignored
#the * unimportant sequence and should be ignored at position 15 -21 (have figure that yet):

open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
my $comment ='';
while (defined (my $line = )) {
chomp ($line);
if ($line =~ m/^#/) {
if ($comment ne ''){
my ($character, $penalty) = split (' ',$line);
push @character, $character;
push @penalty, $penalty;
}
}
}

close IN;


############
# Step 3 #
############
#work with fasta file:
# Use regular Expresions to look at the fasta file and ignore the first line:


# $fragment: the pattern to search for
# $fraglen: the length of $fragment
# $buffer: a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA

my($fragment, $fraglen, $buffer, $position) = (@karaktere, '', 0);

my ($headline, $line, $dna) = ('', '', '');

open(IN, '>', $fastafilename) or die "Could not read file ($fastafile)\n";

# The first line of a FASTA file is a header and begins with '>'

while (defined ($line = )) {
if ($line =~ m/^>/) {
if ($headline ne '') { #after the sequence is readed i wanna look for the match

#write data to file (the matches):
chomp $headline;
print OUT "$headline\n";
for (my $i = 0; $i < length($reversecomplementdna); $i += 60) {
print OUT substr($reversecomplementdna, $i, 60), "\n";
}
# Get ready for next turn in the loop
$dna = '';+
}
$headline = $line;
}
else {
# Read the DNA
chomp $line;
$dna .= $line;
}
}
#########################

Thanks alot for your time, i really apriciet your time and if you can help me....

thanxxxx
MojoS

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

Is this school work?

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

Yes its one of my project that I am working on at my uni, and i will appreciate if someone can help me out by giving me some comments and ideas on how to work it out, because Iam pretty lost...

Thanxx

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

First thing you have to do is fix the errors:

Global symbol "@karaktere" requires explicit package name at script line 68.
Global symbol "$fastafilename" requires explicit package name at script line 72.
Global symbol "$reversecomplementdna" requires explicit package name at script line 83.
Global symbol "$reversecomplementdna" requires explicit package name at script line 84.
syntax error at script line 88, near "}"

How come you are not getting help from a teacher or fellow student?

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

Thanks for looking throught it; unfortunately my professor is not always available and it would take some time before I could go on with this without any advice(because I'm stuck). As I'm doing this project on my own I don't have any fellowstudents to aks or exchange views with.
But thanks anyway

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

OK, well, fix those errors otherwise your code will not even compile.

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

okay I have fix it thanks!

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

Okay but can you advice me how to work with pattern match, if i want to ignore some character at a specific position.....

signaldescription:
# Shine-Delgarno
T 7
T 8
G 6
A 5
C 5
A 5
* 15-21 # intervening unimportant bases
# Pribnow box
T 8
A 8
T 6
A 6
AT 5
T 8


open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
my $comment ='';
while (defined (my $line = )) {
chomp ($line);
if ($line =~ m/^#/) {
if ($comment ne ''){
my ($character, $penalty) = split (' ',$line);
push @character, $character;
push @penalty, $penalty;
}
}
}

close IN;


Here I have tried to save the penalty that contain the characters in an array and the penalties that contain number in another array but I cant figure out how to perlscript my program to skip a certain position when it meet the character *...

thanks..

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

If have the time I might try and help. Your question, or questions, are really more than asking for general help and will take some time to try and assist you with.

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

what is $comment being used for?

my $comment ='';


further on you have:

if ($comment ne ''){


but $comment is blank ('') so that condition is always false so the expressions that follow are never evaluated.

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

Okay I see, but thanks alot for you help I appriciet your time really....
The hash character # is indicating that this line is a comment, and should be ignored by the program. thats why I have my $comment ='';....
about if ($comment ne ''){ isn't that the why to ignore someting after you have readed it !!!!

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

I am confused by what you said.

($comment ne '') is not a way to ignore something after you read it, but I am not sure what you mean by "after you read[ed] it".

The past tense of "read" is spelled "read" in english, not readed, the pronunciation is different though.

If you wanted to ignore lines that start with "#" in a loop:

next if (/^#/);

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

hmm okay lat me explained it in another way...I have this data file containing some characters, what I want is to use my txt file (so called signaldescription (what I wanna match with)) to finde where the matches is (position) and by that I have to read both files... isnt that right
you're probably more confused now :(

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

You will have to read both files. But lets look at his part of the code you posted:

open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
my $comment ='';
while (defined (my $line = <IN>)) {
   chomp ($line);
   if ($line =~ m/^#/) {
      if ($comment ne ''){
         my ($character, $penalty) = split (' ',$line);
         push @character, $character;
         push @penalty, $penalty;
      }
   }
}


$comment is defined as '' (an empty string). Then in the code you check to see if $comment isnot equal to an empty string, but of course it is, so the next three lines (the expressions) never are evaluated. If the intention is to skip comments

open(IN,'<',$signaldescription ) or die "Could not find file\n";
my @character = ();
my @penalty = ();
while (defined (my $line = <IN>)) {
   next if $line =~ m/^#/;
   chomp ($line);
   my ($character, $penalty) = split (' ',$line);
   push @character, $character;
   push @penalty, $penalty;
}
KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

HMM for example the signaldescription file is given, the data file is given and a deviation nr...

signaldescription file:
#ONKLE PHIL
A 3
T 4
G 3
* 4-6 #charcters at pos 4 to 6 should be skipt
#ONKLE TOM
T 5
G 7
C 9

and the data file look like that:
ATGTGTAGTTGCATG

the output should then be:
the program found ONKLE PHIL (ATG) at position 1 and 13 and ONKLE TOM (TGC) at position 9... hmm now the numbers beside the letters are penalty whenever its not founding an A at position one its get a penalty of 3 and the secound position if its not a T it gets a 4 if my diviation nr is put on 5 it will ignore all the characters we have look for untill now because 7 is bigger than 5, after that we get to the third letter etc...!!!!!!
hmm hope its not a headache for u to read...
thanks

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

yeah okay I see, a stupid question : is my comment then saved if I wanna used it later!!!

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 
yeah okay I see, a stupid question : is my comment then saved if I wanna used it later!!!

No. You would have to save lines that are comments to a variable or an array/list.

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 
the output should then be: the program found ONKLE PHIL (ATG) at position 1 and 13 and ONKLE TOM (TGC) at position 9... hmm now the numbers beside the letters are penalty whenever its not founding an A at position one its get a penalty of 3 and the secound position if its not a T it gets a 4 if my diviation nr is put on 5 it will ignore all the characters we have look for untill now because 7 is bigger than 5, after that we get to the third letter etc...!!!!!!

This is all quite confusing to me. Are you aware there are many modules already written to work on this kind of data, and there is also bioperl. I think it's www.bioperl.org

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

thanx for the web ill chek it out..

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

But thanks anyways for your help and your time... ill see how far ill get with it...

MojoS
Light Poster
31 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

This article has been dead for over three months

Post: Markdown Syntax: Formatting Help
You