Problems with pattern matching in Perl, hope you can help!

Question

MojoS 0 Light Poster

18 Years Ago

Hi there...

I am working on a Perl script and hope someone can help me. Here is a quick description of my project:

My program is given a FASTA file, a signal description file and a deviation (a number) as input on the command line.
A FASTA file looks like this:

>U00659.CDS.1  product:"insulin GGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
AAAGACCAGACGGAGATGATGGTAAAGAGAGGTATTGTAGA
>X13559.CDS.1  product:"preproinsulin " DNA org:"Oncorhynchus keta" (CDS extraction)
ATGGCCTTCTGGCTCCAAGCTGCATCTCTGCTGGTGTTGCTGGCGCTCTCCCCCGGGGTA
GATGCTGCAGCTGCCCAGCACCTGTGTGGCTCTCACCTGGTGGACGCCCTCTATCTGGTG
TGTGGAGAGAAAGGATT
>J02989.CDS.1  note:"preproinsulin " DNA org:"Aotus trivirgatus" (CDS extraction)
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTC

The first line of each FASTA entry is a header and begins with >; this line should be ignored.
The main thing is the sequence (e.g. ATCGCGCTATA), which is what I want to match against.
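A minimal sketch of reading such a file entry by entry, joining the wrapped sequence lines and keeping the header apart from the sequence. The sub name `read_fasta` and the demo data are assumptions made up for illustration; the demo reads from an in-memory string instead of a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read FASTA records from an open filehandle.
# Returns a list of [header, sequence] pairs; the leading '>' is dropped.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    my ($header, $seq);
    while (defined(my $line = <$fh>)) {
        chomp $line;
        next if $line =~ /^\s*$/;                  # skip blank lines
        if ($line =~ /^>(.*)/) {
            # A new header closes the previous entry, if there was one
            push @records, [$header, $seq] if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $seq .= uc $line;                      # join wrapped sequence lines
        }
    }
    # Do not lose the last entry: no header follows it
    push @records, [$header, $seq] if defined $header;
    return @records;
}

# Demonstration on in-memory data (hypothetical, not from the real file)
my $demo = ">seq1 first\nACGT\nTTAA\n>seq2 second\nGGCC\n";
open(my $fh, '<', \$demo) or die "Cannot open in-memory file: $!";
my @records = read_fasta($fh);
close $fh;

for my $record (@records) {
    my ($name, $dna) = @$record;
    print "$name: ", length($dna), " bases\n";
}
```

Handling the final entry after the loop matters, because the last sequence in the file is not followed by another header line.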

A signal description file is a text file that looks like this:

# Shine-Delgarno
T    7
T    8
G    6
A    5
C    5
A    5
# intervening unimportant bases
*    15-21
# Pribnow box
T    8
A    8
T    6
A    6
AT    5
T    8

Each line of the signal description is one of three kinds:

1) one or more allowed letters at this position, followed by the penalty
for having a mismatch at that position.

2) the star character, denoting unimportant characters in the sequence, followed by an interval
giving how many of these unimportant characters are allowed.

3) a line starting with the hash character, which is a comment and should be ignored by the program.
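The three kinds of lines could be parsed along these lines. This is only a sketch: the sub name `parse_signal` and the hash layout (`type`, `allowed`, `penalty`, `min`, `max`) are assumptions chosen for illustration, not something the assignment prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse signal description lines into a list of elements:
#   { type => 'base', allowed => 'AT', penalty => 5 }
#   { type => 'gap',  min => 15, max => 21 }
sub parse_signal {
    my (@lines) = @_;
    my @signal;
    for my $line (@lines) {
        next if $line =~ /^\s*#/;       # kind 3: comment line
        next if $line =~ /^\s*$/;       # blank line
        my ($what, $value) = split ' ', $line;
        die "Incomplete line: $line\n" unless defined $value;
        if ($what eq '*') {
            # kind 2: unimportant bases with an allowed interval
            my ($min, $max) = $value =~ /^(\d+)-(\d+)$/
                or die "Bad interval in line: $line\n";
            push @signal, { type => 'gap', min => $min, max => $max };
        }
        else {
            # kind 1: allowed letters and the penalty for a mismatch
            $value =~ /^\d+$/ or die "Bad penalty in line: $line\n";
            push @signal, { type => 'base', allowed => uc $what, penalty => $value };
        }
    }
    return @signal;
}

# Demonstration on a few lines taken from the example above
my @signal = parse_signal(
    '# Pribnow box', 'T    8', 'AT    5', '# filler', '*    15-21',
);
for my $el (@signal) {
    my $text = $el->{type} eq 'gap'
        ? "gap $el->{min}-$el->{max}"
        : "base [$el->{allowed}] penalty $el->{penalty}";
    print "$text\n";
}
```

Keeping each position as one hash, rather than two parallel arrays, makes it easier to treat the star line differently from ordinary positions.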

Okay, now to the main thing: the output should list all matches in each FASTA entry, clearly stating the location of the match.

The deviation is an important factor. If the deviation is set to 0, then the search for the signal
reduces to a regular expression match. If the deviation is set to 16 in the above example,
then mismatches with a combined penalty of 16 or less are allowed.
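One possible way to use the deviation is to treat it as a penalty budget: walk the signal position by position from every start offset, pay the penalty on each mismatch, and give up as soon as the budget is exceeded. The sketch below assumes the signal is already held as a list of hashes with `type`, `allowed`, `penalty`, `min` and `max` keys. That layout, the sub names and the short demo signal are all assumptions for illustration. It tries every allowed gap length and keeps the cheapest fit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Match signal elements from index $i on against $seq at offset $pos with
# $budget penalty points left. Returns (end offset, penalty used) for the
# cheapest fit, or an empty list if nothing fits within the budget.
sub match_from {
    my ($seq, $signal, $i, $pos, $budget) = @_;
    return ($pos, 0) if $i == @$signal;            # every element consumed
    my $el = $signal->[$i];
    if ($el->{type} eq 'gap') {
        my $best;
        for my $len ($el->{min} .. $el->{max}) {   # try each allowed gap length
            last if $pos + $len > length($seq);
            my ($end, $cost) = match_from($seq, $signal, $i + 1, $pos + $len, $budget)
                or next;
            $best = [$end, $cost] if !$best || $cost < $best->[1];
        }
        return $best ? @$best : ();
    }
    return () if $pos >= length($seq);
    my $base = substr($seq, $pos, 1);
    my $cost = index($el->{allowed}, $base) >= 0 ? 0 : $el->{penalty};
    return () if $cost > $budget;                  # too many mismatches
    my ($end, $rest) = match_from($seq, $signal, $i + 1, $pos + 1, $budget - $cost)
        or return ();
    return ($end, $cost + $rest);
}

# Report every start position (1-based) where the signal fits within the deviation.
sub find_matches {
    my ($seq, $signal, $deviation) = @_;
    my @hits;
    for my $start (0 .. length($seq) - 1) {
        my ($end, $cost) = match_from($seq, $signal, 0, $start, $deviation) or next;
        push @hits, { start => $start + 1, end => $end, penalty => $cost };
    }
    return @hits;
}

# With a deviation of 0 the signal is equivalent to a plain regular expression.
sub signal_to_regex {
    my ($signal) = @_;
    my $regex = '';
    for my $el (@$signal) {
        $regex .= $el->{type} eq 'gap'
            ? sprintf('.{%d,%d}', $el->{min}, $el->{max})
            : "[$el->{allowed}]";
    }
    return $regex;
}

# Hypothetical short signal: T, A, one or two unimportant bases, then G
my @signal = (
    { type => 'base', allowed => 'T', penalty => 8 },
    { type => 'base', allowed => 'A', penalty => 8 },
    { type => 'gap',  min => 1, max => 2 },
    { type => 'base', allowed => 'G', penalty => 6 },
);
my $seq = 'CCTACCGTT';
for my $deviation (0, 16) {
    for my $hit (find_matches($seq, \@signal, $deviation)) {
        print "deviation $deviation: match at $hit->{start}-$hit->{end}, penalty $hit->{penalty}\n";
    }
}
print "deviation 0 as a regex: ", signal_to_regex(\@signal), "\n";
```

Trying every start offset and every gap length is slow on long sequences, but it keeps the penalty logic easy to follow and to check by hand.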

I have tried this so far, but I can't figure out how to use the deviation number and set up the pattern match. I am pretty lost:

#!/usr/bin/perl -w


use strict;
#############
#  Step 1   #
#############
#The program is given a FASTA file, a signal description file and a deviation number as input on the command line.
#Errors are reported if the arguments are wrong:
#make sure that the deviation is a number



sub usage {
    my ($msg) = @_;
    print "$msg\n\n" if defined $msg;
    print "Usage: project.pl <fastafile.fsa> <signaldescriptionfile.txt> <deviation>\n";
    exit;
}
if (scalar @ARGV != 3) {
    usage("Wrong number of arguments");
}


my ($fastafile, $signaldescription, $deviation) = @ARGV;


if ($deviation =~ m/^\d+$/) { #correct input
    print "Thanks!\n";
} else {
    usage("I want a number please!");
}


################
#    Step 2    #
################
# Working with the signal description:
# read the file and put the characters and penalties in two separate arrays.
# Lines starting with # are comments and are ignored.
# The * line marks unimportant bases; its second column is the allowed
# interval (15-21) instead of a penalty (I have not figured out how to handle that yet):


open(IN, '<', $signaldescription) or die "Could not open file ($signaldescription): $!\n";
my @character = ();
my @penalty = ();
while (defined (my $line = <IN>)) {
    chomp ($line);
    next if $line =~ m/^#/;       # skip comment lines
    next if $line =~ m/^\s*$/;    # skip blank lines
    my ($character, $penalty) = split (' ', $line);
    push @character, $character;
    push @penalty, $penalty;
}



close IN;



############
#  Step 3  #
############
# Work with the FASTA file:
# use regular expressions to look at the FASTA file and skip the header lines:



# $fragment: the pattern to search for (not built yet)
# $fraglen:  the length of $fragment
# $buffer:   a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA


my ($fragment, $fraglen, $buffer, $position) = ('', 0, '', 0);


my ($headline, $line, $dna) = ('', '', '');


# Open for reading ('<'); opening with '>' would overwrite the FASTA file
open(IN, '<', $fastafile) or die "Could not read file ($fastafile): $!\n";


# The first line of each FASTA entry is a header and begins with '>'


while (defined ($line = <IN>)) {
    if ($line =~ m/^>/) {
        if ($headline ne '') {   # a whole sequence has been read; this is where I want to look for the match


            # write out the data (this should become the matches):
            chomp $headline;
            print "$headline\n";
            for (my $i = 0; $i < length($dna); $i += 60) {
                print substr($dna, $i, 60), "\n";
            }
            # Get ready for next turn in the loop
            $dna = '';
        }
        $headline = $line;
    }
    else {
        # Read the DNA
        chomp $line;
        $dna .= $line;
    }
}
close IN;
# NOTE: the last entry is still in $headline and $dna here and has not been handled yet
#########################

Thanks a lot for your time, I would really appreciate any help you can give me.

thanxxxx
MojoS

perl

Edited 12 Years Ago by happygeek because: fixed formatting
