Hello, I need help to fix a perl script that obtains a DNA file with a FASTA sequence and randomly shuffle the sequence while maintaining the overall distribution of bases. I must also randomly shuffle the sequence 10-20 times.
Then measure the similarity between the shuffled sequence versus the original sequence by scoring. (z-scores).

Scoring criteria:
If a purine was mutated to a purine, or a pyrimidine to a pyrimidine assign a value of +1 to that base pair. If a purine was mutated to a pyrimidine or vice versa assign a value of -1 to that base pair. If the base pair did not change assign 0 .e.g.

Here is what is giving me the issue in the script ( I think):
(@sequence contains the FASTA sequence by the way)

random_int(10, 20);
print "$x\n";


@sequence_shuffle = shufflearray(@sequence);

#subroutine that randomly shuffles the sequence in the file as many times as $x and takes the sequence as an argument.
sub shufflearray {
my (@sequence) = @_;

for (my $i = 0; $i < $length; $i++) {
my $j = $x;
my $tmp = $sequence[$i];
$sequence[$i] = $sequence[$j];
$sequence[$j] = $sequence[$i];
}
return "@sequence\n";

Recommended Answers

All 24 Replies

Please wrap your script (or the perl code you wish to show us) in [CODE]Your program goes here[/CODE] tags.

You say part of the script is giving you "the issue". Can you show us the issue, preferably by giving an example of an input sequence, an example of the output you expect and an example of the unsatisfactory output that you are getting.

What I presented earlier is what I have so far and the sequence is not shuffling. I think I have to use "rand" but am unsure on how to incorporate everything into my script.

All your scripts should include the following:

use strict;
use warnings;

These will give error messages or warnings if your script does things that could potentially result in unexpected results. Fix your script according to the error messages or warnings that the strict and warnings modules give.

For example, let's run the first two statements from what you posted:

#!/usr/bin/perl
use strict;
use warnings;

random_int(10, 20);
print "$x\n";

This gives the following output:

Global symbol "$x" requires explicit package name at /home/david/Programming/Perl/temp.pl line 6.
Execution of /home/david/Programming/Perl/temp.pl aborted due to compilation errors.

Also, you print $x but you don't show us any statement assigning a value to $x... so what is $x supposed to contain?

Thanks for the tip. Do I have to first create a subroutine to extract the FASTA sequence?

I would create the main outline of the logic first. If it needs to call a subroutine, you can create a dummy subroutine (sometimes called a 'stub') and modify it later after figuring out the details of how it will accomplish its task. A subroutine stub consists of the subroutine name plus logic to assign the arguments to its own variables, plus a comment to indicate it is just a stub and the real logic needs to be filled in later.

Your first task, I think, is to calculate the number of times you need to call the shufflearray subroutine and assign this number to a variable. If you haven't already figured out how to extract the FASTA sequence (whatever that is), start with a hardcoded sequence to test your subroutine for the first time. An example of a draft of your main logic might be something like this:

#!/usr/bin/perl
use strict;
use warnings;

#Create integer between 10 and 20
my $times_to_shuffle = 10 + int(rand(21));

#Dummy sequence for testing
my @sequence = qw(C G T T T G T A A A T T G C A T C A A G);
my @shuffled_sequence;

print "Shuffle sequence $times_to_shuffle times.\n";
foreach (1..$times_to_shuffle){
    @shuffled_sequence = shufflearray(\@sequence); #Pass reference to array as argument
    print "@shuffled_sequence\n";
}

sub shufflearray{
    #This sub is just a stub
    #Logic needed to randomly resequence @in an return as @out
    my @in = @{$_[0]}; #Dereference passed array ref
    
    my @out = @in;
    return @out;
}

What I posted above is not quite right. You don't want to start with the original sequence each time you shuffle. You want to replace the original sequence with the result of shuffling and use then shuffle the new sequence, and so on for a random number of repetitions.

Plus, you don't have to write your own shufflearray subroutine. Perl 5.7 and later comes with the List::Util module that you can import and use for its shuffle() function. The revised outline script would look like this.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle); #This module includes a method to shuffle arrays.

#Create integer between 10 and 20
my $times_to_shuffle = 10 + int(rand(21));

#Dummy sequence for testing
my @sequence = qw(C G T T T G T A A A T T G C A T C A A G);

print "Shuffle sequence $times_to_shuffle times.\n";
foreach (1..$times_to_shuffle){
    @sequence = shuffle(@sequence);
    print "@sequence\n";
}

This gives the following output:

Shuffle sequence 17 times.
T C A G A T A A T G A G T T T G C C T A
C T G A A G C T T A A T T G T C T A G A
G G G T C A C T T G A A A A T T C T T A
G A A G A C T G T C T C G T A T T T A A
G T G A A G A T T T T C G T A C T C A A
A G T C T A A G G T C A T T G T C A A T
T C T T G C C A A T A A T G T G G A A T
A T A C T G T A A G T G T A T C T A G C
A C T A G T T A T C A T T C G T G A G A
T T A C T C A G A T T A T G T C A G G A
G A G G C T A C A T T C A T A T T T G A
G A C C T T T G T G G A T A A A C T A T
G T T C A A G C A G T A T T A A T G C T
A C A A G T T A C T C G T T G T A A G T
A A T T T C T G C T A G A G A T C T G A
T C A A A G T T T G C T G A A C A T G T
G T T C A A A G T A G C T T A G A C T T

Thank you very much for the help. I am going over all of the elements and will have to add how to maintain the same amount of base distribution. Also, do you have any links on z-scores with Perl. I have been looking for some useful information to help me with my script but can not find it.

You're welcome. I really don't know anything about z-scores as I haven't studied biology for about 40 years.:) You could look for a list of modules on CPAN but I don't know which module would best suit your purpose. Maybe this one?

Thanks for the suggestion. I looked through it but I did not see anything. In order to get the same base distribution I was thinking that I had to use the srand expression. I would like to know if I am going in the right direction. Thank you

Thanks for the suggestion. I looked through it but I did not see anything. In order to get the same base distribution I was thinking that I had to use the srand expression. I would like to know if I am going in the right direction. Thank you

If you want rand to return the same sequence each time you run your program then using srand with a constant makes sense to me. If you don't explicitly call srand, it is called implicitly each time you run your program using time etc. as arguments so that shuffling can give you different results. (For most purposes, you do want shuffling to give unpredictable results.) I haven't tried it but that's what I read in http://perldoc.perl.org/functions/srand.html.

Okay I read through the document and I am starting to think that within my sub I need to incorporate the regex s//g. I think I should put in the purines and pyrimidines but after that I am unsure.

How about something like this. Make an array of strings representing the base pair combinations of original and shuffled sequences. Then assign a score to each base pair according to the rules which can be represented by a hash.

#!/usr/bin/perl
use strict;
use warnings;

#If a purine was mutated to a purine,
#or a pyrimidine to a pyrimidine assign a value of +1 to that base pair.
#If a purine was mutated to a pyrimidine or vice versa
#assign a value of -1 to that base pair.
#If the base pair did not change assign 0.

my %rules = (
            AA => 0,
            AG => +1,
            AT => -1,
            AC => -1,
            GA => +1,
            GG => 0,
            GT => -1,
            GC => -1,
            TA => -1,
            TG => -1,
            TT => 0,
            TC => +1,
            CA => -1,
            CG => -1,
            CT => +1,
            CC => 0
            );

#Dummy sequence for testing
my @original = qw(C G T T T G T A A A T T G C A T C A A G);
my @shuffled = qw(G T T C A A A G T A G C T T A G A C T T);

my @scores;
my @base_pairs = make_base_pairs(\@original, \@shuffled);
foreach my $bp (@base_pairs){
    push @scores, $rules{$bp}
}
print join("\t", @base_pairs), "\n";
print join("\t", @scores);
sub make_base_pairs{
    my @orig = @{$_[0]}; #de-reference input array
    my @shuf = @{$_[1]}; #de-reference input array
    my $idx = 0;
    my @bps;
    foreach my $base (@orig){
        push @bps, $base . $shuf[$idx];
        $idx++;
    }
    return @bps;
}

Gives the following output:

CG	GT	TT	TC	TA	GA	TA	AG	AT	AA	TG	TC	GT	CT	AA	TG	CA	AC	AT	GT
-1	-1	0	1	-1	1	-1	1	-1	0	-1	1	-1	1	0	-1	-1	-1	-1	-1

That is a great idea and thank you for all of your help.

My sequence is quite large so I was creating a shuffled sequence to the original sequence but I keep on getting a message that there is an "uninitialized value." My sequence has over 1,000 bases.

My sequence is quite large so I was creating a shuffled sequence to the original sequence but I keep on getting a message that there is an "uninitialized value." My sequence has over 1,000 bases.

Can you attach your sequence as a text file to your post? (See the "Manage Attachments" button.)

Can you attach your sequence as a text file to your post? (See the "Manage Attachments" button.)

No problem here it is.

I don't know what caused the error message you were getting but if you read the entire file into a string variable, make sure all letters are upper-case, remove all non-letter characters (including spaces and carriage-returns), and split the string into an array it should work OK.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle); #This module includes a method to shuffle arrays.

#Create integer between 10 and 20
my $times_to_shuffle = 10 + int(rand(11));
my $filename = 'Test_sequence.txt';
my $string = slurp_file($filename);
$string = uc($string); #Make sure all letters are upper case
$string =~ s/[^AGCT]//; #Remove all characters that are not A, G, C, or T

my @original = $string =~ m/[AGCT]/g; #Assign all bases to an array
my $base_count = @original; #Count elements in array
print "Number of bases in sequence is $base_count \n";

print "Shuffle sequence $times_to_shuffle times.\n";
my @shuffled = @original; #Copy original array to array to be shuffled.
foreach (1..$times_to_shuffle){
    @shuffled = shuffle(@shuffled);
}

print "@shuffled\n";

sub slurp_file {
    my $file = $_[0];
    local $/;
    open( FH, '<', $file ) or die "Could not open $file ... $!";
    my $text = <FH>;
    return $text;
}

Gives the following output:

Number of bases in sequence is 829 
Shuffle sequence 11 times.
G T A A G A T A A T A A A A G T G T T G T A G C A A G G A T T A G T A T A T A C G C C T C T C T G T T C A A C C C G C C C A C C A A A T C T G G A T T C G G A C T A T A T A T G G C T C A A G G G T T G G T C G A G A A T C G A C G T C A A G T T A G G C T A T A A A G C G G C G T G A G T C G G A G A T G T T C C G G C T T T C A G A G T C T A T C T A A A T T G T T G T T C G G A G G T G A C T A T A G G G G A C T C G G A A T C G T C T C C G G A A G A G A G T G T T C C T T C T A C T G C A G G C T A C T T A A A C T C T T A C T A T A A T G T A A C G C T A C C A T T C G C T G C A G T G T T G T C G C G C C A T T C C T T T G T G T A A G A T T C C T T A G G A C C A C C A C A C A A G G T G G T A G T A T A T A A A C A G G A G G A T A C A A A T T A T G C A A G G C G T T C A C A C C C G A A T T G G C T C C A A C A T G T G T A C T A C C A G A A G C T A T C C C G A C A T C T T T T C G A T A G A T T G G T T A C C C A A T A C A T T G C A G C C A T T T G C C A G A T G T T G G T T A C C A G A G A C T G G T T A A G G T G A G C C T A G G C A A A G A T C G A A T C A T G C C G G A A C C T A C T G A T C T A T G A T A G T T A G C A A A T A T G A T G G T T G A C G A G T A T G T A T A A C T T C G C G G A G T A T T A C A G G A A A A G C G A A A A G A C T G T G T T C C A T A A A T T A A T G A A C G T C T G A G G A A T T C C C T C G T G C C T T A G C C A T T C G T T G T G T T A T G G T A G T A T A A T C T C A T C G A A T G C A A G C G A T C G C T G A A C A C G G T G C A G T G T C T G C G T T A G T C T T G T T A G T A G T A A T G G A C A T A C G T A T T A A A A G C T G A C C G C G A G A C G A C C T T A T G A T T A C T T T T G G T A C A A G C T A G G T G T A G A A G A

That was a great module. I see what I was missing in my script was that I had to mutate my sequence and then randomly shuffle it between 10 and 20 times while still maintaining the same base pair distribution. I made the mistake of forgetting about the mutation. I am still a beginner and have trouble of incorporating different elements into one program. I know now that I must use srand expression.

I don't know what caused the error message you were getting but if you read the entire file into a string variable, make sure all letters are upper-case, remove all non-letter characters (including spaces and carriage-returns), and split the string into an array it should work OK.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle); #This module includes a method to shuffle arrays.

#Create integer between 10 and 20
my $times_to_shuffle = 10 + int(rand(11));
my $filename = 'Test_sequence.txt';
my $string = slurp_file($filename);
$string = uc($string); #Make sure all letters are upper case
$string =~ s/[^AGCT]//; #Remove all characters that are not A, G, C, or T

my @original = $string =~ m/[AGCT]/g; #Assign all bases to an array
my $base_count = @original; #Count elements in array
print "Number of bases in sequence is $base_count \n";

print "Shuffle sequence $times_to_shuffle times.\n";
my @shuffled = @original; #Copy original array to array to be shuffled.
foreach (1..$times_to_shuffle){
    @shuffled = shuffle(@shuffled);
}

print "@shuffled\n";

sub slurp_file {
    my $file = $_[0];
    local $/;
    open( FH, '<', $file ) or die "Could not open $file ... $!";
    my $text = <FH>;
    return $text;
}

Gives the following output:

Number of bases in sequence is 829 
Shuffle sequence 11 times.
G T A A G A T A A T A A A A G T G T T G T A G C A A G G A T T A G T A T A T A C G C C T C T C T G T T C A A C C C G C C C A C C A A A T C T G G A T T C G G A C T A T A T A T G G C T C A A G G G T T G G T C G A G A A T C G A C G T C A A G T T A G G C T A T A A A G C G G C G T G A G T C G G A G A T G T T C C G G C T T T C A G A G T C T A T C T A A A T T G T T G T T C G G A G G T G A C T A T A G G G G A C T C G G A A T C G T C T C C G G A A G A G A G T G T T C C T T C T A C T G C A G G C T A C T T A A A C T C T T A C T A T A A T G T A A C G C T A C C A T T C G C T G C A G T G T T G T C G C G C C A T T C C T T T G T G T A A G A T T C C T T A G G A C C A C C A C A C A A G G T G G T A G T A T A T A A A C A G G A G G A T A C A A A T T A T G C A A G G C G T T C A C A C C C G A A T T G G C T C C A A C A T G T G T A C T A C C A G A A G C T A T C C C G A C A T C T T T T C G A T A G A T T G G T T A C C C A A T A C A T T G C A G C C A T T T G C C A G A T G T T G G T T A C C A G A G A C T G G T T A A G G T G A G C C T A G G C A A A G A T C G A A T C A T G C C G G A A C C T A C T G A T C T A T G A T A G T T A G C A A A T A T G A T G G T T G A C G A G T A T G T A T A A C T T C G C G G A G T A T T A C A G G A A A A G C G A A A A G A C T G T G T T C C A T A A A T T A A T G A A C G T C T G A G G A A T T C C C T C G T G C C T T A G C C A T T C G T T G T G T T A T G G T A G T A T A A T C T C A T C G A A T G C A A G C G A T C G C T G A A C A C G G T G C A G T G T C T G C G T T A G T C T T G T T A G T A G T A A T G G A C A T A C G T A T T A A A A G C T G A C C G C G A G A C G A C C T T A T G A T T A C T T T T G G T A C A A G C T A G G T G T A G A A G A

I have tried numerous times to incorporate all aspects of my script but I am only getting it to shuffle but not the random shuffle and not the mutation. Can you tell me what order I need to use with my subroutines in order to mutate the DNA sequence and then perform the random shuffle and then calculate the z score? I first put the srand expression and then my sequence and then a subroutine to shuffle the sequence. I don't understand how I am supposed to compare the original sequence and the mutated sequence. Thank you for all your help

I have tried numerous times to incorporate all aspects of my script but I am only getting it to shuffle but not the random shuffle and not the mutation. Can you tell me what order I need to use with my subroutines in order to mutate the DNA sequence and then perform the random shuffle and then calculate the z score? I first put the srand expression and then my sequence and then a subroutine to shuffle the sequence. I don't understand how I am supposed to compare the original sequence and the mutated sequence. Thank you for all your help

The script you quoted reads a file into an array called @original, copies @original to @shuffled and then shuffles @shuffled the required number of times (between 10 and 20). You can add logic to the script you quoted to define the rules for assigning z-scores to base pairs constructed from @original and @shuffled. This logic is demonstrated in the script in the post at http://www.daniweb.com/forums/post1380986.html#post1380986. At the end you will have an array of z-scores called @scores. The first score in @scores is determined by the first base-pair in @base-pairs, and so forth. I don't know what you want to do with the @scores array. Maybe just print it? I don't know what you mean when you say the shuffle is not a random shuffle.

The script you quoted reads a file into an array called @original, copies @original to @shuffled and then shuffles @shuffled the required number of times (between 10 and 20). You can add logic to the script you quoted to define the rules for assigning z-scores to base pairs constructed from @original and @shuffled. This logic is demonstrated in the script in the post at http://www.daniweb.com/forums/post1380986.html#post1380986. At the end you will have an array of z-scores called @scores. The first score in @scores is determined by the first base-pair in @base-pairs, and so forth. I don't know what you want to do with the @scores array. Maybe just print it? I don't know what you mean when you say the shuffle is not a random shuffle.

I don't know what I am doing wrong but it seems that everytime I get one element, another issue arises. First, I used a script to mutate the sequence and keep the same base pair distribution. I must then randomly mutate the sequence 10 to 20 times and find the similarity between the mutated and original DNA according to the scores in the previous script that you mentioned. Here is what I have. If you don't mind can you let me know what I am doing wrong? Thank you

[my $sequence='A G G G C A C C T C T C A G T T C T C A T T C T A A C A C C A C
A T A A T T T T T A T T T G T A T T A T T C A G A T T T T T C A T G A A C T T T
T C C A C A T A G A A T G A A G T T G A C A T T G T T A T T T C T C A G G G T C
T C G G T T C A C C A G T A T T T G A C A A A C T T G A A G C T G A A C T A G C
T A A A G C T G C T A T G T C A T T G C C T G C A A C C A A G G G C T T T C A G
T T T G G T A G T G G G T T T G C A G G C A C C T T T T T G A C T G G G A G T G
A A C A C A A T G A T G A G T T C T A T A T A G A T G A A C A T G G A A A C A C
A A G A A C A A G A A C A A A T C G C T C T G G T G G G A T A C A G G G T G G A
A T T T C C A A T G G G G A A A T C A T T A A T A T G A G A A T A G C T T T C A
A G C C A A C A T C A A C A A T T G G A A A G A A G C A A A A T A C T G T G A C
T C G A G A T A A A A A A G A A A C A G A G T T T A T A G C C C G T G G T C G C
C A T G A T C C T T G T G T T G T C C C A A G A G C T G T A C C T A T G G T A G
A G G C A A T G G T A G C T T T G G T T C T T G T G G A C C A A T T G A T G G C
A C A A T A T G C G C A G T G T A A T C T G T T T C C C G T A A A C T C A G A T
T T G C A A G A A C C C C T G G T G C C A T A T T A C G G C C A G A A G A A G C
G C T T C T C T G A A G A T G A A G G G G G T G C A T A A A C C A G T A A T T G
G C C T T G G A T A A A A C C T T C C T T C T G G C T A G T G G T T A G T T G G
C A G G T A A T T C A C T T T G A T G A C A G G T C C A A G T T C C C C C C T T
G C G A G C A T T C G G T C C T G G C C G C A T G G G G G C A G A A A T T T T T
G G C G G G G G C A A C C T A T C G G T T T G G A T T A A A G G A A G G T A A A
C C A A G G G A A T T C C T C C C C C C C A A T T C T T T T G G G G G G A';]
[my $i;]
[my $mutant]
[srand(time|$$);
[$mutant = mutate_DNA($sequence);]
print "\nMutate DNA\n\n";
print "$sequence\n";
print "$mutant\n";

exit;

sub mutate_DNA{
my($dna) = @_;
my (@nucleotides)=('A', 'C', 'G', 'T');
# Pick a random position in the DNA
my($position) = randomposition($dna);
#Pick a random nucleotide
my ($newbase)=randomnucleotide(@nucleotides);
substr ($dna, $position, 1, $newbase);
return $dna;
}

sub randomelement{
    my(@array) = @_;
    return $array[rand @array];
  }

sub randomnucleotide{
    my(@nucs) = ('A','C','G','T');
    return $nucs[rand @nucs];
  }
sub randomposition{
    my($string) = @_;
    return int (rand(length($string)));
  }

Why do you have square brackets around some of your statements? Once I removed the square brackets it seemed to run OK. It takes a long string of bases and modifies one of the bases at a random position. I don't understand how to use the array of scores either. If you were to generate an array of scores for the sequence compared to the mutant, you would end up with an array of all zero scores except possibly one non-zero element, because all the bases are the same except one. But I don't understand what the scores mean or how they are used.

Why do you have square brackets around some of your statements? Once I removed the square brackets it seemed to run OK. It takes a long string of bases and modifies one of the bases at a random position. I don't understand how to use the array of scores either. If you were to generate an array of scores for the sequence compared to the mutant, you would end up with an array of all zero scores except possibly one non-zero element, because all the bases are the same except one. But I don't understand what the scores mean or how they are used.

Sorry about the square brackets this was my first time using the wrap code feature on this site. I thank you for all of the help and will try to see if the previous code you made can help me with the calculation of the scores.

Sorry about the square brackets this was my first time using the wrap code feature on this site. I thank you for all of the help and will try to see if the previous code you made can help me with the calculation of the scores.

Assuming the script you posted mutates a string of bases by changing one letter at one position, couldn't you calculate the scores as follows?

#!/usr/bin/perl
#score_mutant.pl
use strict;
use warnings;

my $sequence = 'AGCT'; #Short string (Should work for long strings too.)
my   $mutant = 'AGAT'; #Copy of above string except C has mutated to A.

print "Sequence,Mutant,Score\n";
foreach(0 .. length($sequence) - 1){
    my $s = substr($sequence, $_, 1);
    my $m = substr($mutant, $_, 1);
    my $score = determine_score($s, $m);
    print "$s,$m,$score\n";
}

sub determine_score{
    my ($alpha, $beta) = sort @_; #Sort two base args in alphabetical order
    
    #If the base pair did not change assign 0.
    return 0 if $alpha eq $beta;
    
    #If a purine was mutated to a purine,
    #or a pyrimidine to a pyrimidine assign a value of +1 to that base pair.
    #If a purine was mutated to a pyrimidine or vice versa
    #assign a value of -1 to that base pair.
    my %rules;
    $rules{'A'}{'G'} = +1;
    $rules{'A'}{'T'} = -1;
    $rules{'A'}{'C'} = -1;
    $rules{'G'}{'T'} = -1;
    $rules{'C'}{'G'} = -1;
    $rules{'C'}{'T'} = +1;
    
    return $rules{$alpha}{$beta};
}

This gives the following output:

Sequence,Mutant,Score
A,A,0
G,G,0
C,A,-1
T,T,0
Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.