d5e5 109 Master Poster

wooh it is wonderful. Thanks d5e5!
Could you show me the mean of two sentence and why we should use that ?
$name =~ m/(\d+)\s(\w+\d\d)/;
$key = lc("$1_$2");
I just know
\d+ : Matches a digit [0-9].
\w+ : matches an anpha chater
\s: Matches a whitespace character
Thank you so much.

$name =~ m/(\d+)\s(\w+\d\d)/; means that when the text in $name contains a string of one or more digits followed by a space followed by one or more alphanumeric characters followed by exactly two digits we want the first string of digits saved in $1 and the alphanumeric string and the following two digits saved in $2.

Because we don't want to use all the text in $name, we capture the desired text by means of parentheses in the regex pattern. When a match occurs, the value corresponding to the pattern in the first parentheses is captured into the special Perl variable $1 and the value corresponding to the pattern in the second parentheses is captured into $2. We want the key in the hash to be the same as what we will read in the data1.txt so we put $1 and $2 together in a string with an underscore _ between them. See Extracting Matches.

\d matches one digit.
\d+ matches one or more digits (See Matching Repetitions.
"\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also …

d5e5 109 Master Poster
#!/usr/bin/perl
use strict; 
use warnings; 

my %aas; #Hash to store amino acids

read_amino_acids('data2.txt');

read_positions('data1.txt');

sub read_amino_acids{
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Failed to open $filename: $!";
    my ($name, $key);
    while (<$fh>){
        s/\s+$//; #Remove end-of-line characters
        my @flds = split /\|/;
        if (@flds > 1){
            $name = $flds[4];
            $name =~ m/(\d+)\s(\w+\d\d)/;
            $key = lc("$1_$2");
            undef $aas{$key} unless exists $aas{$key};
        }
        else{
            $aas{$key} .= $_;
        }
    }
}

sub read_positions{
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Failed to open $filename: $!";
    
    print "Name               posi          amino acid\n";

    while (<$fh>){
        s/\s+$//; #Remove end-of-line characters
        my ($name, $pos) = split;
        next unless $name =~ m/^\d+_/;
        my $tuple = substr $aas{$name}, $pos - 1, 3;
        printf "%s%7d%16s\n", ($name,$pos,$tuple);
    }
}
d5e5 109 Master Poster

The Dot Matches (Almost) Any Character

Also have a look at Repetition with Star and Plus and after reading the first part, scroll down to the paragraph about Laziness vs. Greediness which explains what the ? character does in this context.

d5e5 109 Master Poster

The value of whatever matches what is in the first pair of parentheses goes into $1, and the value of whatever matches what is in the second pair of parentheses goes into $2. Here's an example of capturing matches into the $1 and $2 variables.

#!/usr/bin/perl;
use strict;
use warnings;

my $line = "Hello Harry, how is your house?";

$line =~ m/(\bh\w+).*?(\bh\w+)/i;

print '$1 = ', "$1\n";
print '$2 = ', "$2\n";

Note that it's not as flexible as thines01 script because the pattern will capture only the first to words starting with 'h' or 'H' and ignores the rest.

d5e5 109 Master Poster

An associative array, otherwise known in Perl as a hash, does not preserve the order in which the elements were added. If you want to list the contents of a hash in a particular order you can sort the keys according to rules which give you the desired result. If you want to sort by keys, or values, in ascending or descending order, by numeric or alphabetic comparisons, etc. you can do that. But 'the same order' in which elements were added to a hash cannot be determined. See Perl Tutorial - Hashes.

If you need to save key-value pairs in a particular sequence you could build a more complex data structure, such as an array of references to hashes. See Arrays of Hashes.

d5e5 109 Master Poster

I was 2 minutes to slow. True, I would not have remove the space at the end.

It's not an ordinary space character at the end of each line. It's a carriage return character that causes my text editor to warn me "This line does not end with the expected EOL: 'LF'..." (see attached screenshot.) The input file has Windows format newlines (CRLF) but linux expects linux-type newlines so perl's chomp command removes only the LF. When the script prints it re-adds LF to every line. If all the lines have carriage return characters except for the last then the output file has CRLF newline characters on all lines but the last, which will have only LF.

That's not a biggie normally, unless the output file will be processed as input by another script that gets confused by mixed line endings.

d5e5 109 Master Poster

See if this does what you want.

#!/usr/bin/perl
use strict;
use warnings;
if ( !defined($ARGV[0])){
  print "\nUsage: $0 <name>\n";
  print "  Example:  $0 AGQQ01000003.1\n\n";
  exit;
}
my ($filename3,$filename,$printIt) = ('file_out.txt','dna.txt',0);
my @columns;
my $pat = quotemeta($ARGV[0]);
open my $fh, '<', $filename or die "Failed to open $filename: $!";
open my $fho, '>', $filename3 or die "Failed to open $filename3: $!";#Open file for output
while (my $rec = <$fh>){
 chomp($rec);
 @columns = split(/\|/,$rec);
 if ( $#columns >= 3 ) {
    if ( $columns[3] =~ /$pat/ ){
       #print $fho "$columns[3]: ";
       $printIt = 1;
    }
    else {
       $printIt = 0;
    }
 }
 print $fho "$rec\n" if ( $printIt && $#columns == 0 ); 
} 
close $fh;

That way looks good too, histrungalot. One thing: I notice you left out the $rec=~ s/\s//g; statement the original script had. That may or may not be wanted depending on whether you want to keep the Windows-style CRLF newlines or if you want to replace them with those of the current platform (in my case, linux).

d5e5 109 Master Poster
#!/usr/bin/perl;
use strict; 
use warnings; 
my $filename3 = 'file_out.txt';
my $filename = 'dna.txt';

open my $fh, '<', $filename or die "Failed to open $filename: $!";
open my $fho, '>', $filename3 or die "Failed to open $filename3: $!";#Open file for output

my $name;
while (my $rec = <$fh>){
    $rec=~ s/\s//g;
    chomp($rec);
    #If reading first line of data (starts with '>') for a name, save the name
    if ($rec =~ m/^>/){
        my @flds = split(/\|/, $rec);#split first line of group to get name
        $name = $flds[3];
        next; #Read next line (you don't want to print first line of group)
    }
    
    if ($name eq 'AGQQ01000002.1'){
        print $fho "$rec  \n";
    }
}
close $fh;
d5e5 109 Master Poster

If the script reads each line into the $rec variable, then you won't find anything in the $_ variable. my ($po) = $rec =~ m/\[(protein=.+?)\]/g; should work OK.

If you want to read many files, one at a time, you can assign the list of input file names into the @ARGV array and then read from the empty diamond operator.

#!/usr/bin/perl;
use strict;
use warnings;

#Assign any number of file names to Perl's special @ARGV array
@ARGV = qw(file1.txt file2.txt file3.txt);#Example with 3 file names

#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}

You can get the same result by using a glob pattern to put the list of desired files into @ARGV, which may be easier than typing 43 file names.

#!/usr/bin/perl;
use strict;
use warnings;

@ARGV = <file?.txt>;#Glob list of files matching pattern. ? stands for any character

#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}
d5e5 109 Master Poster

...Now, I am trying to run the script with the 1406_01.txt which was attached in this question last time, I have some erorr.
I hope the put out data with :

Id              product                  start     end    
gene=KIQ_00005  protein_id=EHE85001.1     1        423
gene=KIQ_00010  protein_id=EHE85002.1    710       1225

Could you show me how can I put out that form 1406_01.txt data?

Because location is not always in the same column I use regular expressions (regex) instead of split to extract the desired data. To understand what the regex means I recommend this link to a Regex Tutorial.

#!/usr/bin/perl;
use strict; 
use warnings; 

my $filename = '14067_01.txt';

open my $fh, '<', $filename or die "Failed to open $filename: $!";

printf "%s%21s%21s%7s\n", qw(id product start end);
while (my $rec = <$fh>){
    next unless $rec =~ m/^>/;#Skip all lines other than the first line of record
    chomp($rec);
#Because location is not always in the same column I use regular expressions (regex)
#instead of split to extract the desired data.
    my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
    $loc = 'undefined' unless defined $loc;
    my ($start, $end) = $loc =~ m/\d+/g;
    printf "%-16s%21s%7s%7s\n", ($name, $product, $start, $end);
}
close $fh;
d5e5 109 Master Poster

To extract two substrings of sequential digits from a string you can do a regex match using the /g option to get a list.

#!/usr/bin/perl;
use strict; 
use warnings; 

my $location = '[location=complement(<1..423)]';

my ($start, $end) = $location =~ m/\d+/g;#Regex match makes list of substrings of sequential digits

print "Start position is $start and end position is $end";
d5e5 109 Master Poster

...I can put out the data with

num  name              product               star_posi        end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I can not seperare the number of [location] and I can not delete [] of data...

To remove the square brackets from your text string you can do a regex substitution.

#!/usr/bin/perl;
use strict; 
use warnings; 

my $name = '[gene=KIQ_00005]';
print "$name\n";

my $character_to_remove = '\['; #add required escape character before [
$name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005]
print "$name\n";

$character_to_remove = '\]'; #add required escape character before ]
$name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005
print "$name\n";
d5e5 109 Master Poster
#!/usr/bin/perl;
use strict; 
use warnings; 

my $n = 3;#Decided position
my $line_count = 0;
my $filename = 'sample file.txt';
open my $fh, '<', $filename or die "Failed to open $filename: $!";

LINE: while (1){
    my %hash = ();
    foreach (1 .. $n){
        my $rec = <$fh>;
        last LINE unless defined $rec;
        $rec = <$fh> while $rec =~ m/^#/;#Skip comment lines
        
        $rec =~ s/\s*$//;#Remove spaces or newline characters from end
        $hash{++$line_count} = $rec;
    }

    my @array = sort   {my ($sa) = split /\s+/, $hash{$a};
                    my ($sb) = split /\s+/, $hash{$b};
                    $sb <=> $sa;} keys(%hash);
    
    my $label = "decided position $line_count\n";
    foreach(@array){
        print $label;
        printf "%15d%10s\n", ($_, $hash{$_});
        $label = '';
    }
}
d5e5 109 Master Poster
#!/usr/bin/perl;
use strict; 
use warnings; 
use Data::Dumper;
my $n = 3;#Decided position
my $filename = 'sample file.txt';
open my $fh, '<', $filename or die "Failed to open $filename: $!";

while (<$fh>){
    next if m/^#/;#Skip comment lines
    my @array = ();
    foreach (1 .. $n){
        my $rec = <$fh>;
        last unless defined $rec;
        
        $rec =~ s/\s*$//;#Remove spaces or newline characters from end
        push @array, $rec;
    }

    @array = sort   {my ($sa) = split /\s+/, $a;
                    my ($sb) = split /\s+/, $b;
                    $sb <=> $sa;} @array;
    
    my $label = "decided position $n";
    foreach(@array){
        printf "%-23s%s\n", ($label, $_);
        $label = '';
    }
}
d5e5 109 Master Poster

You need to save your base information into a variable as you read it, and save this into one of your hashes, along with position.

#!/usr/bin/perl;
use strict;
use warnings;
use autodie;
my ( %data1, %data2 );
open my $in, '<', 'baselist.txt';
while (<$in>) {
    next unless /^\s*\d/;
    my ( $num, $posi, $base ) = split;#Read base into a variable
    $data1{$num}{'base_posi'} = $posi;#Save position into your hash
    $data1{$num}{'base'}      = $base;#Save base into your hash
}
open $in, '<', 'lao1.txt';
while (<$in>) {
    next unless /^\s*\d/;
    my ( $num, $SNP_posi, $ref, $mut ) = split;
    $data2{$num}{'SNP_posi'} = $SNP_posi;
    $data2{$num}{'ref'}      = $ref;
    $data2{$num}{'mut'}      = $mut;
}
close $in;
open( SNP, ">SNP.txt" );
for my $num ( keys %data1 ) {
    my $val  = $data1{$num}{'base_posi'};###
    my $base = $data1{$num}{'base'};###
    for my $num2 ( keys %data2 ) {
        my $min    = $data2{$num2}{'SNP_posi'};
        my $max    = $data2{$num2}{'ref'};
        my $tengen = $data2{$num2}{'mut'};
        if ( $val eq $min ) {
            print SNP $val . "\t";
            print SNP $base . "\t";
            print SNP $tengen . "\t";
            print SNP $max . "\n";
            last;
        }
    }
}
close(SNP);
d5e5 109 Master Poster

You need a chomp; statement before the statement that does the split. Otherwise the last codon on each line will include a newline character and so will not match any codon in your hash.

d5e5 109 Master Poster

You could create another variable to hold the position value and include that variable in the string that you print.

Instead of my ($ref, $mul) = split /\s+/, $rec; you could have my ($pos, $ref, $mul) = split /\s+/, $rec;

d5e5 109 Master Poster
#!/usr/bin/perl;
use strict;
use warnings;

my %codon2proteins = build_hash();

my $dna_filename = 'dna.txt';

open my $dna_fh, '<', $dna_filename or die "Failed to open $dna_filename: $!";

while (my $rec = <$dna_fh>){
    chomp($rec);
    my ($ref, $mul) = split /\s+/, $rec;
    my $ref_pro = convert_codon2prot($ref);
    my $mul_pro = convert_codon2prot($mul);
    my $info = compare_pros($ref_pro, $mul_pro);
    print "$ref\t$mul\t$ref_pro\t$mul_pro\t$info\n";
}

sub compare_pros{
    my ($r, $m) = @_;
    if ($r eq $m){
        return 'same';
    }
    else {
        return 'change';
    }
}

sub convert_codon2prot{
    my ($codon) = @_;
    
    if(exists $codon2proteins{$codon}){
        return $codon2proteins{$codon};
    }
    else{
        die "Bad codon $codon!!\n";
    }
}

sub build_hash{
    my(%g)=(
            'TCA'=>'S', #Serine
            'TCC'=>'S', #Serine
            'TCG'=>'S',  #Serine
            'TCT'=>'S', #Serine 
            'TTC'=>'F', #Phenylalanine 
            'TTT'=>'F', #Phenylalanine 
            'TTA'=>'L', #Leucine 
            'TTG'=>'L', #Leucine 
            'TAC'=>'Y', #Tyrosine 
            'TAT'=>'Y', #Tyrosine 
            'TAA'=>'_', #Stop 
            'TAG'=>'_', #Stop 
            'TGC'=>'C', #Cysteine 
            'TGT'=>'C', #Cysteine 
            'TGA'=>'_', #Stop 
            'TGG'=>'W', #Tryptophan 
            'CTA'=>'L', #Leucine 
            'CTC'=>'L', #Leucine 
            'CTG'=>'L', #Leucine 
            'CTT'=>'L', #Leucine 
            'CCA'=>'P', #Proline 
            'CAT'=>'H', #Histidine 
            'CAA'=>'Q', #Glutamine 
            'CAG'=>'Q', #Glutamine 
            'CGA'=>'R', #Arginine 
            'CGC'=>'R', #Arginine 
            'CGG'=>'R', #Arginine 
            'CGT'=>'R', #Arginine 
            'ATA'=>'T', #Isoleucine 
            'ATC'=>'T', #Isoleucine 
            'ATT'=>'T', #Isoleucine 
            'ATG'=>'M', #Methionine 
            'ACA'=>'T', #Threonine 
            'ACC'=>'T', #Threonine 
            'ACG'=>'T', #Threonine 
            'ACT'=>'T', #Threonine 
            'AAC'=>'N', #Asparagine 
            'AAT'=>'N', #Asparagine 
            'AAA'=>'K', #Lysine 
            'AAG'=>'K', #Lysine 
            'AGC'=>'S', #Serine#Valine 
            'AGT'=>'S', #Serine 
            'AGA'=>'R', #Arginine 
            'AGG'=>'R', #Arginine 
            'CCC'=>'P', #Proline 
            'CCG'=>'P', #Proline 
            'CCT'=>'P', #Proline 
            'CAC'=>'H', #Histidine 
            'GTA'=>'V', #Valine 
            'GTC'=>'V', #Valine 
            'GTG'=>'V', #Valine 
            'GTT'=>'V', #Valine 
            'GCA'=>'A', #Alanine 
            'GCC'=>'A', #Alanine 
            'GCG'=>'A', #Alanine 
            'GCT'=>'A', #Alanine 
            'GAC'=>'D', #Aspartic Acid 
            'GAT'=>'D', #Aspartic Acid 
            'GAA'=>'E', #Glutamic Acid 
            'GAG'=>'E', #Glutamic Acid 
            'GGA'=>'G', #Glycine 
            'GGC'=>'G', #Glycine 
            'GGG'=>'G', #Glycine 
            'GGT'=>'G', #Glycine 
    );
    return %g;
}

Outputs

AGA	AAA	R	K	change
CCA	CCT	P	P	same
GCA	ACA	A	T	change
GCA …
d5e5 109 Master Poster

I don't see how your script decides whether to print 'change' or 'same'? What is the rule that determines this?

d5e5 109 Master Poster

input1.csv

3    ATG   
2    ACT
1    ATC

input2.csv

G    C 
C    A
A    A
#!/usr/bin/perl;
use strict;
use warnings;

my ($filename1, $filename2) = ('input1.csv', 'input2.csv');

open my $fh1, '<', $filename1 or die "Failed to open $filename1: $!";
open my $fh2, '<', $filename2 or die "Failed to open $filename2: $!";

while (my $rec1 = <$fh1>){
    defined (my $rec2 = <$fh2>) or last;
    print compare($rec1, $rec2), "\n";
}

sub compare{
    my ($str1, $str2) = @_;
    my ($pos, $triplet) = split(/\s+/, $str1);
    my ($ref, $mut) = split(/\s+/, $str2);
    my $idx = $pos - 1;#index starts at 0
    my $origtriplet = $triplet;
    my $origchar = substr($triplet, $idx, 1);
    my $stat;
    
    if ($origchar eq $ref){
        substr($triplet, $idx, 1) = $mut;
    }
    
    if ($origchar eq $mut){
        $stat = 'SAME';
    }
    else {
        $stat = 'CHANGE';
    }
    
    return "$origtriplet\t$triplet\t$stat";
}

Outputs

ATG	ATC	CHANGE
ACT	AAT	CHANGE
ATC	ATC	SAME
d5e5 109 Master Poster
#!/usr/bin/perl;
use strict;
use warnings;

my $filename = 'input.csv';
open my $fh, '<', $filename or die "Unable to open $filename: $!";

my $firstline = <$fh>;
chomp($firstline);
print "The first line is $firstline\n";

my @fields = split(/,/, $firstline);
my $len = length($fields[1]);
print "The second field contains '$fields[1]' which has length of $len\n";
d5e5 109 Master Poster
#!/usr/bin/perl;
use strict;
use warnings;

my $pathway;#To save string from previous iteration, create variable outside loop
while (my $line = <DATA>) {
    chomp($line);
    
    if ($line =~ /^$/) {
        next;
    }
    elsif ($line =~ /ko/) {
        $pathway = $line;
    }
    elsif ($line =~ /K/) {
        #trim($line); #What does your trim() subroutine do?
        print "$pathway\t$line\n";
    }
    else {
        print "problem with $line\n";
    }
}

__DATA__
ko10101
K01392
K09134

ko34231
K05789

ko13452
K04665
K07881

Outputs

ko10101	K01392
ko10101	K09134
ko34231	K05789
ko13452	K04665
ko13452	K07881
d5e5 109 Master Poster
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $rec = '(9430, 3656) (9147, 14355) (133, 14393) (7917, 9513) (3719, 12775)';

$rec =~ s/['(),]//g; #Remove everything except digits and spaces

my %hash = split / /, $rec; #Split on space character and assign to hash

print Dumper(\%hash);
d5e5 109 Master Poster

We can take the greatest five from a hash without sorting. I don't know if that makes it more efficient. For example, the following prints the five colors having the longest wavelengths.

#!/usr/bin/perl
use strict;
use warnings;

my %d = (violet => 400,
         red    => 650,
         indigo => 445,
         orange => 590,
         blue   => 475,
         yellow => 570,
         green  => 510);

my @p;

foreach my $r (0 .. 4){
    foreach my $k (keys %d){
        if (!defined$p[$r]
            or $d{$k} > $d{$p[$r]}){
            $p[$r] = $k;
        }
    }
    delete $d{$p[$r]};#After saving one of top 5, delete it from hash
}

print join "\n", @p;
d5e5 109 Master Poster

Here's an example of sorting an array by a specified column.

#!/usr/bin/perl;
use strict;
use warnings;

my @array = sort howtosort <DATA>;

foreach (@array){
    chomp;
    print "$_\n";
}

sub howtosort{
    my @flds_a = split(/\|/, $a);
    my @flds_b = split(/\|/, $b);
    $flds_a[2] cmp $flds_b[2]; #compare key fields to sort
}
__DATA__
1780438|20110709|0000007704000000000000004888|7704|48881|PE|08/12/2008 11:38:54|0|1000.00
1780437|20110708|0000007704000000000000004882|7704|48882|PE|08/12/2008 11:38:54|0|1000.00
1780436|20110707|0000007704000000000000004889|7704|48887|PE|08/11/2008 11:38:54|0|1000.00
1780435|20110703|0000007704000000000000004881|7704|48888|PE|08/12/2008 11:38:54|0|1000.00
k_manimuthu commented: Done wonderful +6
d5e5 109 Master Poster

i want to be able to treat contents of a file as an array and traverse through it that way without having to store them in an array,can i do that?if so how?

Tie::File treats the contents of a file as an array, but not a 2D array. You might try using Tie::File with the recsep option and calculate that, for example, if you want the first (really the zero'th) element in the third (really the 2nd) row and there are 5 columns in every row then you can refer to element[10] of your one-dimensional array. It will take some thinking to determine how to calculate your row and column indices, but once you have it figured out the program should run fairly fast because Tie::File doesn't have to load the entire file into memory.

d5e5 109 Master Poster

thank you very much.
It is run well.
I had repair

for (my $i=1; $i<=$#arr; $i++){$arr{$i}+=$arr[$i];}{

but it was printed begin with the "Colum 0".

total of colum 0 is 9 and average is 3 
      total of colum 1 is 11 and average is 3.6666666 
     .................................................

How can I set up the colum 0 to become colum 1? It means "total of colum 1 is 9 and average is 3".
By the way, could you please show me where I could study the mean of symbol in perl and how to use it?(EX: what mean of m/^\d/)

I think you would need to convert your one-line for loop statement into a multi-statement for loop so you can create a variable for column number that will be $_ plus one.

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

my (@arr, %arr, $row);
while(<DATA>){
    #skip non-numeric data
    next unless m/^\d/;
    
    @arr=split(/\s+/, $_);
    for (my $i=0; $i<=$#arr; $i++){
        $arr{$i}+=$arr[$i];
    }
    $row++;
}

#When $_ = 0 you want to call it column 1, etc.
for (sort keys (%arr)){
    my $col_nbr = $_ + 1;
    say ("total of column $col_nbr is $arr{$_} and average is \t", ($arr{$_}/$row));
}

__DATA__
#########
1 2 3 4 5

6 6 6 4 4
2 3 4 5 6

To find the meaning of m/^\d/ you can Google Perl regular expressions. The ^ means the beginning of the record and \d represents any numeric digit so m/^\d/ will match any data record that begins with a number. …

d5e5 109 Master Poster

You need to name your DATA section __DATA__ in capital letters, not __data__ .

You can skip records that don't begin with a number, as follows:

#skip non-numeric data
next unless m/^\d/;

Notice that the script ignores column 0 and starts with column 1, because in Perl array indexes begin at 0. Is that what you want?

d5e5 109 Master Poster

**Deleted** (Didn't notice posts on page 2. Looks like this has already been solved).

d5e5 109 Master Poster

Great! I'm glad that's what you wanted. Please remember to mark this thread solved.

d5e5 109 Master Poster
#!/usr/bin/perl
use strict;
use warnings;

while(my $rec = <DATA>){
    chomp($rec);
    #the // default delimiters for a match can be changed to arbitrary delimiters
    #The following regex is delimited by {} because pattern contains forward slashes
    $rec =~ s{/English-Folder/(.+)_e\.shtml}{/French-Folder/$1_f.shtml};
    print $rec, "\n";
}
__DATA__
/English-Folder/temp.php.u1conflict
/English-Folder/temp-post01.html
/English-Folder/temp-post.html
/English-Folder/temp.txt
/English-Folder/target01_e.shtml
/English-Folder/testcomponent.html
/English-Folder/target02_e.shtml
/English-Folder/zengarden-sample.html

Outputs:

/English-Folder/temp.php.u1conflict
/English-Folder/temp-post01.html
/English-Folder/temp-post.html
/English-Folder/temp.txt
/French-Folder/target01_f.shtml
/English-Folder/testcomponent.html
/French-Folder/target02_f.shtml
/English-Folder/zengarden-sample.html
d5e5 109 Master Poster

Sorry, I don't know how to make your script run faster other than what I already said about slurping the file into your scalar variable instead of reading it one line at a time.

Taking 50 substrings starting at each character in a large file is probably taking most of the runtime, and I don't know a way of getting the substrings faster.

d5e5 109 Master Poster

You're welcome. Please don't forget to mark this thread 'solved'.

d5e5 109 Master Poster

When I tried to download your attached ref1.txt I got an error message from Daniweb saying "/tmp/Xfx9+ApI.part could not be saved, because the source file could not be read" so I can't see the data.

#!/usr/bin/perl
use strict;
use warnings;

while(my $rec = <DATA>){
    chomp($rec);
    $rec = reverse($rec);
    $rec =~ tr/ATGC/TACG/;
    print "$rec\n";
}
__DATA__
GCTCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCG
TCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGA
CCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGAC

You already know how to reverse a text string. To replace A with T, T with A, etc. you could use the transliteration function $rec =~ tr/ATGC/TACG/; Since I don't get the same output you want, I may have misunderstood the question.

d5e5 109 Master Poster

Your loop runs while($run eq "yes") but when the user enters yes at the prompt then $run contains yes with a newline character at the end. Try doing a chomp on the $run after assigning the STDIN to it.

print "Send more strings? yes/no";
$run = <STDIN>;
chomp($run);#Remove the newline character from input
d5e5 109 Master Poster

Suppose file.txt contains the following:

CTTA TAAC GACC CCCG CCGA CACG GCAG TGAG CGCA GCAG CGAC GCGT GGCT CTTG TAAT 
AACC AATG CGCT TGCG AAAT CAGC TAGC CCAT TTGA TAAA GTAA GGGC TCGA GAGG ATTT 
GGCA TTAA GCAC GGCT TGTG CCTA CCTC TGGT TTCC GTGT CTAC ACAG TAGT CGGC TGTC 
TATC TGTT CGTC CGAC CGCT
#!/usr/bin/perl
#print_files_in_subdirs.pl
use strict;
use warnings;

my $input_filename = 'file.txt';
my $data = slurp_file($input_filename);

$data =~ s/\s//g;#Remove all space, newline, etc.
$data =~ s/(\w{50})/$1\n/g;

print $data;

sub slurp_file{
    my $filename = shift;
    local $/=undef;
    open my $fh, $filename or die "Couldn't open file: $!";
    my $string = <$fh>;
    return $string;
}

Output:

CTTATAACGACCCCCGCCGACACGGCAGTGAGCGCAGCAGCGACGCGTGG
CTCTTGTAATAACCAATGCGCTTGCGAAATCAGCTAGCCCATTTGATAAA
GTAAGGGCTCGAGAGGATTTGGCATTAAGCACGGCTTGTGCCTACCTCTG
GTTTCCGTGTCTACACAGTAGTCGGCTGTCTATCTGTTCGTCCGACCGCT
d5e5 109 Master Poster

For example, you could do it as follows:

#!/usr/bin/perl;
use strict;
use warnings;

my %dict;
my $dictfilename = 'words-english.txt';

open my $fh, '<', $dictfilename or die "Failed to open $dictfilename: $!";
while (my $word = <$fh>){
    chomp $word; #Remove end-of-line character
    my $upper_word = uc($word);#Convert to upper case
    $dict{$upper_word} = undef; #Add word as key to hash (no value associated)
}
close $fh;

my $permsfilename = 'permutations.txt';
open $fh, '<', $permsfilename or die "Failed to open $permsfilename: $!";
while (my $word = <$fh>){
    chomp $word; #Remove end-of-line character
    my $upper_word = uc($word);#Convert to upper case
    if (exists $dict{$upper_word}){
        print "Yes, $word is in the dictionary.\n";
    }
    else{
        print "No, $word is NOT in the dictionary.\n";
    }
}
close $fh;

(Today I learned that 'eta' is an English word naming the seventh letter of the Greek alphabet.)

d5e5 109 Master Poster

I wrote c program that outputs all permutations of a word to a txt file.

ate
aet
tae
tea
eat
eta

I also have a txt file of all the words in the dictionary. I would like to take the first entry in permutations.txt and search dictionary.txt to see if its a valid word then the second entry ect...

I have a little experience using perl, but only for parsing one file. Would it be a good fit for what i'm trying to do. Anybody have some suggestions or point me in the right direction?

You can do this easily in Perl. A hash in Perl works sort of like a dictionary in that it consists of unique keys associated with values, so read all the words from your dictionary file into memory as keys to a hash. Since you don't care about definitions of the words you don't need to associate any values with the keys in your hash, so make the values undef . Having all your dictionary words stored as keys in a hash allows you to test if any given word exists in the hash much more easily than having to look for it in the dictionary file. For an introduction to using hashes in Perl see http://www.perltutorial.org/perl-hash.aspx

All your sample data words are in lower case letters but, in case future data include mixed-case words, make sure to convert every word to either upper or lower case …

d5e5 109 Master Poster

During the first 'foreach' loop, you split each column into $pt and $ps...

I split each row and assign the first two elements of the resulting list to the variables $pt (program title) and $ps (program session).

...and then assign the values of $programs{$pt} into $ps

No, I assign the content of $ps to the $programs{$pt} entry in the %programs hash. A hash contains one or more key-value pairs, where the key must be unique for the hash. This allows us to look up any value in the hash by specifying its key. The key of this particular hash entry is $pt and the value is $ps.

if (exists $programs{$pt}){ isn't that mean you're trying to check the program session whether it exists in the archive file?

No, I'm checking to see if there is an entry in the %programs hash having that program title as key. Remember that the %program hash contains all data read from program.tsv so in effect I'm checking to see if that program title (from the archive file) exists in the program file.

I can't understand the difference between $programs{$pt} in if (exists $programs{$pt}){ and in print $fh3 "$_\t$programs{$pt}\n";

There is no difference. I check if $programs{$pt} exists before trying to print $programs{$pt} because if I try to print a hash value for a key that doesn't exist in the hash perl will give me an error message. For an introduction to using hashes in Perl see http://www.perltutorial.org/perl-hash.aspx

d5e5 109 Master Poster

Because you only need to read each record from the archive file once, you don't need to save it into an array. Just iterate through it.

Because each program title should occur no more than once in the program file, you can read it into a hash for more efficient lookup than having to iterate through an array each time you want to search for a program title. The following works for me:

#!/usr/bin/perl
#match_archive.pl
use strict;
use warnings;

my %programs;#Hash to save program title => program session for each program rec
open my $fh1, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
foreach (<$fh1>){
    chomp;#Remove newline character from end of line
    my ($ps, $pt) = split(/\t/);
    $programs{$pt} = $ps;
}
close $fh1;

open my $fh2, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";#Output
foreach (<$fh2>){
    chomp;#Remove newline character from end of line
    my $pt = (split(/\t/))[1];#Paper title
    if (exists $programs{$pt}){
        print $fh3 "$_\t$programs{$pt}\n";
    }
    else{
        print $fh3 "$_\t*NA*\n";
    }
}
red711 commented: useful suggestions, nice and efficient codes. +0
d5e5 109 Master Poster

I'll have another look when I get some time but the first thing I'd suggest:

use strict;
use warnings;

belong in every non-trivial Perl script. Add them to your script and declare any undeclared variables, such as @sesname, @title1, @all, @title2 and $i.

Another general suggestion: when opening files, use lexical filehandles instead of bare words and test for success or failure to open by including an or die "blah blah: $!" clause.

open my $fh1, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh2, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";
d5e5 109 Master Poster

You are welcome. Please don't forget to mark this thread 'solved'.

d5e5 109 Master Poster

Data_1.txt

Num Posi
1 2
2 5
3 4

Data_2.txt

Num Star_posi End_posi
1 1 10
2 15 18
3 26 30
#!/usr/bin/perl;
use strict;
use warnings;

my $filename1 = 'Data_1.txt';
my $filename2 = 'Data_2.txt';
my $filename3 = 'output.txt';
my %ranges;

open my $fh, '<', $filename2 or die "Failed to open $filename2: $!";
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $start, $end) = split;
    $ranges{$num}{Star_posi} = $start;
    $ranges{$num}{End_posi} = $end;
}
close $fh;

open $fh, '<', $filename1 or die "Failed to open $filename1: $!";
open my $fho, '>', $filename3 or die "Failed to open $filename3: $!";#Open file for output
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $posi) = split;
    my $range_num = search_ranges($posi);
    if ($range_num){
        #print to output file instead of STDOUT
        print $fho "Data1 record number $num found "
            . "at position $posi found in $ranges{$range_num}{Star_posi} "
            . "and $ranges{$range_num}{End_posi} \n";
    }
    else{
        print "Data1 record number $num not found in Data2\n";
    }
}

sub search_ranges{
    my $pos = shift;
    foreach (keys %ranges){
        return $_ if $pos >= $ranges{$_}{Star_posi}
                    and $pos <= $ranges{$_}{End_posi};
    }
}

output.txt

Data1 record number 1 found at position 2 found in 1 and 10 
Data1 record number 2 found at position 5 found in 1 and 10 
Data1 record number 3 found at position 4 found in 1 and 10
d5e5 109 Master Poster

sorry I prepared the Print code in below.

print "at position  $posi found in $ranges{$num}{Star_posi} and $ranges{$num}{End_posi} \n";

But I have one problem. EX:

Data 1:                  Data2: 
      Num   Posi           Num    star      end
       1     2              1      1         10
       2     5              2      15        18
       3     4              3      26        30

How coould I find Num (3) posi (4) between 1 and 10?

I'm confused by your second example. It looks like your two data sets are side by side in one input file. Is that correct? I assumed from reading your original question that you had two separate input files.

d5e5 109 Master Poster
#!/usr/bin/perl;
use strict;
use warnings;

my $filename1 = 'Data_1.txt';
my $filename2 = 'Data_2.txt';
my %ranges;

open my $fh, '<', $filename2 or die "Failed to open $filename2: $!";
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $start, $end) = split;
    $ranges{$num}{Star_posi} = $start;
    $ranges{$num}{End_posi} = $end;
}
close $fh;

open $fh, '<', $filename1 or die "Failed to open $filename1: $!";
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $posi) = split;
    my $range_num = search_ranges($posi);
    if ($range_num){
        print "Data1 record number $num found in Data2 record number $range_num\n";
    }
    else{
        print "Data1 record number $num not found in Data2\n";
    }
}

sub search_ranges{
    my $pos = shift;
    foreach (keys %ranges){
        return $_ if $pos >= $ranges{$_}{Star_posi}
                    and $pos <= $ranges{$_}{End_posi};
    }
}

Outputs:

Data1 record number 1 found in Data2 record number 1
Data1 record number 2 found in Data2 record number 1
Data1 record number 3 found in Data2 record number 1
Data1 record number 4 found in Data2 record number 2
Data1 record number 5 not found in Data2
d5e5 109 Master Poster

Each hairpin id should is unique - that's why I couldn't figure out why there were so few lines.

Sorry, my mistake. I just loaded the hairpin-nearest-gene.txt containing the output from yesterday's test run into a text editor and scrolled to the end and it shows there are 168026 lines. I don't know where I got the impression there were only 8026. Maybe too much Benadryl. It looks like the hairpin ids in the input are unique, so that's good. You can see the output at http://d5e5.fyels.com/dab

d5e5 109 Master Poster

There should by a line for every hairpin id but not every line in the hairpin data file if there are multiple lines having the same hairpin id. I was assuming hairpin id can be a unique key so it can be the key to the %h_hash. If hairpin ids aren't unique then the hash will contain data for only the last line of each bunch of lines for a particular hairpin_id. If hairpin ids aren't unique, I guess I still don't understand what hairpins represent.

As for an explanation of lines 120 - 123: the following has to dereference an array reference referring to an array of array references. That's why it looks complicated. The complex syntax has nothing to do with its being in a subroutine. Each scaffold key in the hash contains a reference to an array of references to arrays containing the split gene data lines for that scaffold.

An iteration repeatedly puts the current element in the $_ variable by default (if no other variable is specified). In this case $_ contains an array reference so we must dereference it to make it an array again, like so: @$_

foreach (@{$genedata{$s}}){
        chomp;

        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = @$_;
d5e5 109 Master Poster

For each line in the gene data file we create a key value pair in the %genedata hash. The key is the gene scaffold and the value is a reference to an array containing the line of gene data that we split by spaces.

So line 120 iterates through this array. To de-reference this array we have to look it up in the hash and wrap it in dereferencing syntax resulting in @{$genedata{$s}}

d5e5 109 Master Poster

When we update the code is it possible change line 76

next unless defined $g_other and $g_other eq 'gene'

to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

I moved the following statement from the read_genedata sub to the nearest_gene sub.

next unless defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other;#Read only 'gene' records

Also I modified the script to read the gene data into a hash of arrays instead of a plain array so the nearest_gene sub can look up only the gene records for a specific scaffold which serves as a key to the hash. Looking up data by key in a hash should be faster than reading through an array to find the genes for that scaffold.

Please try the following. On my laptop it takes roughly half an hour and creates an output file called 'hairpin-nearest-gene.txt' containing 8026 lines.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(cab-ALL.LU-premiRNAs.withLOCATION.Vmatch-Nvmr-EST.list bab-All.LuCoding.061511) if @ARGV == 0;
#@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;
my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

#Save output to a file instead of STDOUT
#because we are printing debug info to STDOUT
my $outfilename = 'hairpin-nearest-gene.txt';

my %genedata;#Hash of genedata arrays for each …
d5e5 109 Master Poster

Relational database management software can load not only the data but, if you design your database well, some of the important relationships between data elements. Because the gene data file is rather large, and we have to load the entire file into an array in memory and then loop through the array from the start to find and test rows having the same scaffold as the hairpin, and do this for each hairpin, the script takes a long time. Database software can select subsets (AKA result sets) from the data without loading all the data (together, at one time) into memory, and may do it more efficiently if you first build indexes on the columns by which your selecting.

Another advantage of relational databases is you can do your data processing in a series of steps. In the first program you can load the data files into database tables and build the appropriate indexes. In another step you can build a table associating each hairpin id with the id of the nearest gene. Then you can write one or more programs that query this data in various ways without having to repeatedly load, sort and otherwise massage the original data. That way you can build the database once and then develop and test various queries that shouldn't take as long to run as the job that builds, indexes and links the data.

The downside is I'm not an expert in SQLite and have only played with it because …