Hi,

I have a data file (.txt, 3.96 MB) and I want to print the 50 characters starting at each position, one window per line.

For example:
Input data:

GTGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCTC........
Output with 50 characters on each line:
       GTGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCC
        TGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCG
         GAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGA
          AGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGAT
           GCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATC
            CCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCT
             CAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCTC

Below is my script, but it takes a long time to run. Could you show me a faster way to solve this problem?

use bigint;


print "Insert the file:";
$file = <STDIN>;



if (!open (IN,"$file")){
	print "false.\n";
	pause;
}
if (!open (OUTB, ">mode.txt")){
	print "false.\n";
	pause;
	
}
if (!open (OUTC, ">mode1.txt")){
	print "false.\n";
	pause;
	
}


$data = "";


while (<IN>){
	
	if ($_ =~ />/){
		next;
	}
	
	
	$_ =~ s/\r//;
	$_ =~ s/\n//;
	
	
	$data = $data.$_;
}


$num = length($data);


for ($i = 0; $i < $num; $i++){
	$tag = substr($data, 0 + $i, 50);
	print (OUTB "$i\t$tag\n");   
}


print (OUTB "\n");


$data = reverse($data);

$count = 0;
$pos = 0;
$tag = "";


for ($i = 0; $i < $num + 49; $i++){
	$change = substr($data, 0 + $i, 1);
	if ($change eq "A") {
		$change =~ s/A/T/;
	}
	elsif ($change eq "T"){
		$change =~ s/T/A/;
	}
	elsif ($change eq "G"){
		$change =~ s/G/C/;
	}
	elsif ($change eq "C"){
		$change =~ s/C/G/;
	}
	
	$tag = $tag.$change;
	$count++;
	
	if($count == 50){
		print (OUTC "\t$tag\n"); 
		$count = 0;
		$tag = "";
		$pos++;
		$i = $pos;
	}
}
print "\ndata finished.\ncheck 「Model and model.txt.\n";
close (IN);
close (OUTB);
close (OUTC);

All 12 Replies

Suppose file.txt contains the following:

CTTA TAAC GACC CCCG CCGA CACG GCAG TGAG CGCA GCAG CGAC GCGT GGCT CTTG TAAT 
AACC AATG CGCT TGCG AAAT CAGC TAGC CCAT TTGA TAAA GTAA GGGC TCGA GAGG ATTT 
GGCA TTAA GCAC GGCT TGTG CCTA CCTC TGGT TTCC GTGT CTAC ACAG TAGT CGGC TGTC 
TATC TGTT CGTC CGAC CGCT

#!/usr/bin/perl
# Wrap the sequence at 50 characters per line.
use strict;
use warnings;

my $input_filename = 'file.txt';
my $data = slurp_file($input_filename);

$data =~ s/\s//g;#Remove all space, newline, etc.
$data =~ s/(\w{50})/$1\n/g;

print $data;

sub slurp_file{
    my $filename = shift;
    local $/=undef;
    open my $fh, '<', $filename or die "Couldn't open $filename: $!";
    my $string = <$fh>;    # $/ is undef here, so this reads the whole file
    close $fh;
    return $string;
}

Output:

CTTATAACGACCCCCGCCGACACGGCAGTGAGCGCAGCAGCGACGCGTGG
CTCTTGTAATAACCAATGCGCTTGCGAAATCAGCTAGCCCATTTGATAAA
GTAAGGGCTCGAGAGGATTTGGCATTAAGCACGGCTTGTGCCTACCTCTG
GTTTCCGTGTCTACACAGTAGTCGGCTGTCTATCTGTTCGTCCGACCGCT

Thank you very much. It works well.
I'm sorry, my question was not clear. I want to print the 50 characters starting at each position, one window per line, and then do the same with the reversed data. When I reverse the data I have to change A to T, T to A, G to C, and C to G.

For example:
 input: GTGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCTCACA

 output 
Data:
GCTCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCG
TCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGA
CCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGAC

Reverse Data:
TGTGAGAGTCGGCAACAACTTGGCGCCAGGTTTCGAGCAAAGAAGATGAG
GTGAGAGTCGGCAACAACTTGGCGCCAGGTTTCGAGCAAAGAAGATGAGT
TGAGAGTCGGCAACAACTTGGCGCCAGGTTTCGAGCAAAGAAGATGAGTT

When I tried to download your attached ref1.txt I got an error message from Daniweb saying "/tmp/Xfx9+ApI.part could not be saved, because the source file could not be read" so I can't see the data.

#!/usr/bin/perl
use strict;
use warnings;

while(my $rec = <DATA>){
    chomp($rec);
    $rec = reverse($rec);
    $rec =~ tr/ATGC/TACG/;
    print "$rec\n";
}
__DATA__
GCTCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCG
TCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGA
CCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGAC

You already know how to reverse a text string. To replace A with T, T with A, and so on, you could use the transliteration operator: $rec =~ tr/ATGC/TACG/; Since I don't get the same output you want, I may have misunderstood the question.
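
The reverse-then-transliterate step described above can be wrapped in a small helper. This is only a minimal sketch: the sub name revcomp is hypothetical and does not appear anywhere in the thread, and it assumes the input contains only upper-case A, T, G and C.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reverse complement of a DNA string.
sub revcomp {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;   # scalar context reverses the characters
    $rc =~ tr/ATGC/TACG/;           # swap A<->T and G<->C in one pass
    return $rc;
}

print revcomp('GATTACA'), "\n";     # prints TGTAATC
```

Because tr works on the whole string at once, there is no need to walk the sequence one character at a time.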

Hi d5e3,
Thank you very much for showing me $rec =~ tr/ATGC/TACG/; I think it will help my script run faster.

My task:
1. Print the 50 bases starting at each character, one window per line.
2. The same as 1, but with the reversed data.

Input: I have data with 60 characters (1...60)
GTGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCTCACA
Output:
  Question 1: Begin at G and take 50 characters (from left to right)
              GCTCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCG
              Begin at C and take 50 characters (from left to right)
              CTCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGG
              ..................................................
  Question 2: Reverse (input data)
              Begin at T and take 50 characters
              TGTGAGAGTCGGCAACAACTTGGCGCCAGGTTTCGAGCAAAGAAGATGAG
              Begin at G and take 50 characters
              GTGAGAGTCGGCAACAACTTGGCGCCAGGTTTCGAGCAAAGAAGATGAGT
              ..................................................

My code, revised to use $rec =~ tr/ATGC/TACG/;, is below.

Could you please give me more advice on making the script run faster? My data is about 3.2 MB.

use bigint; 
use strict;
use warnings;

print "Insert the file:";
my $file = <STDIN>;

if (!open (IN,"$file")){
	print "false.\n";
	sleep;
}
open (OUTB, ">mode.txt");
open (OUTC, ">mode1.txt");

my $data = "";

while (<IN>){
	
	if ($_ =~ />/){
		next;
	}
	
	
	$_ =~ s/\r//;
	$_ =~ s/\n//;
	
	
	$data = $data.$_;
}

my $num = length($data);


for (my $i = 0; $i < $num; $i++){
	my $tag = substr($data, 0 + $i, 50);
	print (OUTB "$i\t$tag\n");   
}


my $data1 = reverse($data);
$data1 =~ tr/A|T|G|C/T|A|C|G/;

for (my $i = 0; $i < $num; $i++){
	my $tag = substr($data1, 0 + $i, 50);
	print (OUTC "$i\t$tag\n");   
}




print "\ndata finished.\ncheck 「Model and model.txt.\n";
close (IN);
close (OUTB);
close (OUTC);

Sorry, I don't know how to make your script run faster other than what I already said about slurping the file into your scalar variable instead of reading it one line at a time.

Taking a 50-character substring starting at each character in a large file is probably taking most of the runtime, and I don't know a way of getting the substrings faster.
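
One thing that may be worth checking, offered as an assumption rather than something measured in this thread: the posted script begins with "use bigint", which turns integer literals into Math::BigInt objects, so the loop counter arithmetic goes through overloaded object methods instead of native integers. The sketch below shows the same substr loop without that pragma. The sub name print_windows is hypothetical, and unlike the posted loop it stops at the last full-width window instead of also printing the shorter tail pieces.

```perl
#!/usr/bin/perl
use strict;
use warnings;   # deliberately no "use bigint", so $i stays a native integer

# Hypothetical helper: write every full-width window of $data to $fh,
# one per line, prefixed with its 0-based start offset.
sub print_windows {
    my ($fh, $data, $width) = @_;
    my $last = length($data) - $width;   # last offset that still has a full window
    for my $i (0 .. $last) {             # empty range if $data is shorter than $width
        print {$fh} $i, "\t", substr($data, $i, $width), "\n";
    }
    return;
}

my $data = 'GTGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCTCACA';
print_windows(\*STDOUT, $data, 50);
```

Writing each window straight to the filehandle avoids holding millions of 50-character strings in memory at once.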

The regex engine may reduce the processing time in this case, compared with calling 'substr' inside the 'for' loop.

#!/usr/bin/perl
use strict;
use warnings;

my $name='GTGAGCCAGAACTCATCTTCTTTGCTCGAAACCTGGCGCCAAGTTGTTGCCGATCTC';
my $num='50';

while($name=~ m{.{$num}}g)
{
	print "\n\nFirst $num characters\t: $&";
	
	my $reverse = reverse ($&);
	print "\nReverse $num characters\t: $reverse";

	# you may print the $reverse to some file handle 
	$reverse =~ tr/A|T|G|C/T|A|C|G/;
	print "\nOutput of the sequence\t: $reverse";

	# Remove the first character of $name.
	# So $name will be reset and ready to find the next $num characters
	$name=~ s{^.}{};
}

Try the code below on your 3.2 MB data:

#!/usr/bin/perl
use strict;
use warnings;

### Inputs 
my $input='input.txt';
open (FIN, "$input") || die "Cannot open the $input file : $!";
read FIN, my $file, -s FIN;
close (FIN);

### no of occurence to match
my $num='50';

### Output
open (FOUT, ">output.txt") || die "Cannot create the output file : $!";

while($file=~ m{.{$num}}g)
{
	my $reverse = reverse ($&);
	$reverse =~ tr/A|T|G|C/T|A|C|G/;
	print FOUT "\n$reverse";

	# Remove the first character of $name.
	# So $name will be reset and ready to find the next $num characters
	$name=~ s{^.}{};
}

close (FOUT);

Sorry, change the variable '$name' to '$file' in line number 25
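
An alternative sketch that gets overlapping windows without modifying the string inside the loop: a zero-width lookahead with a capture group lets the /g match advance one character at a time. The sub name revcomp_windows is hypothetical. Whitespace is stripped first because "." does not match a newline by default, which matters when a whole multi-line file has been read into one scalar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of every overlapping window.
sub revcomp_windows {
    my ($seq, $width) = @_;
    $seq =~ s/\s+//g;                      # drop newlines and spaces first
    my @out;
    while ($seq =~ /(?=(.{$width}))/g) {   # zero-width match, so windows overlap
        my $rc = scalar reverse $1;        # reverse the captured window
        $rc =~ tr/ATGC/TACG/;              # then complement it
        push @out, $rc;
    }
    return @out;
}

print "$_\n" for revcomp_windows("GTGAGC\nCAGA", 8);
```

For a multi-megabyte input it would probably be better to print each window inside the loop rather than collect them all in a list.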

Thank you very much. The script works well, but with the long file it has a problem: the results are correct at first, but then the same result is repeated (about 6 times). Could you show me how to solve that problem?

I don't know exactly what problem you are seeing. But I guess you may want to process each line separately and write out every possible sequence for it.

#!/usr/bin/perl
use strict;
use warnings;

### Inputs 
my $input='ref1.txt';
open (FIN, "$input") || die "Cannot open the $input file : $!";

### no of occurence to match
my $num='50'; my $count=1;

### Output
open (FOUT, ">output.txt") || die "Cannot create the output file : $!";

while (<FIN>)
{
	my $line = $_; chomp($line);
	print FOUT "\n\nLine $count\t\t: $line"; my $seq=1;
	while($line=~ m{.{$num}}g)
	{
		my $reverse = reverse ($&);
		$reverse =~ tr/A|T|G|C/T|A|C|G/;
		print FOUT "\nSequence $seq\t: $reverse";

		# Remove the first character of $line.
		# So $line will be reset and ready to find the next $num characters
		$line=~ s{^.}{};
		$seq++;
	}
	$count++;
}

close (FIN);
close (FOUT);

Thank you so much. I just added

print FOUT "\n\nFirst $num characters\t: $&";

which you wrote the other day. It does everything I hoped for and runs faster.

**Deleted** (Didn't notice posts on page 2. Looks like this has already been solved).
