I have some data with many columns, but I only want to write some of the columns out to a different file.
I found a related question on the forum: http://www.daniweb.com/software-development/perl/threads/377421 . I used that to try to solve my problem, but it was not successful.
I can output the data as:
num name product star_posi end_posi
1 [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]
but I cannot separate the numbers in [location], and I cannot remove the [] from the data.
Below is my data:
num name product kind star_posi end_posi
1 gene=KIQ_00005 protein=hypothetical protein 1 423
2 gene=KIQ_00010 protein=hypothetical protein 710 1225
etc...............................................
Could you help me to solve this problem? Thank you very much.
To remove the square brackets from your text string you can do a regex substitution.
#!/usr/bin/perl
use strict;
use warnings;
my $name = '[gene=KIQ_00005]';
print "$name\n";
my $character_to_remove = '\['; #add required escape character before [
$name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005]
print "$name\n";
$character_to_remove = '\]'; #add required escape character before ]
$name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005
print "$name\n";
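Each substitution above removes only the first bracket it finds. As a minimal sketch, assuming a field might contain more than one bracket, a character class with the /g modifier strips all of them in one pass. The sub name strip_brackets is made up for illustration and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove every [ and ] from a string.
sub strip_brackets {
    my ($text) = @_;
    $text =~ s/[\[\]]//g;    # character class matches either bracket; /g repeats the match
    return $text;
}

print strip_brackets('[gene=KIQ_00005]'), "\n";    # prints gene=KIQ_00005
```

The same sub can be applied to each field after the line has been split or matched.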
To extract two substrings of sequential digits from a string you can do a regex match using the /g option to get a list.
#!/usr/bin/perl
use strict;
use warnings;
my $location = '[location=complement(<1..423)]';
my ($start, $end) = $location =~ m/\d+/g;#Regex match makes list of substrings of sequential digits
print "Start position is $start and end position is $end\n";
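Taking every run of digits works for this record. A slightly stricter sketch, assuming locations are always written as start..end (possibly with a < or > marker, as in the sample), anchors on the two dots and returns an empty list when no such pair is present. The sub name parse_location is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the start and end numbers out of a location field.
sub parse_location {
    my ($location) = @_;
    # Optional < or > before each number, as in complement(<1..423)
    if ($location =~ m/[<>]?(\d+)\.\.[<>]?(\d+)/) {
        return ($1, $2);
    }
    return;    # empty list when no start..end pair is found
}

my ($start, $end) = parse_location('[location=complement(<1..423)]');
print "Start position is $start and end position is $end\n";
```

Returning an empty list lets the caller test whether a location was actually found before printing it.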
Thank you very much, d5e5.
I used your advice to write the script. It runs very well on the data I selected.
Now I am trying to run the script on 1406_01.txt, which I attached to this question last time, and I get some errors.
I would like the output to look like this:
Id product start end
gene=KIQ_00005 protein_id=EHE85001.1 1 423
gene=KIQ_00010 protein_id=EHE85002.1 710 1225
Could you show me how I can produce that output from the 1406_01.txt data?
Because location is not always in the same column, I use regular expressions (regex) instead of split to extract the desired data. To understand what the regex means, I recommend this link to a Regex Tutorial.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = '14067_01.txt';
open my $fh, '<', $filename or die "Failed to open $filename: $!";
#Header uses the same column widths as the data rows below
printf "%-16s%21s%7s%7s\n", qw(id product start end);
while (my $rec = <$fh>){
    next unless $rec =~ m/^>/;#Skip all lines other than the first line of record
    chomp($rec);
    #Because location is not always in the same column I use regular expressions (regex)
    #instead of split to extract the desired data.
    my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
    next unless defined $loc;#Skip header lines that lack the expected fields
    my ($start, $end) = $loc =~ m/\d+/g;
    printf "%-16s%21s%7s%7s\n", ($name, $product, $start, $end);
}
close $fh;
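To see what the regex captures without needing the data file, here is a small sketch that applies the same pattern to one header line. The sample line is an assumption about how the records look, based on the fields mentioned in this thread; the real lines may differ. The sub name parse_header is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex from the script above.
sub parse_header {
    my ($rec) = @_;
    my ($name, $product, $loc) =
        $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/;
    return unless defined $name;            # line does not have the expected fields
    my ($start, $end) = $loc =~ m/\d+/g;    # first two runs of digits in the location
    return ($name, $product, $start, $end);
}

# Assumed sample header; the real lines in the data file may differ.
my $sample = '>seq1 [gene=KIQ_00005] [protein=hypothetical protein] '
           . '[protein_id=EHE85001.1] [location=complement(<1..423)]';

my @fields = parse_header($sample);
print join("\t", @fields), "\n";
```

Wrapping the match in a sub makes it easy to try the pattern on a few pasted lines before running it over a whole file.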
Hi d5e5,
Thank you for your help and support. It runs very well. Could you please help me with two more questions?
1. How can I extract more columns from the same line in your script?
my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
This is my modification:
my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
my ($po) = $_ =~ m/\[(protein=.+?)\]/g;
2. I have 43 data files to analyse in my project. Is there a way to process all 43 files in one run?
If the script reads each line into the $rec variable, then you won't find anything in the $_ variable.
my ($po) = $rec =~ m/\[(protein=.+?)\]/g;
should work OK.
If you want to read many files, one at a time, you can assign the list of input file names into the @ARGV array and then read from the empty diamond operator.
#!/usr/bin/perl
use strict;
use warnings;
#Assign any number of file names to Perl's special @ARGV array
@ARGV = qw(file1.txt file2.txt file3.txt);#Example with 3 file names
#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
chomp($rec);
print "$rec\n";
}
You can get the same result by using a glob pattern to put the list of desired files into @ARGV, which may be easier than typing 43 file names.
#!/usr/bin/perl
use strict;
use warnings;
@ARGV = <file?.txt>;#Glob list of files matching pattern. ? stands for any single character
#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
chomp($rec);
print "$rec\n";
}
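When reading many files through the empty diamond operator, it can help to know which file the current line came from. Perl keeps that name in the special $ARGV variable. The following is only a sketch: the sub name count_headers is hypothetical, and the two temporary files are stand-ins for the real data files so that the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: count the header lines (starting with >) in each file.
sub count_headers {
    my @files = @_;
    my %count;
    return %count unless @files;    # with an empty @ARGV, <> would read STDIN
    local @ARGV = @files;           # <> reads the files listed in @ARGV in order
    while (my $rec = <>) {
        $count{$ARGV}++ if $rec =~ m/^>/;    # $ARGV holds the current file name
    }
    return %count;
}

# Demo with two small temporary files standing in for the real data files.
my $dir = tempdir(CLEANUP => 1);
for my $n (1, 2) {
    open my $out, '>', "$dir/file$n.txt" or die "Cannot write file$n.txt: $!";
    print {$out} ">record [gene=demo$n]\nACGT\n" x $n;    # file $n gets $n records
    close $out;
}

my %count = count_headers(glob "$dir/file?.txt");
print "$_: $count{$_}\n" for sort keys %count;
```

Keying the results by $ARGV keeps the output for each of the 43 files separate, even though the loop reads them as one stream.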