Hi, I have an input like this:

gene1 pos1 description1
gene2 pos2 description2a
gene2 pos2 description2b
gene2 pos2 description2c
gene3 pos3 description3
gene4 pos4 description4a
gene4 pos4 description4b

and I would like an output like this:

gene1 pos1 description1
gene2 pos2 description2a, description2b, description2c
gene3 pos3 description3
gene4 pos4 description4a, description4b

can anybody help me with the code?

#! /usr/local/bin/perl 
use warnings;
use strict;

#program to print one gene per line with all the ontologies
my @aoa=0;

my $input= "genes";
open (FILE,'<', $input) or die "can not open $input file";
while (my $line=<FILE>){
      my @array =split (/\t/, $line);
      push (@aoa,[@array]);
      my $i;
      for ($i=1,$i<$#aoa,$i++){
           if ($aoa[$i][0] eq $aoa[$i+1][0]){
               push (@{$aoa[$i]}, $aoa[$i+1][2]);
           }
           else{
               print "@{$aoa[$i]}\n";
          }
      }
}

#! /usr/local/bin/perl 
use warnings;
use strict;
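my ($gene, $pos, @descs);
#program to print one gene per line with all the ontologies
while (<DATA>) {
    chomp;
    next unless /\S/;    # skip blank lines
    # Limit of 3 keeps any spaces inside the description
    my ($g, $p, $desc) = split(/\s+/, $_, 3);
    # A new gene starts: print the one collected so far
    if (defined $gene and $g ne $gene) {
        @descs = printsummary($gene, $pos, @descs);
    }
    ($gene, $pos) = ($g, $p);
    push (@descs, $desc);
}
@descs = printsummary($gene, $pos, @descs) if defined $gene;

sub printsummary {
    my ($g, $p, @d) = @_;
    # gene and pos are separated by spaces, descriptions by ", "
    my $str = "$g $p " . join(", ", @d);
    print "$str\n";
    return ();
}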


__DATA__
gene1	pos1	description1
gene2	pos2	description2a
gene2	pos2	description2b
gene2	pos2	description2c
gene3	pos3	description3
gene4	pos4	description4a
gene4	pos4	description4b
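If lines for the same gene are not guaranteed to sit next to each other, a hash keyed on gene and pos is another option. This is only a minimal sketch, assuming whitespace-separated fields; `group_lines` is a hypothetical helper name, not something from the posts in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the descriptions of lines sharing gene and pos.
sub group_lines {
    my (@lines) = @_;
    my (%descs, @order);
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;                      # skip blank lines
        # Limit of 3 keeps any spaces inside the description
        my ($gene, $pos, $desc) = split ' ', $line, 3;
        my $key = "$gene $pos";
        push @order, $key unless exists $descs{$key};   # remember first-seen order
        push @{ $descs{$key} }, $desc;
    }
    return map { "$_ " . join(", ", @{ $descs{$_} }) } @order;
}

my @input = (
    "gene1 pos1 description1\n",
    "gene2 pos2 description2a\n",
    "gene2 pos2 description2b\n",
    "gene2 pos2 description2c\n",
    "gene3 pos3 description3\n",
    "gene4 pos4 description4a\n",
    "gene4 pos4 description4b\n",
);
print "$_\n" for group_lines(@input);
```

The `@order` array is there because a Perl hash does not keep insertion order; without it the genes could come out shuffled.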
#! /usr/local/bin/perl
use warnings;
use strict;

undef $/;			
my $format=<DATA>;

$format=~ s{	([^\n]+)	# 1. gene name: one or more non-newline characters
		(pos\d+)	# 2. 'pos' followed by one or more digits
		([^\n]+)	# 3. rest of the line, the first description
		(.*?\2[^\n]+)+  # 4. one or more later lines that repeat the same pos value
	    }
	    {&prepare_format($1, $2, $&)}gesx;
print $format;

sub prepare_format
{
	my ($name, $pos, $text)=@_;
	$text=~ s{\Q$name$pos\E}{}g; # Remove every occurrence of name and pos
	$text=~ s{\n}{,}g;      # Replace each newline with a comma
	$text="$name$pos".$text; # Put name and pos back at the start
	return $text;
}
__DATA__
gene1 pos1 description1
gene2 pos2 description2a
gene2 pos2 description2b
gene2 pos2 description2c
gene3 pos3 description3
gene4 pos4 description4a
gene4 pos4 description4b

Comment: Interesting alternative solution using Regex.

What's the meaning of gesx?
What's the meaning of undef $/?
What is the variable $&?
Many thanks!


Meaning of gesx
---------------------
g -> Global match: replace every match, not just the first one.

e -> Evaluate the replacement part as Perl code and use its result as the replacement text.

s -> Treat the string as a single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.

x -> Extend your pattern's legibility by permitting whitespace and comments.
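To see the four modifiers in action on something smaller, here is a rough sketch. The helper names `double_numbers` and `spans_lines` are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: doubles every number found in a string.
sub double_numbers {
    my ($text) = @_;
    # /g replaces every match, /e runs the replacement as Perl code,
    # /x allows the whitespace and the comment inside the pattern.
    $text =~ s{ (\d+)   # one or more digits
              }{ $1 * 2 }gex;
    return $text;
}

# Hypothetical helper: 1 if "start" is followed by "end", even across newlines.
sub spans_lines {
    my ($text) = @_;
    # Without /s the dot stops at a newline; with /s it matches it too.
    return $text =~ /start.*end/s ? 1 : 0;
}

print double_numbers("a1 b22 c3"), "\n";   # prints "a2 b44 c6"
print spans_lines("start\nend"), "\n";     # prints "1"
```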


Meaning of undef $/
--------------------------
$/ is the input record separator, a newline by default. Undefining it makes a single read return the whole input, newlines included, as one string.

For example:

$data=<DATA>;
print $data;
__DATA__
gene1 pos1 description1
gene2 pos2 description2a
gene2 pos2 description2b
gene2 pos2 description2c
gene3 pos3 description3
gene4 pos4 description4a
gene4 pos4 description4b

Now $data contains only the first line:
gene1 pos1 description1

Add undef $/; before the line that reads into $data.
Now $data holds the whole input, all seven lines.
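A common variation is to write `local $/;` inside a block or sub, so the separator is undefined only there and line-by-line reading keeps working elsewhere. A small sketch, assuming an in-memory filehandle in place of a real file; `slurp` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle as one string.
sub slurp {
    my ($fh) = @_;
    local $/;               # undefined only inside this sub, restored on exit
    my $content = <$fh>;
    return $content;
}

# An in-memory filehandle stands in for a real file here.
my $text = "gene1 pos1 description1\ngene2 pos2 description2a\n";

open(my $by_line, '<', \$text) or die "cannot open in-memory file: $!";
my $first = <$by_line>;     # default $/ is "\n", so this reads one line only
close $by_line;

open(my $whole, '<', \$text) or die "cannot open in-memory file: $!";
my $all = slurp($whole);    # whole input, newlines included
close $whole;

print "line mode read:  $first";
print "slurp mode read: ", length($all), " characters\n";
```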

Variable $&
--------------
It's a Perl special variable. It contains the string matched by the last successful pattern match.

For example:

$data =~ m{gene\d+ pos\d+};
print $&;

Now $& contains "gene1 pos1".
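For comparison, this sketch shows $& next to the numbered capture variables $1 and $2, which hold only the parenthesised parts of the match. The `match_parts` name is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the whole match, then the two captured parts.
sub match_parts {
    my ($data) = @_;
    # /x ignores the spaces in the pattern, so \s+ matches the real separator
    return unless $data =~ m{ (gene\d+) \s+ (pos\d+) }x;
    return ($&, $1, $2);
}

my ($whole, $gene, $pos) = match_parts("gene1 pos1 description1");
print "whole match: $whole\n";   # prints "whole match: gene1 pos1"
print "gene: $gene\n";           # prints "gene: gene1"
print "pos: $pos\n";             # prints "pos: pos1"
```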


Hope that's clear.
