Hi, I have an input like this:

gene1 pos1 description1
gene2 pos2 description2a
gene2 pos2 description2b
gene2 pos2 description2c
gene3 pos3 description3
gene4 pos4 description4a
gene4 pos4 description4b

and I would like an output like this:

gene1 pos1 description1
gene2 pos2 description2a, description2b, description2c
gene3 pos3 description3
gene4 pos4 description4a, description4b

Can anybody help me with the code?

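#! /usr/local/bin/perl
use warnings;
use strict;

# Program to print one gene per line with all the ontologies
my @aoa;

my $input = "genes";
open (my $fh, '<', $input) or die "can not open $input file: $!";
while (my $line = <$fh>) {
      chomp $line;
      my @array = split (/\t/, $line);
      next unless @array >= 3;    # skip blank or incomplete lines
      if (@aoa and $aoa[-1][0] eq $array[0]) {
           # Same gene as the previous row: keep only the new description
           push (@{$aoa[-1]}, $array[2]);
      }
      else {
           push (@aoa, [@array]);
      }
}
close ($fh);

# Print gene and position once, then the descriptions separated by commas
for my $row (@aoa) {
      my ($gene, $pos, @descs) = @$row;
      print "$gene $pos ", join(", ", @descs), "\n";
}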


#! /usr/local/bin/perl 
use warnings;
use strict;
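my ($gene, $pos, $d, @descs);
#program to print one gene per line with all the ontologies
while (<DATA>) {
    chomp;
    # A different gene name in the first column means the previous group is complete
    if (defined $gene and !m/^\Q$gene\E\s/) {
        @descs = printsummary($gene, $pos, @descs);
    }
    # Limit of 3 keeps a description that contains spaces in one piece
    ($gene, $pos, $d) = split(/\s+/, $_, 3);
    push (@descs, $d);
}
@descs = printsummary($gene, $pos, @descs) if defined $gene;

sub printsummary {
    my ($g, $p, @d) = @_;
    # Gene and position are separated by spaces, descriptions by commas
    my $str = "$g $p " . join(", ", @d);
    print "$str\n";
    return ();
}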


__DATA__
gene1	pos1	description1
gene2	pos2	description2a
gene2	pos2	description2b
gene2	pos2	description2c
gene3	pos3	description3
gene4	pos4	description4a
gene4	pos4	description4b

#! /usr/local/bin/perl
use warnings;
use strict;

undef $/;			
my $format=<DATA>;

$format=~ s{	([^\n]+)	# 1. gene name: everything on the line before 'pos'
		(pos\d+)	# 2. 'pos' followed by one or more digits
		([^\n]+)	# 3. rest of the line, i.e. the first description
		(.*?\2[^\n]+)+  # 4. one or more later lines that repeat the same pos, via backreference 2
	    }
	    {&prepare_format($1, $2, $&)}gesx;
print $format;

sub prepare_format
{
	my ($name, $pos, $text)=@_;
	$text=~ s{\Q$name$pos\E}{}g; # Remove every occurrence of name and pos
	$text=~ s{\n}{,}g;      # Replace each newline character with a comma
	$text="$name$pos".$text; # Put name and pos back once, at the beginning
	return $text;
}
__DATA__
gene1 pos1 description1
gene2 pos2 description2a
gene2 pos2 description2b
gene2 pos2 description2c
gene3 pos3 description3
gene4 pos4 description4a
gene4 pos4 description4b
Comments
Interesting alternative solution using Regex.

What is the meaning of gesx?
What is the meaning of undef $/?
What is the variable $&?
Many thanks!


Meaning of gesx
---------------------
g -> Global: replace every match in the string, not just the first one.

e -> Evaluate: treat the replacement part as Perl code and substitute its result.

s -> Treat the string as a single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.

x -> Extend your pattern's legibility by permitting whitespace and comments.
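
To see the four flags in isolation, here is a small sketch. The sample string and the "double every number" rule are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the original script.
my $text = "a=1\nb=2\nc=3";

# g: replace every match, not only the first one
# e: run the replacement part as Perl code and use its result
# x: allow whitespace and comments inside the pattern
(my $doubled = $text) =~ s{
    (\d+)        # a run of digits
}{ $1 * 2 }gex;

# s: let '.' match a newline as well
my $without_s = ($text =~ /a=1.b=2/)  ? 1 : 0;
my $with_s    = ($text =~ /a=1.b=2/s) ? 1 : 0;

print "$doubled\n";
print "without s: $without_s, with s: $with_s\n";
```

Without the s flag the dot cannot cross the newline between the first two lines, so only the second test matches.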


Meaning of undef $/
--------------------------
$/ is the input record separator, which is a newline by default. When it is undefined, a single read returns the whole input as one string instead of one line.

For example:

$data=<DATA>;
print $data;
__DATA__
gene1 pos1 description1
gene2 pos2 description2a
gene2 pos2 description2b
gene2 pos2 description2c
gene3 pos3 description3
gene4 pos4 description4a
gene4 pos4 description4b

Now $data contains only the first line:
gene1 pos1 description1

Add undef $/; before the line that reads into $data.
Now $data holds all seven lines as one string.
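
If you want to try the difference without a data file, the following sketch reads from a string through an in-memory filehandle. The variable names and the two-line sample are assumptions for the example. It also uses local $/ inside a block, a common alternative to undef $/ that restores the separator afterwards.

```perl
use strict;
use warnings;

# Hypothetical sample input held in a string so the sketch is self-contained.
my $input = "gene1 pos1 description1\ngene2 pos2 description2a\n";

# Default behaviour: $/ is "\n", so one read returns one line.
open(my $fh, '<', \$input) or die "Cannot open in-memory file: $!";
my $first_line = <$fh>;
close($fh);

# Slurp mode: with $/ undefined, one read returns everything.
# 'local' puts the old value of $/ back when the block ends.
my $whole;
{
    local $/;
    open(my $fh2, '<', \$input) or die "Cannot open in-memory file: $!";
    $whole = <$fh2>;
    close($fh2);
}

print "First read: $first_line";
print "Slurped:\n$whole";
```

Keeping the change inside a block matters in longer scripts, because any later line-by-line read would otherwise also return the whole file.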

Variable $&
--------------
It is a Perl special variable. It contains the string matched by the last successful pattern match.

For example:

$data =~ m{gene\d+ pos\d+};
print $&;

Now $& contains "gene1 pos1" (assuming $data holds the sample data above).
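
A short sketch of how $& differs from a capture group such as $1. The one-line sample and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical one-line sample.
my $data = "gene1 pos1 description1";

my ($whole_match, $gene);
if ($data =~ m{(gene\d+) pos\d+}) {
    $whole_match = $&;   # everything the pattern matched
    $gene        = $1;   # only the first parenthesised group
}

print "whole match: $whole_match\n";
print "first group: $gene\n";
```

Both variables are overwritten by the next successful match, so copy them into your own variables straight away, as the sketch does.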


Hope that is clear.
