Hello!

I am very new to perl and linux systems in general, so I am hoping that someone can help with this problem!

I have a very large table of data (100 rows and between 300000 to 2 million columns) that looks like this:

sitenames	site1	site2	site3	site4	site5
sample10	A	A	N	G	T
sample22	N	N	C	G	T
sample35	N	A	M	G	N
sample47	A	A	N	N	T
sample84	G	G	A	C	N
sample240	N	A	N	C	N

I want a list of sites that are synapomorphic (have letters that occur in only certain samples and no others) for sample84, sample85, sample86, sample87, sample88, sample89, sample90 and sample240. N means missing data, M means it is a A or a C, R is A or G, Y is C or T, K is G or T, S is C or G, and W is A or T.

So the best way that I could figure to do this was to look at each column in turn, first see if samples 84-90 and 240 all have the same letter call (N or A, N or G, N or C, and N or T) but if they have more than one letter call (A and G and C and T) then delete the column.

If they were the same, then look at the rest of the samples and see if any have the same letter call. If any do, delete the column, but if none do, spit out the column name (site#) to a new file, and append as it moves through the columns.

So essentially the output for the above set would look like:

site1
site4

Hopefully that makes sense. I have no idea how to turn this into a perl script, or where to start! Please help!

By saying site1, site4 i assume you mean all of their values like:
site1: A N N A G N
site4: G G G N C C
But could you please elaborate how exactly you get this result?

By saying site1, site4 i assume you mean all of their values like:
site1: A N N A G N
site4: G G G N C C
But could you please elaborate how exactly you get this result?

Sure thing!
I just want the list "site1, site4", I don't need the values.

Site1 is good because
a) sample84 and 240 only have one letter call between them, G. The N is missing data, so it can be disregarded. There are no other letter calls in those samples (no A, C, T, R, Y, K, M, S, or W).
b) and none of the other samples have the letter call that sample84 and 240 have (no other G calls).

Site2 is no good because sample84 and 240 have different letter calls (G and A).

Site3 is no good because even though sample84 and 240 have one letter call (A), sample35 has a M at that site (M = A or C). So sample84 and 240 were not unique.

Site4 is good because
a) again sample84 and 240 only have one letter (C)
b) and no other samples have that call.

Site5 is no good because sample84 and 240 don't have a letter call (N = missing data).

I hope this makes sense. It's really complicated, so let me know if you have questions!

Well first of all we need a way to search and compare the data:

use strict;
use warnings "all";

my %hash;
my @columns;

open my $IN, "<in.txt";
while(<$IN>)
{
	if($_ =~ /^sitenames/)
	{
		@columns = split(/\s+?/, $_);		
	}
	else
	{
		my @tmp = split(/\s+?/, $_);
		for my $i (0 .. $#tmp)
		{			
			push(@{$hash{$columns[$i]}}, $tmp[$i]);
		}
	}
}
close $IN;

This will allow us to go through the data column by column like this:

# for every column
foreach my $column (sort keys %hash)
{	
	# don't check the column that contains the sample names
	unless($column eq "sitenames")
	{			
		# for every row in column
		for my $row (0 .. $#{$hash{$column}})
		{
                        # only check the samples we are interested in
			if($hash{"sitenames"}[$row] =~ 
                        /(sample84|sample85|sample86|sample87|
                        sample88|sample89|sample90|sample240)/)
			{
                             # do some checks here
                        }
                }
        }
}

Sadly it's a bit late right now so i try to figure out the actual sorting tomorrow if d5e5 hasn't solved it by then.

Apparently i can't edit my post so i have to make a new one.
It looks like the following is working. I made some changes to your input file and it appeared to still be working. However this is most likely the worst way to get this done and contains bugs. Also keep in mind i have no idea how this will hold up when tested with your 2 million columns.
That being said here's the code:

use strict;
use warnings "all";

my %hash;
my @columns;

my @samples = qw(sample84 sample85 sample86 sample87 sample88 sample89 sample90 sample240);
my %subs;

push(@{$subs{"M"}}, "A");
push(@{$subs{"M"}}, "C");
push(@{$subs{"R"}}, "A");
push(@{$subs{"R"}}, "A");
push(@{$subs{"Y"}}, "C");
push(@{$subs{"Y"}}, "T");
push(@{$subs{"K"}}, "G");
push(@{$subs{"K"}}, "T");
push(@{$subs{"S"}}, "C");
push(@{$subs{"S"}}, "G");
push(@{$subs{"W"}}, "A");
push(@{$subs{"W"}}, "T");

open my $IN, "<in.txt";
while(<$IN>)
{
	if($_ =~ /^sitenames/)
	{
		@columns = split(/\s+?/, $_);		
	}
	else
	{
		my @tmp = split(/\s+?/, $_);
		for my $i (0 .. $
		{			
			push(@{$hash{$columns[$i]}}, $tmp[$i]);
		}
	}
}
close $IN;

foreach my $column (sort keys %hash)
{	
	my $compare = "";
	my $b = 1;		
	unless($column eq "sitenames")
	{	
		for my $row (0 .. $
		{			
			if($hash{"sitenames"}[$row] ~~ @samples)
			{								
				unless($hash{$column}[$row] eq "N")
				{
					if($compare eq "")
					{
						$compare = $hash{$column}[$row];					
					}
					if($hash{$column}[$row] ne $compare)
					{
						$b = 0;
						last;
					}					
					$compare = $hash{$column}[$row];					
				}
			}			
		}		
		if($b == 1)
		{
			for my $row (0 .. $
			{				
				unless($hash{"sitenames"}[$row] ~~ @samples)
				{
					if($compare eq "")
					{
						$compare = $hash{$column}[$row];
					}
					if($hash{$column}[$row] eq $compare || $compare ~~ $subs{$hash{$column}[$row]})
					{
						$b = 0;
						last;
					}					
				}
			}
		}		
		if($b == 1)
		{
			print "$column\n";
		}
	}
}

The code itself is also pretty horrible.
We basically make sure the earlier defined samples have different letters.
If they do we make sure the letter in question is also unique among the whole column (taking into account those pesky substitutions).

I'd really like someone with actual Perl work experience to take a look at this problem.

Apparently i can't edit my post so i have to make a new one.
It looks like the following is working. I made some changes to your input file and it appeared to still be working. However this is most likely the worst way to get this done and contains bugs. Also keep in mind i have no idea how this will hold up when tested with your 2 million columns.
That being said here's the code:

use strict;
use warnings "all";

my %hash;
my @columns;

my @samples = qw(sample84 sample85 sample86 sample87 sample88 sample89 sample90 sample240);
my %subs;

push(@{$subs{"M"}}, "A");
push(@{$subs{"M"}}, "C");
push(@{$subs{"R"}}, "A");
push(@{$subs{"R"}}, "A");
push(@{$subs{"Y"}}, "C");
push(@{$subs{"Y"}}, "T");
push(@{$subs{"K"}}, "G");
push(@{$subs{"K"}}, "T");
push(@{$subs{"S"}}, "C");
push(@{$subs{"S"}}, "G");
push(@{$subs{"W"}}, "A");
push(@{$subs{"W"}}, "T");

open my $IN, "<in.txt";
while(<$IN>)
{
	if($_ =~ /^sitenames/)
	{
		@columns = split(/\s+?/, $_);		
	}
	else
	{
		my @tmp = split(/\s+?/, $_);
		for my $i (0 .. $
		{			
			push(@{$hash{$columns[$i]}}, $tmp[$i]);
		}
	}
}
close $IN;

foreach my $column (sort keys %hash)
{	
	my $compare = "";
	my $b = 1;		
	unless($column eq "sitenames")
	{	
		for my $row (0 .. $
		{			
			if($hash{"sitenames"}[$row] ~~ @samples)
			{								
				unless($hash{$column}[$row] eq "N")
				{
					if($compare eq "")
					{
						$compare = $hash{$column}[$row];					
					}
					if($hash{$column}[$row] ne $compare)
					{
						$b = 0;
						last;
					}					
					$compare = $hash{$column}[$row];					
				}
			}			
		}		
		if($b == 1)
		{
			for my $row (0 .. $
			{				
				unless($hash{"sitenames"}[$row] ~~ @samples)
				{
					if($compare eq "")
					{
						$compare = $hash{$column}[$row];
					}
					if($hash{$column}[$row] eq $compare || $compare ~~ $subs{$hash{$column}[$row]})
					{
						$b = 0;
						last;
					}					
				}
			}
		}		
		if($b == 1)
		{
			print "$column\n";
		}
	}
}

The code itself is also pretty horrible.
We basically make sure the earlier defined samples have different letters.
If they do we make sure the letter in question is also unique among the whole column (taking into account those pesky substitutions).

I'd really like someone with actual Perl work experience to take a look at this problem.

replic you are beautiful!!! I will give this script a try and let you know how it goes. Whether it works or not, you are pretty much my hero for helping me! Thank you!!

I just noticed i accidentally clipped the code, my bad.

use strict;
use warnings "all";

my %hash;
my @columns;
# these are the samples we are interested in
my @samples = qw(sample84 sample85 sample86 sample87 sample88 sample89 sample90 sample240);
my %subs;

push(@{$subs{"M"}}, "A");
push(@{$subs{"M"}}, "C");
push(@{$subs{"R"}}, "A");
push(@{$subs{"R"}}, "A");
push(@{$subs{"Y"}}, "C");
push(@{$subs{"Y"}}, "T");
push(@{$subs{"K"}}, "G");
push(@{$subs{"K"}}, "T");
push(@{$subs{"S"}}, "C");
push(@{$subs{"S"}}, "G");
push(@{$subs{"W"}}, "A");
push(@{$subs{"W"}}, "T");

open my $IN, "<in.txt";
while(<$IN>)
{
	if($_ =~ /^sitenames/)
	{
		@columns = split(/\s+?/, $_);		
	}
	else
	{
		my @tmp = split(/\s+?/, $_);
		for my $i (0 .. $#tmp)
		{			
			push(@{$hash{$columns[$i]}}, $tmp[$i]);
		}
	}
}
close $IN;

# for every column (site)
foreach my $column (sort keys %hash)
{	
	my $compare = "";
	my $b = 1;	
	# don't check the column that contains the sample names
	unless($column eq "sitenames")
	{	
		for my $row (0 .. $#{$hash{$column}})
		{
			# only look at the samples we defined earlier
			if($hash{"sitenames"}[$row] ~~ @samples)
			{				
				# N means no data so we skip it
				unless($hash{$column}[$row] eq "N")
				{
					if($compare eq "")
					{
						$compare = $hash{$column}[$row];					
					}
					# if the current letter is different from the last one we skip too
					# if($hash{$column}[$row] eq "N" && $compare ne "N")
					# {
						# next;
					# }
					if($hash{$column}[$row] ne $compare)
					{
						$b = 0;
						last;
					}					
					$compare = $hash{$column}[$row];					
				}
			}			
		}		
		# only do the 2nd step if the first one was a success
		# now make sure the letter is unique in this column excluding our defined samples
		if($b == 1)
		{
			for my $row (0 .. $#{$hash{$column}})
			{
				# only look at the other samples now				
				unless($hash{"sitenames"}[$row] ~~ @samples)
				{
					# abort if the letter is not unique
					# check the normal letter first			
					if($compare eq "")
					{
						$compare = $hash{$column}[$row];
					}
					if($hash{$column}[$row] eq $compare || $compare ~~ $subs{$hash{$column}[$row]})
					{
						$b = 0;
						last;
					}
					# now check for substitutes					
				}
			}
		}
		# if everything went well we can print the respective column
		if($b == 1)
		{
			print "$column\n";
		}
	}
}

Edited 4 Years Ago by replic: n/a

Comments
Looks like a lot of work but seems to run OK on the sample data.

Hi replic,
Two things - first, your script worked beautifully, thank you for writing it for me!
Second, I'm writing up my thesis and would like to add you to the acknowledgements section, but my supervisor doesn't want me to use your forum username. If you don't mind, could I have your real name please? You can email me at hmbird-at-ualberta-dot-ca, if you would like to keep it off the forum.
Cheers

This question has already been answered. Start a new discussion instead.