I am very new to Perl and Linux systems in general, so I am hoping that someone can help with this problem!
I have a very large table of data (100 rows and between 300,000 and 2 million columns) that looks like this:
    sitenames  site1  site2  site3  site4  site5
    sample10   A      A      N      G      T
    sample22   N      N      C      G      T
    sample35   N      A      M      G      N
    sample47   A      A      N      N      T
    sample84   G      G      A      C      N
    sample240  N      A      N      C      N
I want a list of sites that are synapomorphic (have a letter that occurs only in certain samples and in no others) for sample84, sample85, sample86, sample87, sample88, sample89, sample90 and sample240. N means missing data, M means A or C, R means A or G, Y means C or T, K means G or T, S means C or G, and W means A or T.
The best approach I could come up with is to look at each column in turn. First, check whether samples 84-90 and 240 all share a single letter call, ignoring N (so the column holds only N and A, only N and G, only N and C, or only N and T for those samples). If they have more than one letter call (for example both A and G), discard the column.
If they do share a single call, look at the rest of the samples and check whether any of them has that same letter call. If any does, discard the column; if none does, write the column name (site#) to a new file, appending as the script moves through the columns.
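A rough Perl sketch of that column check, under several assumptions: the table is whitespace-separated with one sample per row, the first header field is the `sitenames` label, and ambiguity codes such as M are treated as ordinary letters that differ from A and C. The helper name `synapomorphic_sites` is only illustrative, and this has not been tried on the full-size data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the table as an array ref of lines (header first)
# and a hash ref whose keys are the target sample names.
# Returns the names of the sites that pass the check described above.
sub synapomorphic_sites {
    my ($lines, $is_target) = @_;
    my @rows = @$lines;
    my (undef, @sites) = split ' ', shift @rows;   # drop the "sitenames" label
    my @in  = ('') x @sites;   # distinct non-N calls seen in the target samples
    my @out = ('') x @sites;   # distinct non-N calls seen in all other samples

    for my $line (@rows) {
        my ($sample, @calls) = split ' ', $line;
        next unless defined $sample;               # skip blank lines
        my $seen = $is_target->{$sample} ? \@in : \@out;
        for my $i (0 .. $#sites) {
            my $c = $calls[$i] // 'N';             # treat a short row as missing data
            next if $c eq 'N';                     # missing data never counts
            $seen->[$i] .= $c if index($seen->[$i], $c) < 0;
        }
    }

    # Keep a site when the targets agree on exactly one letter
    # and no other sample carries that letter.
    return map  { $sites[$_] }
           grep { length($in[$_]) == 1 && index($out[$_], $in[$_]) < 0 }
           0 .. $#sites;
}

# The small example table from above.
my @table = (
    'sitenames site1 site2 site3 site4 site5',
    'sample10 A A N G T',
    'sample22 N N C G T',
    'sample35 N A M G N',
    'sample47 A A N N T',
    'sample84 G G A C N',
    'sample240 N A N C N',
);
my %targets = map { $_ => 1 }
    qw(sample84 sample85 sample86 sample87 sample88 sample89 sample90 sample240);

print "$_\n" for synapomorphic_sites(\@table, \%targets);
```

The sketch only keeps two short strings per site, so each row is looked at once and then forgotten. For the real file the rows would probably need to be read one at a time from a filehandle rather than held in an array, and the treatment of ambiguity codes would need deciding (for example whether M in another sample should count as a match for A).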
So essentially the output for the above set would look like:
Hopefully that makes sense. I have no idea how to turn this into a Perl script, or where to start! Please help!