I am very very new to perl and am not even sure if awk is what I should be using instead. Hopefully someone out there can help me put together a script for this problem. Here is some sample data (it is tab delineated in my file):

20 scaffold189_125 13634 13384 13884 D aga-miR-1890_MI -2 1.28e-01
20 scaffold189_125 17953 17703 18203 D aga-miR-1890_MI -2 1.28e-01
22 scaffold208_17_ 4663 4413 4913 D aga-miR-315_MIM -2 9.75e-03
20 scaffold1052_5_ 11884 11634 12134 D aga-miR-34_MIMA -1 4.34e-03
20 scaffold1052_5_ 11705 11455 11955 D aga-miR-73_MIMA -2 6.32e-02
20 scaffold208_22_ 4742 4498 4998 D aga-miR-24_MIMA -1 3.29e-01

I want to take a file with several columns. I want the data to sort by column initially to determine all values which are the same in column 2 and group them accordingly. Any non-grouped lines should not be deleted.

20 scaffold189_125 13634 13384 13884 D aga-miR-1890_MI -2 1.28e-01
20 scaffold189_125 17953 17703 18203 D aga-miR-1890_MI -2 1.28e-01

20 scaffold1052_5_ 11884 11634 12134 D aga-miR-34_MIMA -1 4.34e-03
20 scaffold1052_5_ 11705 11455 11955 D aga-miR-73_MIMA -2 6.32e-02

notice that scaffold208_17 and scaffold208_22 are not the same.

Then, for each set with matching values in column 2, I want the program to take the values in columns 4 and 5 and take those values as a range. I want it to then scan all the values in column 3, within a group with a matching column 2, and determine if the value in column 3 falls in that first range of the group. If it does fall within the range, I want it deleted, if it does not then it stays. I want this to loop for every line with a matching column2 - checking to see if the value in column 3 falls within any of the ranges of the other lines with a matching column 2. This will avoid redundancy in my data set. I then need the program to do this for every grouping with matching column 2.

The output would include any lines not part of a group (such as scaffold208_17) and non-redundant lines belonging to a group

for the set above:

20 scaffold189_125 13634 13384 13884 D aga-miR-1890_MI -2 1.28e-01
20 scaffold189_125 17953 17703 18203 D aga-miR-1890_MI -2 1.28e-01
22 scaffold208_17_ 4663 4413 4913 D aga-miR-315_MIM -2 9.75e-03
20 scaffold1052_5_ 11884 11634 12134 D aga-miR-34_MIMA -1 4.34e-03
20 scaffold208_22_ 4742 4498 4998 D aga-miR-24_MIMA -1 3.29e-01

Many thanks!

A quick clarification... you said " If it does fall within the range, I want it deleted, if it does not then it stays." When I wrote this code, it didn't find records that matched this condition, did you mean " If it does NOT fall within the range, I want it deleted, if it does then it stays."?

Because if I remove it if it doesn't fall in the range, things work fine. If I remove it if it falls in the range, no data.

use strict;
use warnings;

my %hash;
open(IN,"<biodata.txt");
while(<IN>){
	chomp;
	my(@cols)=split(/\t/);
	push @{$hash{$cols[1]}},$_;
}
close IN;
open (OUT, ">biooutput.txt");
for (keys %hash){
	my $count = @{$hash{$_}};
	if($count>1){
		for (@{$hash{$_}}){
			my (@cols)=split(/\t/);
			if(($cols[2]>=$cols[3] && $cols[2]<=$cols[4]) || ($cols[2]>=$cols[4] && $cols[2]<=$cols[3])){
				#the condition above is for it falling in the range
				# if you want it NOT to fall in the range use
				# if(!($cols[2]>=$cols[3] && $cols[2]<=$cols[4]) || ($cols[2]>=$cols[4] && $cols[2]<=$cols[3])){
				print OUT "$_\n";
			}
		}
	}
}
close OUT;

biooutput.txt:

20	scaffold189_125	13634	13384	13884	D	aga-miR-1890_MI	-2	1.28e-01
20	scaffold189_125	17953	17703	18203	D	aga-miR-1890_MI	-2	1.28e-01
20	scaffold1052_5_	11884	11634	12134	D	aga-miR-34_MIMA	-1	4.34e-03
20	scaffold1052_5_	11705	11455	11955	D	aga-miR-73_MIMA	-2	6.32e-02

BTW, you "tab delimited" file is not completely tab delimited. I had to fix that.

And yes, I used an anonymous array held in a hash. Sorry if it's complex, but it's efficient.

Edited 6 Years Ago by mitchems: n/a

Thanks so much for helping me out! Unfortunately, the code isn't working for me. I do need it to only give me the lines for which column 3 does not fall within the range of any other line item with the same value in column 2. Any ideas on how to tweak this? The sample data and output should satisfy the different scenarios.

by adding an "else print" could we keep from eliminating the lines for which there are only 1 occurrance in column 2?

Edited 6 Years Ago by bio-grad: n/a

OK, here's what I don't understand... maybe I don't get the "range" idea... Your desired output in your original message is:

20      scaffold189_125	13634	13384	13884	D	aga-miR-1890_MI	-2	1.28e-01
20	scaffold189_125	17953	17703	18203	D	aga-miR-1890_MI	-2	1.28e-01
22	scaffold208_17_	4663	4413	4913	D	aga-miR-315_MIM	-2	9.75e-03
20	scaffold1052_5_	11884	11634	12134	D	aga-miR-34_MIMA	-1	4.34e-03
20      scaffold208_22_ 4742    4498    4998    D       aga-miR-24_MIMA -1      3.29e-01

So what does it mean by "in range" - I thought it meant that the value of col#3 is between the values of col#4 and col#5. Am I wrong? If I am right, then the data will not match anything. All of the values of col#3 in your sample output are between the values of cols 4 & 5. Correct?

I need more information. As for my code not workings, one thing you will need to do is fix your original input file to actually BE tab-delimited and name it biodata.text and put it in the directory in which you run the script.

I'd appreciate it if you'd let me know what "not working" means.

And how are these lines:

22   scaffold208_17_ 4663 4413 4913 D aga-miR-315_MIM -2 9.75e-03
20   scaffold208_22_ 4742 4498 4998 D aga-miR-24_MIMA -1 3.29e-01

"Grouped"? You explained your problem saying that you wanted to remove any lines that are not grouped - which to me meant, based on your post, those lines that didn't have the same value as any other line in col#2 - correct? Those two lines do not have another column in which the value is the same as other lines in the file.

I think you need to clarify what you actually want.

Okay, let's see if I can explain this any better. Alright, the scenario is kind of like if column 2 were the street name and column 3 the house number. Columns 4 and 5 are values 250 to either side of that in column 3 - creating a kind of perimeter around the "house". I am interested in two things - 1) houses on the same street that are >250 away from each other (as in your script where $count>1 and none of the values in column 3 in that group with the same column 2 value fall within 250 of each other ie. the range shown in column 4 and 5) or 2) if they are on a street by themselves (value in column 2 $count==1).

What "didn't work for me" was that the script doesn't provide the listings where column 2 (the street) is listed only once and the script tells me when the values fall within the range rather than when they are outside of each others ranges.

Explanation of input vs. output (I'll refer to them by input line #):
line #1 and #2 have the same value in column 2 as does #4 and #5. Therefore lines #3 and #6 should go straight to print in the new file.
Next #1 and #2: the range for #1 is 13384 - 13884; the column 3 value for #2 is 17953 which is not within the range. Therefore both #1 and #2 are now printed to the file.
Next #4 and #5: the range for #4 is 11634 - 12134; the column 3 value for #5 is 11705 which is within the range. Therefore #4 is printed while #5 is not printed to the new file as it falls within the range of another with the same column 2 value.

There will be more than two listings with the same value in column 2 (as many as 200) so I need to make sure that this will loop and do the range comparison of each listing to the column 3 value of each listing to make sure there is no overlap. Where there is overlap, only one instance is printed to the new file.

Edited 6 Years Ago by bio-grad: n/a

Explanation of input vs. output (I'll refer to them by input line #):

line #1 and #2 have the same value in column 2 as does #4 and #5. Therefore lines #3 and #6 should go straight to print in the new file.

Mike: OK, there's one thing I didn't understand from your original request, the idea that "ungrouped" lines go directly into the new file and do not go into the trashbin. That's what I am doing with the $count>1 thing.

Next #1 and #2: the range for #1 is 13384 - 13884; the column 3 value for #2 is 17953 which is not within the range. Therefore both #1 and #2 are now printed to the file.

Mike: OK, so you want to print those in which the second "grouped" file's col#3 is compared against the first's cols4 and 5. I didn't get that either.


Next #4 and #5: the range for #4 is 11634 - 12134; the column 3 value for #5 is 11705 which is within the range. Therefore #4 is printed while #5 is not printed to the new file as it falls within the range of another with the same column 2 value.


Mike: ok, now I get it. None of that should be too tough BTW. I'll get back to you shortly.

OK, I think this will work. The only problem I see with it is that the output file is not in the same ORDER as the input file. That's because of the keys function, I believe...

use strict;
use warnings;
my %hash;
open(IN,"<biodata.txt");
while(<IN>){
	chomp;
	my(@cols)=split(/\t/);
	push @{$hash{$cols[1]}},$_;
}
close IN;
open (OUT, ">biooutput.txt");
for (keys %hash){
	my $count = @{$hash{$_}};
	if($count>1){
		my $x=0;
		my @hcols;
		for (@{$hash{$_}}){
			$x++;
			if($x==1){
				(@hcols)=split(/\t/);
				 next;
			}else{
				my (@cols)=split(/\t/);
				if(!($cols[2]>=$hcols[3] && $cols[2]<=$hcols[4]) || ($cols[2]>=$hcols[4] && $cols[2]<=$hcols[3])){
					my $line1=join("\t",@hcols);
					print OUT "$line1\n$_\n";
				}else{
					print OUT "$_\n";
				}
			}
		}
	}else{
			print OUT "@{$hash{$_}}[0]\n";
	}
}
close OUT;

Thanks so much! The order doesn't matter over much because these will all be put through another program next.

okay, not sure what the trouble is - the code worked just like I wanted it to on the sample data where there were max 2 comparisons. When I ran it with more, it starts to give me multiple copies of a single line, as though it compares and then prints, but doesn't take note as to whether it was already added. Maybe it needs to not print until it has been shown that it doesn't fall within the range rather than after each comparison? Anyhow, it will make better sense with more data so here's a larger sample set and the output the above posted code gave me:

biodata.txt
12 scaffold534_10_ 147 -103 397 D mdv1-miR-M3*_MI -1 1.77e+02
12 scaffold534_10_ 1391 1141 1641 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_10_ 1525 1275 1775 D cin-miR-4218-5p -2 4.62e-01
16 scaffold534_10_ 5765 5515 6015 D mmu-miR-546_MIM -2 2.07e+01
21 scaffold534_10_ 6625 6375 6875 D ath-miR414_MIMA -2 3.54e-02
12 scaffold534_13_ 1969 1719 2219 D mdv1-miR-M3*_MI -2 2.92e+03
18 scaffold534_15_ 208 -42 458 D cin-miR-4194-3p -2 1.65e+00
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16182 15932 16432 D gma-miR1533_MIM -2 4.62e-01
19 scaffold534_16_ 16185 15935 16435 D gma-miR1533_MIM -2 4.62e-01
19 scaffold534_16_ 16188 15938 16438 D gma-miR1533_MIM -2 4.62e-01
19 scaffold534_16_ 16191 15941 16441 D gma-miR1533_MIM -2 4.62e-01
19 scaffold534_16_ 16194 15944 16444 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 17672 17422 17922 D mdv1-miR-M3*_MI -2 2.92e+03

biooutput.txt
12 scaffold534_13_ 1969 1719 2219 D mdv1-miR-M3*_MI -2 2.92e+03
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16182 15932 16432 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16185 15935 16435 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16188 15938 16438 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16191 15941 16441 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16194 15944 16444 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
12 scaffold534_16_ 17672 17422 17922 D mdv1-miR-M3*_MI -2 2.92e+03
12 scaffold534_10_ 147 -103 397 D mdv1-miR-M3*_MI -1 1.77e+02
12 scaffold534_10_ 1391 1141 1641 D mdv1-miR-M3*_MI -2 2.92e+03
12 scaffold534_10_ 147 -103 397 D mdv1-miR-M3*_MI -1 1.77e+02
19 scaffold534_10_ 1525 1275 1775 D cin-miR-4218-5p -2 4.62e-01
12 scaffold534_10_ 147 -103 397 D mdv1-miR-M3*_MI -1 1.77e+02
16 scaffold534_10_ 5765 5515 6015 D mmu-miR-546_MIM -2 2.07e+01
12 scaffold534_10_ 147 -103 397 D mdv1-miR-M3*_MI -1 1.77e+02
21 scaffold534_10_ 6625 6375 6875 D ath-miR414_MIMA -2 3.54e-02
18 scaffold534_15_ 208 -42 458 D cin-miR-4194-3p -2 1.65e+00

As you can see, I'm getting more in the output than what I started with. Anyone have a quick way to solve this problem?

Here is the optimal output:
12 scaffold534_13_ 1969 1719 2219 D mdv1-miR-M3*_MI -2 2.92e+03
12 scaffold534_16_ 8087 7837 8337 D mdv1-miR-M3*_MI -2 2.92e+03
19 scaffold534_16_ 16182 15932 16432 D gma-miR1533_MIM -2 4.62e-01
12 scaffold534_16_ 17672 17422 17922 D mdv1-miR-M3*_MI -2 2.92e+03
12 scaffold534_10_ 147 -103 397 D mdv1-miR-M3*_MI -1 1.77e+02
19 scaffold534_10_ 1525 1275 1775 D cin-miR-4218-5p -2 4.62e-01
16 scaffold534_10_ 5765 5515 6015 D mmu-miR-546_MIM -2 2.07e+01
21 scaffold534_10_ 6625 6375 6875 D ath-miR414_MIMA -2 3.54e-02
18 scaffold534_15_ 208 -42 458 D cin-miR-4194-3p -2 1.65e+00

Edited 6 Years Ago by bio-grad: n/a

Please try this approach.

#!/usr/bin/perl
use strict;
use warnings;

my (@in, @out);
while(<DATA>){
    chomp;
    push @in, [split(/\t/)];#Build an array of arrays
}

my $prev_aref;
foreach my $aref (@in){#foreach array reference
    if (!defined($prev_aref)
        or $$aref[1] ne $$prev_aref[1]
        or abs($$aref[2] - $$prev_aref[2]) >= 250){#At least 250 away from prev loc
        push @out, $aref;
        $prev_aref = $aref;
    }
}

@out = sort my_sort @out;

foreach my $aref (@out){
    print join "\t", @$aref, "\n";
}

sub my_sort{
    my $r = $$a[1] cmp $$b[1]; #Compare group
    if ($r == 0) { #If group in same group
        $r = $$a[2] <=> $$b[2]; #Compare location
    }
    return $r;
}
__DATA__
12	scaffold534_10_	147	-103	397	D	mdv1-miR-M3*_MI	-1	1.77e+02
12	scaffold534_10_	1391	1141	1641	D	mdv1-miR-M3*_MI	-2	2.92e+03
19	scaffold534_10_	1525	1275	1775	D	cin-miR-4218-5p	-2	4.62e-01
16	scaffold534_10_	5765	5515	6015	D	mmu-miR-546_MIM	-2	2.07e+01
21	scaffold534_10_	6625	6375	6875	D	ath-miR414_MIMA	-2	3.54e-02
12	scaffold534_13_	1969	1719	2219	D	mdv1-miR-M3*_MI	-2	2.92e+03
18	scaffold534_15_	208	-42	458	D	cin-miR-4194-3p	-2	1.65e+00
12	scaffold534_16_	8087	7837	8337	D	mdv1-miR-M3*_MI	-2	2.92e+03
19	scaffold534_16_	16182	15932	16432	D	gma-miR1533_MIM	-2	4.62e-01
19	scaffold534_16_	16185	15935	16435	D	gma-miR1533_MIM	-2	4.62e-01
19	scaffold534_16_	16188	15938	16438	D	gma-miR1533_MIM	-2	4.62e-01
19	scaffold534_16_	16191	15941	16441	D	gma-miR1533_MIM	-2	4.62e-01
19	scaffold534_16_	16194	15944	16444	D	gma-miR1533_MIM	-2	4.62e-01
12	scaffold534_16_	17672	17422	17922	D	mdv1-miR-M3*_MI	-2	2.92e+03

This gives the following output.

12	scaffold534_10_	147	-103	397	D	mdv1-miR-M3*_MI	-1	1.77e+02	
12	scaffold534_10_	1391	1141	1641	D	mdv1-miR-M3*_MI	-2	2.92e+03	
16	scaffold534_10_	5765	5515	6015	D	mmu-miR-546_MIM	-2	2.07e+01	
21	scaffold534_10_	6625	6375	6875	D	ath-miR414_MIMA	-2	3.54e-02	
12	scaffold534_13_	1969	1719	2219	D	mdv1-miR-M3*_MI	-2	2.92e+03	
18	scaffold534_15_	208	-42	458	D	cin-miR-4194-3p	-2	1.65e+00	
12	scaffold534_16_	8087	7837	8337	D	mdv1-miR-M3*_MI	-2	2.92e+03	
19	scaffold534_16_	16182	15932	16432	D	gma-miR1533_MIM	-2	4.62e-01	
12	scaffold534_16_	17672	17422	17922	D	mdv1-miR-M3*_MI	-2	2.92e+03

looks great! thanks d5e5!

You're welcome. After posting that I realised I had assumed the input data had already been sorted by group and location, as your latest input data appeared to be. If future input data does not come in that sequence you will need to add the following line after @in has been built, and before processing it. @in = sort my_sort @in;

This question has already been answered. Start a new discussion instead.