I have been beating my head against the desk over this issue, and I don't think it's a simple 'uniq' or 'sort' problem.

I have a file with many duplicate values in it.



The output I am looking for would only have the following from the above file:



Everything I've found so far either removes all the dupes or keeps one copy of 'owl' or 'dog', which is not what I need. If a value is duplicated, I don't want it in the output at all. The file I'm working with is one I merged from two other files of nearly 50,000 lines each, so you can understand why doing this by hand takes so long.



I could do this simply in C, C++, or Java: keep a map where the key is the data and the value is the number of times you have seen it. For each input line, look up the data in the map; if it is not found, add a new entry with a value of 1, and if it is found, increment the value. When done reading the data, walk through the map and output only the entries with a value of 1.

Maps are standard constructs in C++ and Java. In C you would use a structure with a character array (or pointer) for the data and an integer for the count, and use an array of these structs as a substitute for a map.
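If you'd rather not drop into a compiled language, awk's associative arrays can stand in for the map described above. Here's a sketch, where sample.txt is a hypothetical stand-in for your merged file:

```shell
# Count every line into an associative array, then print only the
# lines seen exactly once (the "map of data -> count" idea above).
awk '{ count[$0]++ } END { for (k in count) if (count[k] == 1) print k }' sample.txt
```

Note that the output order of `for (k in count)` is unspecified, so pipe through `sort` if order matters.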


So it's going to be way more complicated than I thought :( Guess I'll stick with the spreadsheet method. I'm under a bit of a time crunch.

Thank you anyway.


You can easily do this in a shell. If your file is named foo.in then you would get what you want with:

sort foo.in | 
   uniq -c | 
   awk '($1 == 1) {print $2}'

This gives you the entries that occur exactly once in the input file.

Now, depending on how large your file is, it may take some time (50K records is not going to be bad at all).
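One caveat with the awk step: `uniq -c` prefixes each line with a count, and `print $2` only emits the first whitespace-separated field, so entries containing spaces get truncated. A sed variant (a sketch that relies on uniq's "spaces, count, space" prefix format) sidesteps that:

```shell
# Strip the leading "      1 " prefix from count-1 lines and print
# only those; lines with any other count are suppressed.
sort foo.in | uniq -c | sed -n 's/^ *1 //p'
```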



My issue was more complicated than I originally guessed, but I figured out a solution. It's probably a bit convoluted, but it works for me, and that is what's important.

I was doing a firewall ACL audit. My latest ACL list had hit counts on it, which was throwing off my attempts to compare it against an older ACL list, since each line had a different hit count or line number.

Obviously, comparing whole lines wouldn't work, as 'diff' and 'comm' see the 'hitcnt' or 'line' number and report the line as different. So what I did was grep out just the hashes from each file:

grep -o '0x[0-9A-Fa-f]\{4,\}' old_acl_list.txt | sort > hashes_old
grep -o '0x[0-9A-Fa-f]\{4,\}' new_acl_list.txt | sort > hashes_new

Then I ran 'diff' on those. I found a way for 'diff' to show only the uniques from the newest file here: http://www.linuxquestions.org/questions/linux-newbie-8/comparing-two-linux-files-for-diffirences-and-similarities-822245/

diff --changed-group-format='%<' --unchanged-group-format='' hashes_new hashes_old >> final_hashes
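A quick sanity check of what those (GNU diff) format flags do, with two throwaway files standing in for the hash lists:

```shell
printf 'a\nb\nc\n' > hashes_new
printf 'b\n' > hashes_old
# %< prints the first file's lines for each changed group; unchanged
# groups are suppressed, leaving only lines unique to hashes_new.
diff --changed-group-format='%<' --unchanged-group-format='' hashes_new hashes_old
# prints: a then c
```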

and then all I did was grep for each of the unique values in the new ACL list:

for line in `cat final_hashes`; do
    grep -i "$line" new_acl_list.txt >> final_acl_audit.txt
done

I thought I'd post this, as I have seen too many posts in forums where you see 'nevermind, I figured it out' without any info. Yeah, it's not pretty, but if you're doing PCI firewall audits and have to compare two firewall ACL lists, this will do it...
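For anyone landing here later, the whole thing can be wrapped into one small script. This is a sketch under the same file-name assumptions as the post above (old_acl_list.txt, new_acl_list.txt); it swaps the diff trick for `comm -23`, which prints lines unique to its first (sorted) argument and does the same job:

```shell
#!/bin/sh
# Extract the 0x... hashes from each ACL list, sorted for comm.
grep -o '0x[0-9A-Fa-f]\{4,\}' old_acl_list.txt | sort > hashes_old
grep -o '0x[0-9A-Fa-f]\{4,\}' new_acl_list.txt | sort > hashes_new
# Keep only the hashes that appear in the new ACL but not the old one.
comm -23 hashes_new hashes_old > final_hashes
# Pull the full ACL line for each new-only hash.
while read -r hash; do
    grep -i "$hash" new_acl_list.txt
done < final_hashes > final_acl_audit.txt
```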

Kudos for giving back the answer!