Member Avatar for brakeb

I have been beating my head into the desk with this issue, and I don't think it's a simple 'uniq' or 'sort' issue.

I have a file with many duplicate values in it.

File

dog
dog
cat
owl
owl
turkey
weasel
giraffe
giraffe
rooster

The output I am looking for would only have the following from the above file:

Output:

cat
turkey
weasel
rooster

Everything I've found so far removes all the dupes but keeps one copy of 'owl' or 'dog', which is not what I need. If a value is duplicated, I don't want it in the output at all. The file I have is one I've merged from two other files, each with nearly 50,000 lines, so you can understand why doing this by hand takes so long.

All 4 Replies

I could do this simply in C, C++, or Java. You keep a map where the key is the data and the value is the number of times you have seen it. With each input, you look up the data in the map. If it is not found, you add a new item with a value of 1. If it is found, you increment the value. When done reading the data, you walk through the map and output only those entries with a value of 1.

Maps are normal constructs for C++ and Java. For C you would use a structure with a character array (or pointer) for the data, and an integer for the value, and use an array of these structs to substitute for a map.
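If you'd rather not write a compiled program, awk's associative arrays give you the same map-and-count behaviour in one pass. A rough sketch, assuming the merged file is called foo.in (substitute your real filename); note that the output order is arbitrary:

awk '{ count[$0]++ }   # key = the whole line, value = times seen
     END { for (line in count) if (count[line] == 1) print line }' foo.in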

Member Avatar for brakeb

So it's going to be way more complicated than I thought :( Guess I'll stick with the spreadsheet method. I'm under a bit of a time crunch.

Thank you anyway.

You can easily do this in a shell. If your file is named foo.in then you would get what you want with:

sort foo.in | 
   uniq -c | 
   awk '($1 == 1) {print $2}'

This gives you the entries that occur exactly once in the input file.

Now, depending on how large your file is, the sort may take some time, but 50K records is not going to be bad at all.
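If your uniq has the -u option (it is in POSIX), you can drop the awk step entirely, since -u prints only the lines that are not repeated in the sorted input:

sort foo.in | uniq -u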

Member Avatar for brakeb

My issue was more complicated than I had originally guessed, but I figured out a solution. It's probably a bit convoluted, but it works for me, and that is what's important.

I was doing a firewall ACL audit. My latest ACL list had hit counts on every line, which was throwing off my attempts to compare it against an older ACL list, since each line had a different hitcnt value or a different line number.

Obviously, comparing whole lines wouldn't work, as 'diff' and 'comm' see the 'hitcnt' or 'line' number and report the line as different. So what I did was grep out just the hashes from each file:

grep -o '0x[0-9A-Fa-f]\{4,\}' old_acl_list.txt | sort >> hashes_old
grep -o '0x[0-9A-Fa-f]\{4,\}' new_acl_list.txt | sort >> hashes_new

Then I ran 'diff' on those. I found a way to make 'diff' show only the lines unique to the newest file here: http://www.linuxquestions.org/questions/linux-newbie-8/comparing-two-linux-files-for-diffirences-and-similarities-822245/

diff --changed-group-format='%<' --unchanged-group-format='' hashes_new hashes_old >> final_hashes
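Here '%<' tells diff to print the lines from the first file (hashes_new) for every group that differs, and the empty unchanged-group format drops everything the two files have in common, so final_hashes ends up holding only the hashes that exist in the new list but not the old one. Since both hash files are already sorted, comm should give the same result with less typing (just a sketch using the same filenames, not a change to the method above):

comm -23 hashes_new hashes_old > final_hashes   # -2/-3 drop old-only and shared lines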

Then all I did was grep the new ACL list for each of those unique hashes:

for line in $(cat final_hashes); do
    grep -i "$line" new_acl_list.txt >> final_acl_audit.txt
done
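With roughly 50,000 hashes, that loop starts grep once per hash. A single grep with a pattern file should do the same job in one pass over the ACL list; -F treats each hash as a fixed string and -f reads the patterns from final_hashes (same filenames as above, offered only as a sketch):

grep -i -F -f final_hashes new_acl_list.txt > final_acl_audit.txt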

I thought I'd post this because I have seen too many forum threads that end with 'never mind, I figured it out' and no details. Yeah, it's not pretty, but if you're doing PCI firewall audits and have to compare two firewall ACL lists, this will do it...
