Remove Duplicate Lines from a File

KevinADC 1 Tallied Votes 1K Views

A bare-bones code snippet to remove duplicate lines from a file. There are a number of ways to accomplish this task, but this is a fast and dependable method using Perl's in-place editor and a simple hash to get the job done.

This probably should not be used for really big files, since every unique line is kept as a key in the hash, but files with a few thousand lines, or even a few tens of thousands, should be OK. The bigger the file, the longer it may take to run.

d5e5 commented: I was looking for an example of using in-place editing without having to put the filename on the command line, and there it was: a local anonymous block and a local @ARGV. Thanks. +1
#!/usr/bin/perl

use strict;
use warnings;

my $file = '/path/to/file.txt';
my %seen = ();
{
   local @ARGV = ($file);  # feed the file to <> without a command-line argument
   local $^I = '.bac';     # turn on in-place editing; original kept as file.txt.bac
   while(<>){
      $seen{$_}++;            # count how many times this line has appeared
      next if $seen{$_} > 1;  # skip everything after the first occurrence
      print;                  # print goes back into the file, not to STDOUT
   }
}
print "finished processing file.\n";
MattEvans 473 Veteran Poster Team Colleague Featured Poster

That's madly tiny and cryptic :cheesy:

I've been working in Java today for a college project; I used Java for years before I used Perl.

To do something similar to that in Java would be a mammoth task. There seems to be 'no such thing' as a useful Java hash, and reading files line by line isn't made easy either.

I certainly prefer the Perl way these days...

KevinADC 192 Practically a Posting Shark

It probably is a bit cryptic. But code is that way if you don't understand the syntax of a particular language. It could be written very cryptically as a one-liner. Something like (unchecked for accuracy):

perl -i.bac -ne "next if ++$seen{$_}>1; print;" file.txt
MattEvans 473 Veteran Poster Team Colleague Featured Poster

Perl is quite bad (or good, depending on how you look at it) for crypticness.

I never use those superglobal variables as implicit parameters or targets; it scares me ;)

But I'd much rather be scared by something powerful at my potential disposal than irritated by the overhead and safety checks involved in doing a lot of conceptually simple things in Java...

I guess they certainly aren't languages for the same purpose. But hey, my college project involves string processing, and could definitely make good use of untyped hashes, and it's gotta be done in Java. :mad:

Rashakil Fol 978 Super Senior Demiposter Team Colleague

I prefer sort | uniq
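For anyone following along, a minimal sketch of that pipeline (the /tmp filename is just an example). One caveat: unlike the hash approach above, this reorders the lines, because uniq only collapses adjacent duplicates; sort -u does the same job in one step:

```shell
# Example input with duplicates; the path is arbitrary.
printf 'b\na\nb\nc\na\n' > /tmp/dupes.txt

sort /tmp/dupes.txt | uniq   # duplicates gone, but output is sorted
sort -u /tmp/dupes.txt       # same result in a single command
```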

hindlem 0 Newbie Poster

Took me < 5 min in Java? What are you talking about, MattEvans? Besides, you could do this even more easily in Haskell!

// assumes: import java.util.*; import java.io.*;
Set<String> lines = new HashSet<String>();

BufferedWriter bw = new BufferedWriter(new FileWriter(args[1].toString()));
BufferedReader br = new BufferedReader(new FileReader(args[0].toString()));
while(br.ready()) {
	String line = br.readLine();
	if (lines.contains(line))  {
		bw.write(line);
		bw.newLine();
	 } else {
		lines.add(line);
	 }
}
hindlem 0 Newbie Poster

Shocking mistake there. :-) It should read...

if (!lines.contains(line))  {
	bw.write(line);
	bw.newLine();
	lines.add(line);
}
sensamurai 0 Newbie Poster

@uniq = grep !$list{$_}++, @array;
