
Remove Duplicate Lines from a File

By KevinADC on Jan 20th, 2007 5:00 am

A bare-bones code snippet to remove duplicate lines from a file. There are a number of ways to accomplish this task, but this is a fast and dependable method that uses Perl's in-place editor and a simple hash to get the job done.

This probably should not be used for really big files, because the hash keeps every unique line in memory, but files with a few thousand lines or even a few tens of thousands of lines should be OK. The bigger the file, the longer it may take to run.

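#!/usr/bin/perl

use strict;
use warnings;

my $file = '/path/to/file.txt';
my %seen = ();   # lines that have already been printed
{
   local @ARGV = ($file);   # make <> read from $file
   local $^I = '.bac';      # edit in place; the original is kept as file.txt.bac
   while(<>){
      $seen{$_}++;
      next if $seen{$_} > 1;   # skip any line seen before
      print;                   # written to the rewritten file, not to STDOUT
   }
}
print "finished processing file.\n";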

That's madly tiny and cryptic :cheesy:

I've been working in Java today for a college project; I used Java for years before I used Perl.

To do something similar to that in Java would be a mammoth task. There seems to be 'no such thing' as a useful Java hash, and reading files line by line isn't made easy either.

I certainly prefer the Perl way these days...

MattEvans

It probably is a bit cryptic. But code is that way if you don't understand the syntax of a particular language. It could be written very cryptically as a one-liner. Something like (unchecked for accuracy):

perl -i.bac -ne "next if ++$seen{$_}>1; print;" file.txt
KevinADC

Perl is quite bad (or good depending on how you look at it) for crypticness.

I never use those superglobal variables as implicit parameters or targets; it scares me ;)

But I'd much rather be scared by something powerful at my potential disposal than irritated by the overhead and safety checks involved in doing a lot of conceptually simple things in Java...

I guess they certainly aren't languages for the same purpose. But hey; my college project involves string processing, and could definitely make good use of untyped hashes, and it's gotta be done in Java. :mad:

MattEvans

I prefer sort | uniq
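
One difference worth noting: sort | uniq emits the lines in sorted order, whereas the hash approach in the original snippet keeps each line at its first position. A rough Java sketch of the two behaviors follows. The class name OrderDemo and its method names are made up for illustration, and TreeSet sorts by Java's natural String ordering, which may not match the collation that sort uses in a given locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo class, not part of the thread's code.
class OrderDemo {

    // Like the hash approach: unique lines in first-seen order.
    static List<String> uniqueInOrder(List<String> lines) {
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    // Roughly like sort | uniq: unique lines in sorted order.
    static List<String> uniqueSorted(List<String> lines) {
        return new ArrayList<>(new TreeSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("pear", "apple", "pear", "fig", "apple");
        System.out.println(uniqueInOrder(lines)); // [pear, apple, fig]
        System.out.println(uniqueSorted(lines));  // [apple, fig, pear]
    }
}
```

If the original line order matters, the first form is the one that matches the Perl snippet.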

Rashakil Fol

Took me < 5 min in Java. What are you talking about, MattEvans? Besides, you could do this even more easily in Haskell!

Set<String> lines = new HashSet<String>();

BufferedWriter bw = new BufferedWriter(new FileWriter(args[1].toString()));
BufferedReader br = new BufferedReader(new FileReader(args[0].toString()));
while(br.ready()) {
	String line = br.readLine();
	if (lines.contains(line))  {
		bw.write(line);
		bw.newLine();
	 } else {
		lines.add(line);
	 }
}
hindlem

Shocking mistake there. :-) It should read...

if (!lines.contains(line))  {
	bw.write(line);
	bw.newLine();
	lines.add(line);
}
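
Putting the corrected check into a complete program might look like the sketch below. It departs from the fragments above in a few ways that are assumptions, not hindlem's code: the class name RemoveDuplicateLines and the method dedupe are made up, the loop tests readLine() for null instead of calling br.ready() (ready() only reports whether a read would block, not whether the end of the file has been reached), try-with-resources closes the writer so buffered output is flushed, and UTF-8 is assumed for both files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper class; the name is illustrative only.
class RemoveDuplicateLines {

    // Copies input to output, keeping only the first occurrence of each line.
    // Returns the number of lines written.
    static int dedupe(Path input, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        int written = 0;
        try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter bw = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Set.add returns false when the line was already present.
                if (seen.add(line)) {
                    bw.write(line);
                    bw.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java RemoveDuplicateLines <input> <output>");
            return;
        }
        dedupe(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Unlike the Perl version, this writes to a second file rather than editing in place, as in the original Java fragment.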
hindlem

@uniq = grep !$list{$_}++, @array;

sensamurai
