I'm still fairly new to shell scripting. Can someone please show me how to write a script for the following task? I've got a server set up.

I want to find duplicate files that are taking up all the space on the hard drive. I want to group the files by size and then by their MD5 checksum, since two files are presumed to be the same if they have the same MD5 checksum.

I want my shell script to generate a list of the identical files along with their locations.

Can someone help? I think this will be very useful.

I actually made a little script once that takes MD5 checksums of the files in most $PATH directories and uploads them to a remote server for intrusion detection. It was more academic than useful, but you want the same concept:


# Refuse to install unless -f is given, so the config gets edited first
if [ "$1" != "-f" ]; then
  echo ""
  echo "Edit md5.conf before you proceed"
  echo "once you are ready to install: "
  echo "$0 -f"
  echo ""
  exit 0
fi

ifiles="/usr/sbin/md5check /usr/sbin/md5compare /usr/sbin/md5update /etc/md5.conf"
for i in $ifiles; do
  if test -f "$i"; then
    echo "Destination file already exists. Exiting"
    exit 1
  fi
done

cp md5check /usr/sbin
cp md5compare /usr/sbin
cp md5update /usr/sbin
cp md5.conf /etc

chmod 500 /usr/sbin/md5check
chmod 500 /usr/sbin/md5compare
chmod 500 /usr/sbin/md5update
chmod 400 /etc/md5.conf

chown root:root /usr/sbin/md5check
chown root:root /usr/sbin/md5compare
chown root:root /usr/sbin/md5update
chown root:root /etc/md5.conf

# Mark the installed files immutable
for i in $ifiles; do
  chattr +i "$i"
done


# md5 tripwire config

# box hostname
hname=`hostname -s`

# server ip

# server port

#login name for remote machine

#directories to search (space delimited)
dsearch="/bin/ /sbin/ /usr/bin/ /usr/sbin/ /lib/ /usr/lib/ /usr/local/ /etc/ /boot/"


source /etc/md5.conf

# Remove any database already generated today
if test -f "$(date +md5-$hname.%Y%m%d.txt)"; then
  rm -f "$(date +md5-$hname.%Y%m%d.txt)"
fi

echo ""
echo "Calculating md5 database"

for dir in $dsearch; do
   # -print0 / -0 keep file names with spaces intact
   find "$dir" -type f -print0 | xargs -0 /usr/bin/md5sum >> "$(date +md5-$hname.%Y%m%d.txt)"
done

echo "post installation md5 database calculated"
echo ""


source /etc/md5.conf
if ! [ "$UID" = "0" ]; then
  echo "ERROR: Must be root to run"
  exit 1
fi
# Fetch and unpack the previous database from the remote server
scp -P $sport $lname@$sip:~/.$oldfile.tgz . 2>/dev/null
if test -f ".$oldfile.tgz"; then
  tar -zxf ".$oldfile.tgz"
fi
rm -rf ".$oldfile.tgz"
if [ -z "$oldfile" ] || ! test -f "$oldfile"; then
  echo "Error retrieving md5 database from server"
  exit 1
fi
newfile=`find ./ -iname "*md5-$hname*.txt"`
if ! test -f "$newfile"; then
  echo "Error generating new md5 database"
  exit 1
fi
# Compare the new database against the old one
diff "$newfile" "$oldfile" > changes
rm -rf md5-$hname* .md5-$hname*
if [ "$(wc -l < changes)" -eq 0 ]; then
  echo "No changes were detected. Cleaning up."
  rm -rf changes
else
  echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
  echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
  echo "      Changes were detected. View _changes_ for details."
  echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
  echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
fi


source /etc/md5.conf
logname=`date +md5-$hname.%Y%m%d.txt`
mv $logname $cpname
tar czf $cpname.tgz $cpname
echo "Please hit enter to continue."
scp -P $sport $cpname.tgz $lname@$sip:~/.$cpname.tgz 2>/dev/null
echo ""
echo "File copied to remote host"
echo ""
rm -rf $cpname
rm -rf $cpname.tgz
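
If you only need the baseline-and-compare idea on a single machine, md5sum can verify a saved database by itself with -c. A minimal sketch, assuming GNU coreutils and findutils; the function names make_baseline and check_baseline are hypothetical and not part of the scripts above:

```shell
#!/bin/bash
# make_baseline DIR FILE: record an MD5 line for every regular file under DIR.
# Keep FILE outside DIR, otherwise the database would be hashed as well.
make_baseline() {
  find "$1" -type f -print0 | sort -z | xargs -0 -r md5sum > "$2"
}

# check_baseline FILE: re-hash every file listed in FILE and print only the
# entries that no longer match. Returns non-zero if anything changed or is missing.
check_baseline() {
  md5sum --quiet -c "$1"
}
```

Typical use would be something like make_baseline /usr/sbin /root/sbin.md5 once, then check_baseline /root/sbin.md5 from cron. This drops the remote copy, so it only helps if the database itself is stored somewhere an intruder cannot rewrite it.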

So basically your script takes files from the specified directories, computes their checksums, and sends them off to another server as your baseline?

Do you think you could help me here? I think my script is much simpler; I'm very new.

Can you show me a script that will find duplicate files, grouping them by size and then by their MD5 checksum? I want the script to then generate a list of files along with their locations and either print it to standard output or write it to a text file.

Any ideas? You seem pretty advanced.

Well, yes, but I was thinking you would take those as an example and formulate a solution :P. I'm all right with shell scripting, and I'm sure there is a better way than this, but it does what you want:

sk@sk:~$ cd /tmp
sk@sk:/tmp$ mkdir -p ./daniweb/dir1/a ./daniweb/dir1/b ./daniweb/dir2/a ./daniweb/dir2/b
sk@sk:/tmp$ cd daniweb
sk@sk:/tmp/daniweb$ dd if=/dev/zero of=./dir1/a/file1 bs=1024 count=1024 >> /dev/null 2>&1
sk@sk:/tmp/daniweb$ dd if=/dev/zero of=./dir2/b/file1 bs=1024 count=1024 >> /dev/null 2>&1
sk@sk:/tmp/daniweb$ echo "abc123" >> ./dir1/b/file2
sk@sk:/tmp/daniweb$ echo "abc123" >> ./dir2/a/file2
sk@sk:/tmp/daniweb$ rm -fr .tmp
sk@sk:/tmp/daniweb$ for i in `find ./ -type f`; do echo `du ${i} | awk '{ print $1 }'` `md5sum ${i} | awk '{ print $1 }'` ${i} >> .tmp; done
sk@sk:/tmp/daniweb$ sort -k1 -r -n .tmp
1029 b6d81b360a5672d80c27430f39153e2c ./dir2/b/file1
1029 b6d81b360a5672d80c27430f39153e2c ./dir1/a/file1
1 2c6c8ab6ba8b9c98a1939450eb4089ed ./dir2/a/file2
1 2c6c8ab6ba8b9c98a1939450eb4089ed ./dir1/b/file2

And the commands without my prompt:

cd /tmp
mkdir -p ./daniweb/dir1/a ./daniweb/dir1/b ./daniweb/dir2/a ./daniweb/dir2/b
cd daniweb
dd if=/dev/zero of=./dir1/a/file1 bs=1024 count=1024 >> /dev/null 2>&1
dd if=/dev/zero of=./dir2/b/file1 bs=1024 count=1024 >> /dev/null 2>&1
echo "abc123" >> ./dir1/b/file2
echo "abc123" >> ./dir2/a/file2

rm -fr .tmp
for i in `find ./ -type f`; do echo `du ${i} | awk '{ print $1 }'` `md5sum ${i} | awk '{ print $1 }'` ${i} >> .tmp; done
sort -k1 -r -n .tmp
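
To get closer to what was asked (group by size first, then by MD5, and list only the files that really have a twin), the same idea can be sketched as a function. This is a rough sketch, not a polished tool: the name find_dupes is made up for this example, it assumes GNU find (for -printf), sort, awk and md5sum, and it assumes file names contain no tabs or newlines.

```shell
#!/bin/bash
# find_dupes DIR: print "size<TAB>md5<TAB>path" for every file under DIR
# that shares both its size and its MD5 checksum with at least one other file.
find_dupes() {
  local dir=${1:-.}
  # Step 1: list "size<TAB>path" for every regular file.
  find "$dir" -type f -printf '%s\t%p\n' |
    # Step 2: keep only files whose size occurs more than once,
    # so files with a unique size are never hashed.
    awk -F'\t' '{ n[$1]++; line[NR] = $0; size[NR] = $1 }
                END { for (i = 1; i <= NR; i++) if (n[size[i]] > 1) print line[i] }' |
    # Step 3: hash the remaining candidates.
    while IFS=$'\t' read -r size path; do
      sum=$(md5sum < "$path" | awk '{ print $1 }')
      printf '%s\t%s\t%s\n' "$size" "$sum" "$path"
    done |
    # Step 4: sort by size (largest first), then checksum, then path,
    # and drop rows whose (size, md5) pair appears only once.
    sort -t$'\t' -k1,1nr -k2,2 -k3 |
    awk -F'\t' '{ key = $1 FS $2; c[key]++; row[NR] = $0; k[NR] = key }
                END { for (i = 1; i <= NR; i++) if (c[k[i]] > 1) print row[i] }'
}
```

Run it as find_dupes /tmp/daniweb to print to standard output, or find_dupes /tmp/daniweb > dupes.txt to save the list. Filtering on size before hashing matters on a full disk, because md5sum then only reads files that could possibly be duplicates.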

Excellent, thank you for your help. Let me study this and I'll let you know if I have any questions or problems. I appreciate your time.

I also want to make sure users' home directories don't contain
world-writable directories, directories owned by other users, or
other potential security problems. I'd like to echo any directory where
one user's home directory can be modified by another user.

Can someone help me with these additions? I think this would be very important as well.

In fact I have those scripts right next to my md5 generator scripts. If you want to mark this thread solved and start a new thread for your new question I would be more than willing to assist :)
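
As a rough starting point for the home-directory check in the meantime, here is a sketch. It assumes GNU find; the function name check_homes, its arguments, and the default UID cutoff of 1000 for "real" users are assumptions of this example, so adjust them for your system.

```shell
#!/bin/bash
# check_homes [PASSWD_FILE] [MIN_UID]: for each account in PASSWD_FILE, print
# "mode owner path" for every directory under its home that is world-writable
# or owned by a different UID than the account itself.
check_homes() {
  local passwd=${1:-/etc/passwd}
  local min_uid=${2:-1000}   # assumed cutoff to skip system accounts
  # passwd fields: name:password:uid:gid:gecos:home:shell
  while IFS=: read -r user _ uid _ _ home _; do
    [ "$uid" -ge "$min_uid" ] || continue
    [ -d "$home" ] || continue
    # -perm -0002 matches anything with the world-write bit set;
    # ! -uid matches anything not owned by the account's UID.
    find "$home" -xdev -type d \( -perm -0002 -o ! -uid "$uid" \) \
      -printf '%m %u %p\n' 2>/dev/null
  done < "$passwd"
}
```

Running check_homes as root with no arguments scans every account in /etc/passwd with a UID of 1000 or above. Group-writable directories are not reported here; adding -o -perm -0020 inside the parentheses would cover them if your users share groups.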
