I'm doing some research to determine the most efficient way to copy files. I've got 3 candidate functions:

#1

# uses a generator (iter with a sentinel) to load 16 KB segments into memory
def copy(src, dst, iteration=1000):
    for x in xrange(iteration):
        def _write(filesrc, filedst):
            filegen = iter(lambda: filesrc.read(16384), "")
            try:
                while True:
                    filedst.write(filegen.next())
            except StopIteration:
                pass

        with open(src, 'rb') as fsrc:
            with open(dst, 'wb') as fdst:
                _write(fsrc, fdst)

#2

# loads the entire file into memory
def copy2(src, dst, iteration=1000):
    for x in xrange(iteration):
        with open(src, 'rb') as fsrc:
            with open(dst, 'wb') as fdst:
                fdst.write(fsrc.read())

#3

# uses an iterator to read 16 KB chunks
def copy3(src, dst, iteration=1000):
    for x in xrange(iteration):
        with open(src, 'rb') as fsrc:
            with open(dst, 'wb') as fdst:
                for chunk in iter(lambda: fsrc.read(16384), ""):
                    fdst.write(chunk)

System Environment:
Win 7 64-bit
3 GB RAM
Intel Core 2 Duo @ 2.0 GHz

The results:

When the file size is 1 MB, 1000 iterations each:
>>> copy(SRC, DST)
copy took 5.96600008011 seconds
>>> copy2(SRC,DST)
copy2 took 3.85299992561 seconds
>>> copy3(SRC, DST)
copy3 took 5.35699987411 seconds

The most efficient function is the one that loads the file entirely into memory; at 1 MB the whole file fits easily, so a single read()/write() pair has the least per-call overhead.


When the file size is 107 MB, 5 iterations each:
>>> copy(SRC, DST, 5)
copy took 3.04099988937 seconds
>>> copy2(SRC,DST, 5)
copy2 took 17.0360000134 seconds
>>> copy3(SRC, DST, 5)
copy3 took 2.2429997921 seconds


Loading the file entirely into memory is now the slowest by far; the chunked versions reuse a small buffer instead of allocating a 107 MB string on every pass.

I thought the results were interesting. If anyone has a more efficient function, feel free to contribute.
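
(A note for anyone reproducing this: the "took ... seconds" lines come from a timing wrapper that is never shown in the thread. A minimal sketch of a decorator matching that output, in the spirit of the @bench used in a reply below, might be:)

import time
from functools import wraps

def bench(func):
    # hypothetical timing decorator; prints "<name> took <n> seconds"
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.clock()   # time.clock() is the high-resolution timer on Windows / Python 2
        result = func(*args, **kwargs)
        print '%s took %s seconds' % (func.__name__, time.clock() - start)
        return result
    return wrapper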


What about copyfile(), copy() and copy2() in the shutil module?
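
(For reference, thin wrappers along these lines would slot shutil into the same benchmark; a sketch mirroring the loop style of the functions above, with names matching the results further down:)

import shutil

def shutil_copy(src, dst, iteration=1000):
    # copies data plus permission bits
    for x in xrange(iteration):
        shutil.copy(src, dst)

def shutil_copy2(src, dst, iteration=1000):
    # like copy(), but also preserves timestamps via copystat()
    for x in xrange(iteration):
        shutil.copy2(src, dst)

def shutil_copyfile(src, dst, iteration=1000):
    # copies file contents only
    for x in xrange(iteration):
        shutil.copyfile(src, dst)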

I increased the block size from read(16384) to read(100000) and it seemed to reduce processing time by about 15-20%.

I modified copy1 slightly:

@bench
def copy1(src, dst, iteration=1000):

    def _write(filesrc, filedst):
        filegen = iter(lambda: filesrc.read(100000), "")
        try:
            while True:
                filedst.write(filegen.next())
        except StopIteration:
            pass

    for x in xrange(iteration):
        with open(src, 'rb') as fsrc:
            with open(dst, 'wb') as fdst:
                _write(fsrc, fdst)

Here are the results, including the 3 shutil functions:

Filesize: 107mb
Iterations: 7 each
System Environment:
Win 7, 3 GB RAM, Core 2 Duo @ 2.0 GHz

benchmark(7)
# copy1 uses a generator
copy1 took 2.72599983215 seconds <-------------- SECOND
# copy2 loads entire file to memory
copy2 took 21.0590000153 seconds <------------- SLOWEST
# copy3 uses an iterator
copy3 took 2.67499995232 seconds <-------------- FASTEST
# shutil functions
shutil_copy took 5.40400004387 seconds
shutil_copy2 took 7.40000009537 seconds
shutil_copyfile took 4.68000006676 seconds


copy3 is still the fastest. shutil is slower here for a few reasons: copy and copy2 do extra work on top of the raw data copy (copy2 also replicates metadata via copystat), the data is funneled through the extra function-call layer of copyfileobj, and the default 16 KB block size is too small for today's computers. The iterator was consistently slightly faster than the generator, but there is probably a more efficient way to write both.
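
(On "a more efficient way to write both": a plain read loop drops the iter()/lambda machinery entirely; a sketch, equivalent to copy1/copy3 with the 100 KB block:)

def copy4(src, dst, blocksize=100000):
    # one read() and one write() per chunk, no per-chunk lambda call
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
        while True:
            chunk = fsrc.read(blocksize)
            if not chunk:
                break
            fdst.write(chunk)

(And if the 16 KB default is what holds shutil back, shutil.copyfileobj accepts an explicit buffer size as its third argument:)

import shutil

def shutil_copy_bigbuf(src, dst, length=100000):
    # same chunked copy, delegated to shutil with a larger buffer
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
        shutil.copyfileobj(fsrc, fdst, length)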

Here are my findings: `mmap` is wicked fast for large files if they can fit completely in memory. Unfortunately, having a 32-bit machine limited my testing capabilities.

My config:

Windows XP 32-bit, 3 GiB RAM
  Intel i3 @ 3.07 GHz
  SATA-II 3 Gb/s; 7200 RPM
  Python 2.7.2

My output:

Input file size: 348 MiB; iteration count: 20.0
        Function name: shutil_copy; total time: 78.0966309964 sec; one iteration: 3.90483154982
        Function name: mmap_copy; total time: 11.2840840173 sec; one iteration: 0.564204200866
        Function name: mmap_copy_chunked; total time: 225.959542146 sec; one iteration: 11.2979771073
        Function name: copy1; total time: 78.7868789942 sec; one iteration: 3.93934394971
        Function name: copy2; total time: 73.7354360477 sec; one iteration: 3.68677180238
        Function name: copy3; total time: 100.388904075 sec; one iteration: 5.01944520376
Input file size: 15 MiB; iteration count: 100.0
        Function name: shutil_copy; total time: 2.5043987676 sec; one iteration: 0.025043987676
        Function name: mmap_copy; total time: 2.87714018718 sec; one iteration: 0.0287714018718
        Function name: mmap_copy_chunked; total time: 30.966067508 sec; one iteration: 0.30966067508
        Function name: copy1; total time: 2.38466791661 sec; one iteration: 0.0238466791661
        Function name: copy2; total time: 18.5943323786 sec; one iteration: 0.185943323786
        Function name: copy3; total time: 2.03054891633 sec; one iteration: 0.0203054891633

Code:

import mmap
import os
import shutil

# uses generator to load segments to memory
def copy1(src, dst):
    def _write(filesrc, filedst):
        filegen = iter(lambda: filesrc.read(16384),"")
        try:
            while True:
                filedst.write(filegen.next())
        except StopIteration:
            pass

    with open(src, 'rb') as fsrc:
        with open(dst, 'wb') as fdst:
            _write(fsrc, fdst)

# loads entire file to memory
def copy2(src, dst):
    with open(src, 'rb') as fsrc:
        with open(dst, 'wb') as fdst:
            fdst.write(fsrc.read())

def copy3(src, dst):
    with open(src, 'rb') as fsrc:
        with open(dst, 'wb') as fdst:
            for x in iter(lambda: fsrc.read(16384),""):
                fdst.write(x)


def mmap_copy(infname, outfname):
    with open(infname, 'r+b') as inf, open(outfname, 'a+b') as outf:
        outf.write('0')   # mmap can't map a zero-length file, so make sure one byte exists
        outf.flush()
        infile = mmap.mmap(inf.fileno(), length=0, access=mmap.ACCESS_READ)
        outfile = mmap.mmap(outf.fileno(), length=0, offset=0, access=mmap.ACCESS_WRITE)
        sz = infile.size()
        bytes = infile.read(sz)   # pull the whole mapped input in one call
        outfile.resize(sz)        # grow the output mapping to match
        outfile.write(bytes)
        outfile.close()
        infile.close()

def mmap_copy_chunked(infname, outfname):
    # note: resizing the output mapping after every 16 KB chunk is what makes
    # this version so much slower than mmap_copy
    chunk_size = 16 * 1024
    with open(infname, 'r+b') as inf, open(outfname, 'a+b') as outf:
        outf.write('0')
        outf.flush()
        infile = mmap.mmap(inf.fileno(), length=0, access=mmap.ACCESS_READ)
        outfile = mmap.mmap(outf.fileno(), length=0, offset=0, access=mmap.ACCESS_WRITE)
        sz = 0
        while 1:
            bytes = infile.read(chunk_size)
            if not bytes:
                break
            sz += len(bytes)
            outfile.resize(sz)
            outfile.write(bytes)
        outfile.close()
        infile.close()

def shutil_copy(infname, outfname):
    shutil.copyfileobj(open(infname, 'rb'), open(outfname, 'wb'))

def main():
    from timeit import Timer
    for infname, times in [('video.mkv', 20.0), ('a.exe', 100.0)]:
        sz = os.path.getsize(infname)
        print 'Input file size: %s MiB; iteration count: %s' % ((sz / (1024 * 1024)), times)
        outfname = infname + '.tmp'
        for func in ['shutil_copy', 'mmap_copy', 'mmap_copy_chunked', 'copy1', 'copy2', 'copy3']:
            t = Timer('%s("%s", "%s")' % (func, infname, outfname), 'from __main__ import mmap_copy_chunked, mmap_copy, copy1, copy2, copy3, shutil_copy')
            total_time = t.timeit(number=int(times))
            print '\tFunction name: %s; total time: %s sec; one iteration: %s' % (func, total_time, total_time / times)

if __name__ == '__main__':
    main()

IMO, `shutil` is a decent compromise since it delivers consistent performance across file sizes. If you still have speed issues, trying out an `mmap`-based solution for large files might be an option.
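
(If someone wants to combine the two, here is a minimal dispatch sketch reusing the mmap_copy and copy3 definitions above; the 64 MiB threshold is an arbitrary assumption to tune per machine:)

import os

LARGE_FILE = 64 * 1024 * 1024   # hypothetical cutoff, not benchmarked

def smart_copy(src, dst):
    # mmap paid off on the 348 MiB file above; plain chunked copy won on the 15 MiB one
    if os.path.getsize(src) >= LARGE_FILE:
        mmap_copy(src, dst)
    else:
        copy3(src, dst)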

With a larger block size, copy1 and copy3 are significantly faster than shutil in the large-file category. (The [Error 8] line below is mmap_copy failing on the 1086 MiB file: a 32-bit process doesn't have enough address space to map it.) Here's what I got:

Input file size: 104 MiB; iteration count: 20.0
	Function name: shutil_copy; total time: 19.107860699 sec; one iteration: 0.955393034948
	Function name: mmap_copy; total time: 10.1346310338 sec; one iteration: 0.506731551691
	Function name: copy1; total time: 6.04644839464 sec; one iteration: 0.302322419732
	Function name: copy3; total time: 6.07191979681 sec; one iteration: 0.303595989841
Input file size: 1086 MiB; iteration count: 2.0
	Function name: shutil_copy; total time: 156.421775246 sec; one iteration: 78.2108876228
[Error 8] Not enough storage is available to process this command
	Function name: copy1; total time: 143.615497448 sec; one iteration: 71.8077487238
	Function name: copy3; total time: 135.97764655 sec; one iteration: 67.9888232748
Input file size: 0 MiB; iteration count: 500
	Function name: shutil_copy; total time: 0.815837531761 sec; one iteration: 0.00163167506352
	Function name: mmap_copy; total time: 0.567309099254 sec; one iteration: 0.00113461819851
	Function name: copy1; total time: 0.498243360426 sec; one iteration: 0.000996486720852
	Function name: copy3; total time: 0.87309633203 sec; one iteration: 0.00174619266406
Code:

import mmap
import os
import shutil

# uses generator to load segments to memory
def copy1(src, dst):
    def _write(filesrc, filedst):
        filegen = iter(lambda: filesrc.read(100000),"")
        try:
            while True:
                filedst.write(filegen.next())
        except StopIteration:
            pass

    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
         _write(fsrc, fdst)

# loads entire file to memory
def copy2(src, dst):
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
         fdst.write(fsrc.read())

def copy3(src, dst):
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
         for x in iter(lambda: fsrc.read(100000),""):
             fdst.write(x)


def mmap_copy(infname, outfname):
    with open(infname, 'r+b') as inf, open(outfname, 'a+b') as outf:
        outf.write('0')
        outf.flush()
        infile = mmap.mmap(inf.fileno(), length=0, access=mmap.ACCESS_READ)
        outfile = mmap.mmap(outf.fileno(), length=0, offset=0, access=mmap.ACCESS_WRITE)
        sz = infile.size()
        bytes = infile.read(sz)
        outfile.resize(sz)
        outfile.write(bytes)
        outfile.close()
        infile.close()

##def mmap_copy_chunked(infname, outfname):
##    chunk_size = 16 * 1024
##    with open(infname, 'r+b') as inf, open(outfname, 'a+b') as outf:
##        outf.write('0')
##        outf.flush()
##        infile = mmap.mmap(inf.fileno(), length=0, access=mmap.ACCESS_READ)
##        outfile = mmap.mmap(outf.fileno(), length=0, offset=0, access=mmap.ACCESS_WRITE)
##        sz = 0
##        while 1:
##            bytes = infile.read(chunk_size)
##            if not bytes:
##                break
##            sz += len(bytes)
##            outfile.resize(sz)
##            outfile.write(bytes)
##        outfile.close()
##        infile.close()

def shutil_copy(infname, outfname):
    shutil.copyfileobj(open(infname, 'rb'), open(outfname, 'wb'))

def main():
    from timeit import Timer
    for infname, times in [('test source.log', 20.0), ('Dexter S06E02.mkv', 2.0), ('small file.txt', 500)]:
        sz = os.path.getsize(infname)
        print 'Input file size: %s MiB; iteration count: %s' % ((sz / (1024 * 1024)), times)
        outfname = infname + '.tmp'
        for func in ['shutil_copy', 'mmap_copy', 'copy1', 'copy3']:
           try:
               t = Timer('%s("%s", "%s")' % (func, infname, outfname), 'from __main__ import mmap_copy, copy1, copy3, shutil_copy')
               total_time = t.timeit(number=int(times))
               print '\tFunction name: %s; total time: %s sec; one iteration: %s' % (func, total_time, total_time / times)
           except Exception, e:
               print e

if __name__ == '__main__':
    main()