Can I somehow extract a single file from a zip archive without replicating the directory structure inside the zip file?
For example, the file I need is:
archive.zip/folder1/folder2/fileIneed.doc
When I use zipfile.extract I get the file, but it lands in the destination folder with the same directory structure.

I'm at home now and my .py file is at work, so I can't paste the real code.

All 11 Replies

You could create a file with the same name from the data read out of the zip archive; you will not get the same file attributes, though:

from zipfile import ZipFile as zf
with zf('j:/Lataukset/Vb6501.zip') as f, open('MSBIND.DLL', 'wb') as out:
    out.write(f.read('Vb6501/MSBIND.DLL'))
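
The same idea can be wrapped in a small helper. This is only a sketch: the name extract_flat and its parameters are made up for illustration, and contextlib.closing is used so the code does not depend on ZipFile itself supporting the with statement.

```python
import os
import shutil
import zipfile
from contextlib import closing


def extract_flat(zip_path, member, dest_dir):
    """Copy one archive member into dest_dir, dropping its folders.

    extract_flat is a hypothetical helper, not part of the zipfile module.
    """
    # Member names inside a zip archive always use forward slashes.
    target = os.path.join(dest_dir, member.rsplit('/', 1)[-1])
    with closing(zipfile.ZipFile(zip_path)) as archive:
        with closing(archive.open(member)) as src:
            with open(target, 'wb') as dst:
                # Stream the data instead of loading it all into memory.
                shutil.copyfileobj(src, dst)
    return target
```

Called as extract_flat('archive.zip', 'folder1/folder2/fileIneed.doc', 'out'), it should write out/fileIneed.doc, provided the out folder already exists. As noted above, timestamps and other file attributes are not carried over.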

Will check it out in the morning, thanks for the idea.

I haven't made my problem clear enough, so I'll do it now :)
I have a folder with ~2000 *.docx files that I want to loop through and extract all the images from.

The code I have so far is the following:

import os, time,  re,  Image, zipfile
t0 = time.clock()
path = 'C:\\kontrolneliste\\docx\\'
for (path, dirs, files) in os.walk(path):
    for file in files:
        fname = file[:7]
        docx = path + '\\' + fname + '.docx'
        print docx
        destinationPath = 'c:\\aa\\' + fname + '\\'
        if not os.path.isdir(destinationPath):
            os.mkdir(destinationPath)
        sourceZip = zipfile.ZipFile(docx)
        for name in sourceZip.namelist():
            print name
            if name.find('word/media/')!= -1 :
                print re.sub('word/media/','',destinationPath)
                sourceZip.extract(name,destinationPath)
        sourceZip.close()
##    print len(files)
exectime = time.clock() - t0
print '--------------------------------------'
print 'Executed in: ', round(exectime,2), "seconds"
os.system('pause')

this gives me the following error once I run it :

C:\kontrolneliste\docx\\0000110.docx
Traceback (most recent call last):
  File "C:\py\KListe_unzipper.py", line 12, in <module>
    sourceZip = zipfile.ZipFile(docx)
  File "C:\Python26\lib\zipfile.py", line 693, in __init__
    self._GetContents()
  File "C:\Python26\lib\zipfile.py", line 713, in _GetContents
    self._RealGetContents()
  File "C:\Python26\lib\zipfile.py", line 723, in _RealGetContents
    endrec = _EndRecData(fp)
  File "C:\Python26\lib\zipfile.py", line 189, in _EndRecData
    fpin.seek(-sizeEndCentDir, 2)
IOError: [Errno 22] Invalid argument
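
Side note on the traceback: the seek fails while zipfile looks for the end-of-archive record. One possible cause (an assumption, the thread never confirms it) is a file in the folder that is too short to be a real zip, such as an empty .docx. A minimal sketch that skips such files with zipfile.is_zipfile; the name iter_docx_archives is hypothetical:

```python
import os
import zipfile


def iter_docx_archives(root_dir):
    """Yield paths of .docx files under root_dir that look like valid zips."""
    for folder, _dirs, files in os.walk(root_dir):
        for filename in files:
            if not filename.lower().endswith('.docx'):
                continue
            full_path = os.path.join(folder, filename)
            # is_zipfile returns False for empty or truncated files
            # instead of raising an error.
            if zipfile.is_zipfile(full_path):
                yield full_path
```

Using os.path.join here also avoids building the path by hand from file[:7] and string concatenation.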

Now, the problem is that yesterday it worked, but it created directories inside the destination path like this:
c:\aa\0000110\word\media\image1.png

this morning I messed something up without backing up the working version... :!

help plz :)

for some reason, it's working now again :

import os, time,  re,  Image, zipfile
t0 = time.clock()
path = "C:\\kontrolneliste\\docx\\"
for (path, dirs, files) in os.walk(path):
    for file in files:
        fname = file[:7]
        docx = path + '\\' + fname + '.docx'
        print docx
        destinationPath = 'c:\\aa\\' + fname + '\\'
        if not os.path.isdir(destinationPath):
            os.mkdir(destinationPath)
        sourceZip = zipfile.ZipFile(docx)
        for name in sourceZip.namelist():
            if name.find('word/media/')!= -1 :
                print re.sub('word/media/','',destinationPath)
                sourceZip.extract(name,destinationPath)
        sourceZip.close()
exectime = time.clock() - t0
print '--------------------------------------'
print 'Executed in: ', round(exectime,2), "seconds"
os.system('pause')

but again I get this structure : http://img831.imageshack.us/img831/1314/imgdn.jpg

@ Tech B
Your script extracts everything the way it should, but it stores everything in one folder, overwriting all previous files.
So, is there some workaround to fix my script so that it extracts into c:\aa\0000110\image1.png
instead of
c:\aa\0000110\word\media\image1.png
or to extract them all in one run, and then in a second loop move them from word/media into the root folders?

I don't understand the print re.sub('word/media/','',destinationPath) line: it only prints a modified copy of the string and changes nothing. You probably meant to remove word/media/ from the path, and it should be destinationPath = re.sub(r'word[/\\]media[/\\]','',destinationPath) .

Tried that too (with my re.sub, and now with yours), but I am still getting the same output structure.

There seems to be a mistake: shouldn't you remove word/media/ from name instead of destinationPath ?

Here is how I would try to write it

import time,  Image, zipfile
from kernilis.path import path
t0 = time.clock()
path = path("C:")/"kontrolneliste"/"docx"
word_media = path('word', 'media', '')
for (root, dirs, files) in os.walk(path):
    for file in files:
        fname = file[:7]
        docx = root/(fname + '.docx')
        print docx
        destinationPath = path('c:')/'aa'/fname
        if not destinationPath.isdir():
            destinationPath.mkdir()
        sourceZip = zipfile.ZipFile(docx)
        for name in sourceZip.namelist():
            if name. :
                name = path(*path(name).splitall())
                name = name.replace(word_media,'')
                sourceZip.extract(name,destinationPath)
        sourceZip.close()
exectime = time.clock() - t0
print '--------------------------------------'
print 'Executed in: ', round(exectime,2), "seconds"
os.system('pause')

I'm using a version of J Orendorff's very useful path module (I added and modified a few features), which I call kernilis.path. See the attached file.

first I get syntax error for line :
if name. :
then

Traceback (most recent call last):
  File "C:\py\_____3.py", line 2, in <module>
    from kernilis.path import path
ImportError: No module named kernilis.path

Then I copied path.py into lib/site-packages under the name kernilis.path.py and got an error, so I renamed the file back to path.py and changed the code to this:

import time,  Image, zipfile
from path import path

that gave me

File "C:\py\_____3.py", line 5, in <module>
    word_media = path('word', 'media', '')
TypeError: 'path' object is not callable

So now I'm confused :)
I'll try some things and will keep this thread up to date. I hope I'll solve this, and maybe someone else will benefit from it one day too :]

It's because at line 4 I kept your variable named path. The assignment overwrites the imported path class.
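
For anyone who would rather avoid the extra path module: ZipFile.extract always recreates the path stored in the archive under the target folder, so renaming the member before calling it cannot flatten the result. A standard-library sketch that reads each image and writes it under its bare file name instead; extract_docx_images and MEDIA_PREFIX are names invented for this example:

```python
import os
import zipfile
from contextlib import closing

MEDIA_PREFIX = 'word/media/'


def extract_docx_images(docx_path, dest_dir):
    """Write every word/media/ member of docx_path directly into dest_dir."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    written = []
    with closing(zipfile.ZipFile(docx_path)) as archive:
        for name in archive.namelist():
            if not name.startswith(MEDIA_PREFIX):
                continue
            image_name = name[len(MEDIA_PREFIX):]
            # Skip the bare folder entry and anything nested deeper.
            if not image_name or '/' in image_name:
                continue
            target = os.path.join(dest_dir, image_name)
            with open(target, 'wb') as out:
                out.write(archive.read(name))
            written.append(target)
    return written
```

With the paths from this thread, extract_docx_images(r'C:\kontrolneliste\docx\0000110.docx', r'c:\aa\0000110') should produce c:\aa\0000110\image1.png rather than c:\aa\0000110\word\media\image1.png. Each document gets its own destination folder, so files from different documents do not overwrite each other.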
