I want to compare a list of files with extension .rtf in my directory against a list of files at a given URL, and download those files at the URL that are not found in my directory.

This is where I am at, but I cannot figure out how to filter a list based on file extension.

import urllib

file_list = urllib.urlretrieve("http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/")
dir_list = urllib.urlretrieve("c:/MyPy/")
# create a list of files from url
url_files = []
url_files = file_list
# create a list of files from directory
dir_files = []
dir_files = dir_list
# compare lists -  for *.rtf files url_files not in dir_files download
for url_files in dir_files.iteritems():
	del url_files
# can't figure out how filter a list by file extension.
# download those rtf files not in dir_files to c:/MyPy

Example:

filetype = 'rtf'
files = ['doc.txt', 'stuff.rtf', 'artfunction.exe', 'rtfunc.bat', 'doc2.rtf']

print('\n'.join(filename for filename in files if filename.endswith('.'+filetype)))

BTW, iteritems is deprecated (it was removed entirely in Python 3); use the items method instead.
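For the comparison step itself, you don't need to loop and delete at all — a set difference gives the files to download directly. A minimal sketch, assuming both sides are plain lists of filenames (the names here are made up for illustration):

```python
# Filenames found at the URL and in the local directory (sample data).
url_files = ['race1.rtf', 'race2.rtf', 'race3.rtf']
dir_files = ['race1.rtf', 'notes.txt']

# Files present at the URL but missing locally - these are the ones to download.
missing = sorted(set(url_files) - set(dir_files))
print(missing)
```

Sets discard ordering and duplicates, which is fine here since filenames in one directory are unique anyway.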


Another method

import fnmatch
pattern = '*.rtf'
files = ['doc.txt', 'stuff.rtf', 'artfunction.exe', 'rtfunc.bat', 'doc2.rtf']
print('\n'.join(filename for filename in fnmatch.filter(files, pattern)))
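fnmatch pairs naturally with os.listdir for the local side of the comparison — a small sketch (the helper name is my own):

```python
import fnmatch
import os

def local_rtf_files(path):
    """Return the .rtf filenames found directly in a directory."""
    return fnmatch.filter(os.listdir(path), '*.rtf')
```

Note that fnmatch matches case-insensitively on Windows, so STUFF.RTF would also be picked up there.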

I have got it to this:

import urllib
import fnmatch
import os

"""Module to download only files not in directory from a given url"""
file_list = urllib.urlretrieve("http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/")
path="C:\\MyPy"
dir_list=os.listdir(path)
pattern = '*.rtf'
# create a list of files from url
url_files = []
url_files = ('\n'.join(filename for filename in fnmatch.filter(file_list, pattern)))
# create a list of files from directory
dir_files = []
dir_files = ('\n'.join(filename for filename in fnmatch.filter(dir_list, pattern)))
# compare lists -  for *.rtf files url_files not in dir_files download
for url_files in dir_files.items():
    del url_files
print(url_files)
# can't figure out how filter a list by file extension.
# download those rtf files not in dir_files to c:/MyPy

But I am receiving an HTTP error?

>>> python -u "retrieve.py"
Traceback (most recent call last):
  File "retrieve.py", line 12, in <module>
    url_files = ('\n'.join(filename for filename in fnmatch.filter(file_list, pattern)))
  File "C:\Python27\lib\fnmatch.py", line 63, in filter
    if match(os.path.normcase(name)):
  File "C:\Python27\lib\ntpath.py", line 46, in normcase
    return s.replace("/", "\\").lower()
AttributeError: HTTPMessage instance has no attribute 'replace'
>>> Exit Code: 1

You don't have a proper file list:

>>> file_list = urllib.urlretrieve("http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/")
>>> print(file_list)
('c:\\docume~1\\veijal~1.yks\\locals~1\\temp\\tmpq6ol5c', <httplib.HTTPMessage instance at 0x00ED12B0>)
>>> help(urllib.urlretrieve)
Help on function urlretrieve in module urllib:

urlretrieve(url, filename=None, reporthook=None, data=None)

>>> print(list(file_list[1]))
['content-length', 'set-cookie', 'expires', 'server', 'connection', 'date', 'content-type']
>>> print(open(file_list[0]).read())


<html>
<title>tvn.com.au</title>
<body leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">
<table width="100%" height="100%" border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td align="center"><img src="/tvnlive/v1/system/modules/org.tvn.website/resources/graphics/g_404_error.gif">
    <br><br>
    <font color="#919191"; size="1"; face="Verdana">The page you requested is not available. Please click <a href="/tvnlive/v1/system/modules/org.tvn.website/jsptemplates/tvn_main_menu.jsp;jsessionid=53391838F909BEE13C315632CE7E2BC1.tvnEngine2?TVNSESSION=53391838F909BEE13C315632CE7E2BC1.tvnEngine2" style="color: red; text-decoration: none;  image-decoration: none; font-weight: bolder;">here</a> to return to the homepage.</font>
    </td>
  </tr>
</table>
</body>
</html>
>>>

Print the values to check what you have.
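Putting the pieces together: urlretrieve fetches a single resource, so the .rtf names have to be scraped out of the directory page's HTML first, then compared against the local directory. Here is a sketch in modern Python 3 (urllib.request instead of Python 2's urllib), assuming the server returns an HTML index whose links end in .rtf — the regex parsing is my own assumption, the URL and path are the ones from the thread, and note the page was returning a 404 above, so the URL itself may need correcting before any of this can work:

```python
import os
import re
import urllib.request

BASE_URL = ("http://www.tvn.com.au/tvnlive/v1/system/modules/"
            "org.tvn.website/resources/sectionaltimes/")
LOCAL_DIR = r"C:\MyPy"

def rtf_links(html):
    """Extract .rtf targets from the href attributes of an HTML index page."""
    return re.findall(r'href="([^"]+\.rtf)"', html)

def download_missing(base_url, local_dir):
    """Download every .rtf file listed at base_url that is not already in local_dir."""
    html = urllib.request.urlopen(base_url).read().decode("utf-8", "replace")
    url_files = set(os.path.basename(name) for name in rtf_links(html))
    dir_files = set(f for f in os.listdir(local_dir) if f.endswith(".rtf"))
    for name in sorted(url_files - dir_files):
        urllib.request.urlretrieve(base_url + name,
                                   os.path.join(local_dir, name))

if __name__ == "__main__":
    download_missing(BASE_URL, LOCAL_DIR)
```

A proper HTML parser (html.parser or BeautifulSoup) would be more robust than the regex, but for a plain auto-generated directory index the pattern above is usually enough.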
