I'm writing an application to catalog a very large number of files (2 million+ files)on a share drive. For each file in the directory tree, I'm adding a record to a mysql database. The share drive is active, in that files will be regularly added; thus the app will need to update the database periodically as well.

The database is initially set to have a auto incremented integer field as the primary key.

My question is, as the database needs to be updated, what is the most efficient way to determine if a file record already exists in the database, prior to inserting it. I don't think a query should be performed, each time a file is encountered, to see if it's already been loaded. Is there a better option for a primary key (like the full path name of the file)?




yes there is, as you have already noticed: if you want fast access on the file name, the primary key should be this file name, provided that it is unique. If you had decided an autoincrement to be PK as usually, you would have to create an additional index on column filename. Because for PK an unique index is always automatically generated, you need to manage both indexes in parallel. Therefore, autoincrement pk is somewhat contra-productive.

Well, you are talking of about 2 million + rows what is not really a great amount of table rows (I am involved in a database which has a monthly delta of about 8 millions rows). You can speed-up access by using a hash index too, if you don't need an order on the file name.

The crucial point will be how to load a database with an already existing bulk of rows, If you want to start with, say about 1.5 mill, you can't make use of standard inserts.

-- tesu


II agree with Tesu but wanted to throw this in. Your shared drive must have some subdirectories and you may run into cases where there are duplicate file names in different directories and the files are not the same or are unrelated to each other. Do you plan to consider the full path of the file as a unique name or just the file name its self?

Also do you want to track the creator or owner of the file in your database?

Just trying to help you cover the bases.