I'm trying to parse a large file (544,000+ rows), pull fixed-width fields out of each row, build a SQL insert statement from them, and skip duplicate statements.
Here is my current code:
    import string

    try:
        fdata = open(vhdat).readlines()
    except IOError, detail:
        print "Error opening file. ", detail

    # Define list variables
    vhlist = []
    vhlist = list(fdata)
    sqlstring = ''

    # Process the .dat file.
    for x in range(len(vhlist)):
        vhstring = str(vhlist[x])
        table = string.maketrans('\n', ' ')
        vhstring = vhstring.translate(table)
        vhstring = vhstring.strip()
        NPANXX = vhstring[:6]        # NPA
        NXXType = vhstring[16:18]    # NXXTYPE
        MVC = vhstring[45:50]        # Major Vertical Coordinate
        MHC = vhstring[50:55]        # Major Horizontal Coordinate
        RCLATA = vhstring[55:58]     # Rate Center LATA
        SPCA = vhstring[81:83]       # State, Province, or County Abbreviation
        DIND = vhstring              # Dialable Indicator
        OLRS = vhstring[90:92]       # Other Line Rate Step
        PID = vhstring               # Point Identification
        if DIND == '1':
            DIND = 'Y'
        else:
            DIND = 'N'
        # Create SQL statement
        if NXXType == '00':
            sqltemp = 'insert into database.table values (scn_telco.npanxx_id_seq.nextval,0,\
    \'%s\',\'%s\',%s,%s,\'%s\',\'%s\',\'%s\',\'X\',\'%s\',%s,\'\');\n' % (NPANXX, RCLATA, MVC, MHC, DIND, NXXType, OLRS, SPCA, PID)
            # Look for duplicates. Do not add the line to sqlstring if sqltemp already exists.
            if sqlstring.find(sqltemp) == -1:
                # Print the record number so I can see where the script is while processing.
                # This print line will be removed in production.
                print x
                sqlstring = sqlstring + sqltemp

    f = open('C:\Python25\VHSQL.txt', 'w')
    f.write(sqlstring)
    f.close()
The problem is this:
There are over 544,000 lines in the original file. The script has been running for over 2.5 hours and is only (as I type this) on record 71,672. At this rate, it will take close to a full day to finish. I suspect the duplicate check is the bottleneck: sqlstring.find(sqltemp) rescans the entire accumulated string for every record, so each row takes longer than the one before it.
Is there a more efficient way to check for duplicates? I'm fairly new to programming in general (and Python in particular), so any help would be appreciated.
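From what I've read, a Python set gives constant-time membership tests, so I was thinking of replacing the sqlstring.find() check with a set, and writing each statement to the output file as soon as it passes the check instead of building one giant string. Here's a rough, untested sketch of what I mean (it reuses the slicing and statement text from my script above, and assumes vhdat is defined the same way):

    try:
        infile = open(vhdat)                 # iterate line by line instead of readlines()
    except IOError, detail:
        print "Error opening file. ", detail

    seen = set()                             # statements written so far; 'in' on a set is a hash lookup
    f = open('C:\Python25\VHSQL.txt', 'w')

    for x, line in enumerate(infile):
        vhstring = line.strip()              # strip() already removes the trailing newline
        NPANXX = vhstring[:6]                # NPA
        NXXType = vhstring[16:18]            # NXXTYPE
        MVC = vhstring[45:50]                # Major Vertical Coordinate
        MHC = vhstring[50:55]                # Major Horizontal Coordinate
        RCLATA = vhstring[55:58]             # Rate Center LATA
        SPCA = vhstring[81:83]               # State, Province, or County Abbreviation
        DIND = vhstring                      # Dialable Indicator (same as above)
        OLRS = vhstring[90:92]               # Other Line Rate Step
        PID = vhstring                       # Point Identification (same as above)
        if DIND == '1':
            DIND = 'Y'
        else:
            DIND = 'N'
        if NXXType == '00':
            # Same statement text as above, just built with implicit string concatenation.
            sqltemp = ('insert into database.table values '
                       '(scn_telco.npanxx_id_seq.nextval,0,'
                       "'%s','%s',%s,%s,'%s','%s','%s','X','%s',%s,'');\n"
                       % (NPANXX, RCLATA, MVC, MHC, DIND, NXXType, OLRS, SPCA, PID))
            if sqltemp not in seen:          # O(1) membership test instead of scanning a huge string
                print x                      # progress indicator, to be removed in production
                seen.add(sqltemp)
                f.write(sqltemp)             # write immediately instead of concatenating

    f.close()

Would that actually fix the slowdown, or am I missing something bigger?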