I'm trying to parse through a large file (544,000+ rows). I'm pulling information from each row and creating a SQL statement with that information, and trying to remove duplicates.

Here is my current code:

import string

# vhdat holds the path to the input .dat file
try:
    fdata = open(vhdat).readlines()
except IOError, detail:
    print "Error opening file. ", detail



# Define list variables
vhlist = []
vhlist = list(fdata)
sqlstring = ''



# Process the .dat file.
for x in range(len(vhlist)):
    vhstring = str(vhlist[x])
    table = string.maketrans('\n', ' ')
    vhstring = vhstring.translate(table)
    vhstring = vhstring.strip()
    NPANXX = vhstring[:6]                                # NPA
    NXXType = vhstring[16:18]                           # NXXTYPE
    MVC = vhstring[45:50]                             # Major Vertical Coordinate
    MHC = vhstring[50:55]                             # Major Horizontal Coordinate
    RCLATA = vhstring[55:58]                          # Rate Center LATA
    SPCA = vhstring[81:83]                            # State, Province, or County Abbreviation
    DIND = vhstring[89]                               # Dialable Indicator
    OLRS = vhstring[90:92]                            # Other Line Rate Step
    PID = vhstring[92]                                # Point Identification

    if DIND == '1':
        DIND = 'Y'
    else:
        DIND = 'N'

    # Create SQL Statement
    if NXXType == '00':
        sqltemp = 'insert into database.table values (scn_telco.npanxx_id_seq.nextval,0,\
\'%s\',\'%s\',%s,%s,\'%s\',\'%s\',\'%s\',\'X\',\'%s\',%s,\'\');\n' % (NPANXX, RCLATA, MVC, MHC, DIND, NXXType, OLRS, SPCA, PID)

        # Look for duplicates. Do not add line to sqlstring if sqltemp string already exists
        if sqlstring.find(sqltemp) == -1:
            # Print the record number so I can see where the script is while processing.
            # This print line will be removed in production.
            print x
            sqlstring = sqlstring + sqltemp
        else:
            pass

    else:
        pass


    f = open('C:\Python25\VHSQL.txt', 'w')
    f.write(sqlstring)
    f.close()

The problem is this:
There are over 544,000 lines in the original file. The script has been running for over 2.5 hours and is only (as I type this) on record 71,672. At this rate, it'll take close to an entire day to run.

Is there a more efficient way to check for the duplicates? I'm fairly new to programming in general (and Python in particular) so any help would be appreciated.

I moved the file operation outside of the for loop and it sped up considerably. I didn't even realize it was inside the loop until I took a break from the code.
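A minimal sketch of that restructuring (the statement shape and `build_sql` helper are placeholders, not the real slicing logic above): build everything in memory during the loop, then open and write the output file a single time.

```python
import os
import tempfile

def build_sql(lines):
    # build every statement in memory first (placeholder statement shape)
    statements = []
    for line in lines:
        statements.append("insert into t values ('%s');\n" % line.strip())
    return "".join(statements)

# open, write, and close the output file ONCE, after the loop, not per record
sql = build_sql(["a\n", "b\n"])
out_path = os.path.join(tempfile.mkdtemp(), "VHSQL.txt")
with open(out_path, "w") as f:
    f.write(sql)
```

Opening and rewriting the file on every iteration, as in the original, pays the file-creation cost 544,000 times instead of once.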

Your program reads in all the lines into one list of lines and then even makes one copy of this rather large list. The code below reads in the lines as needed for processing:

...

for vhstring in file(vhdat):
    # remove trailing whitespace
    vhstring = vhstring.rstrip()
    # your processing starts here ...
    NPANXX = vhstring[:6]             # NPA
    NXXType = vhstring[16:18]
    ...

Thanks, Bumsfeld. That does make more sense.

You can read the data into a list. As stated above, the following statement already produces a list, so there is no need for the copy.
fdata = open(vhdat).readlines()
If there are 544,000 recs and each rec is 100 bytes, that is 54,400,000 bytes (54 MB) of memory. That's very doable in today's world, and memory is thousands of times faster than disk. Anyway, you then want to sort fdata on whatever fields you define as causing a duplicate. Let's call this the index. Process the list one record at a time, and only add to the SQL file if the index does not equal the previous index. I suspect the majority of the time is being spent on the SQL file.

The word "sort" can mean any number of algorithms, and that will be the time-consuming part with 544,000 recs. You probably want to go through the list and place all recs less than "x" in a second list, placing the non-"x" recs in another list for the next pass, then sort the list of "x" recs and process it. Increase "x" by some increment, and so on. The optimum value for "x" depends on the length of the record, but test the total time on powers of ten (1,000 recs per batch, 10,000, 100,000), which will tell you the range you want to be in. As a very, very general rule, it is much faster to sort 544 batches of 1,000 recs than one batch of 544,000. HTH; it's late, so post back if this isn't clear.
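The sort-then-compare-with-previous idea can be sketched like this (note that sorting does reorder the output, which may or may not matter for the SQL file):

```python
def dedupe_sorted(records):
    # sort, then keep a record only when it differs from the previous one;
    # duplicates are adjacent after sorting, so one comparison suffices
    out = []
    prev = None
    for rec in sorted(records):
        if rec != prev:
            out.append(rec)
            prev = rec
    return out

unique = dedupe_sorted(["b", "a", "b", "c", "a"])
```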

Remember, function calls are expensive in Python (and some other languages), so keep them to a minimum within the loop.

I'm not sure sorting the lists is required.
You may proceed just as woooee says about storing keys in lists and then simply test "if key not in list".
Something like this (I just use one list here to keep it simple):

alreadyStored=[]
for line in file(filename):
    if not (line in alreadyStored): # You may use part of the line, depending on the key you want to use
        generateSqlInsert() # You know how to do this
        alreadyStored.append(line)

Of course, it is better to define more than one list, but I think that testing presence in the list is faster than sorting (I didn't test, so maybe I'm wrong).

By the way, to increase readability, instead of writing

'insert into database.table values (scn_telco.npanxx_id_seq.nextval,0,\
\'%s\',\'%s\',%s,%s,\'%s\',\'%s\',\'%s\',\'X\',\'%s\',%s,\'\');\n'

you can use:

"insert into database.table values (scn_telco.npanxx_id_seq.nextval,0,\
'%s','%s',%s,%s,'%s','%s','%s','X','%s',%s,'');\n"

Looks better, no?
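A quick demonstration that the two spellings build exactly the same string; the double-quoted form just avoids escaping every single quote:

```python
# same format string written two ways (placeholder values, not the real columns)
single = 'values (\'%s\',\'%s\');\n' % ("A", "B")
double = "values ('%s','%s');\n" % ("A", "B")
# both produce: values ('A','B');
```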

Using a dictionary or set would be faster than "if key not in list" because they are hashed. "If key not in list" has to traverse the list until it finds the key, or until the end is reached if there is no match. It has to do this every time, which makes it inefficient for larger lists. A sort may or may not be faster depending on the sort. But there has been no response, so I think we may have tried to go too far too fast and just confused the OP.
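A set gives the same first-seen behavior as the list version above, but with constant-time membership tests instead of a linear scan; a minimal sketch:

```python
def dedupe_keep_order(lines):
    seen = set()   # hashed membership test: O(1) on average, vs O(n) for a list
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out
```

Unlike the sorting approach, this also preserves the original order of the lines.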

I'm trying out different methods here to time the script running, and was waiting for results before I reposted.

The issue is the entire Insert statement would be the key, not simply a part of it. The entire statement has to be unique in the final file which contains the statements.

How would I do this with a dictionary?

You don't need the insert statement, only the line...

@woooee: The key-in-list test may not be the fastest, but (just as you do for your sorting) you can cut the list into a dictionary of 1000 lists.
Something like this (a very simple sample):

import string

dict = {}
for i in range(1000):
    dict[string.zfill(i, 3)] = []  # zero-pad the key ('001')
for line in file(filename):
    if line not in dict[line[x:x+3]]:  # x points to where three digits appear in the line
        generateSQLStatement()
        dict[line[x:x+3]].append(line)

Then you only have to search ONE list.
When you sort, you have to sort all the lists...
I'm not sure this is better than sorting (I didn't run tests), but it is certainly simpler to write than your own sorting algorithm ;)
This may not be the best solution, I agree. In fact, I don't often have to deal with this kind of problem...

Anyway, about the dictionary (probably faster anyway), you have at least two or three ways to do it.
The simplest (I don't know if it is the most efficient) may be:

dict={}
for line in file(filename):
    if not dict.has_key(line):
        generateSQLStatement()
        dict[line]=None # we don't care about the value

Of course, here again you may prefer to create a number of sub-dictionaries (for example, as I showed just before for the lists).

Ok. Doing this:

for vhstring in file(vhdat):
....... # process string

took 3 hours 9 minutes to process.

Doing this:

vhdat = "file location"
fdata = open(vhdat).readlines()

for vhstring in fdata:
...... # process string

Took 58 minutes to achieve the same results.

This is something that will run only once per month at 2 am, so I think 58 minutes is acceptable.

The dict stuff is a bit over my head at the moment, but it may be more efficient and I'll check into that as I learn more about Python. I'm still getting my feet wet. =)

Thanks for everyone's help.

Sorry if this is explained too fast.

About dictionaries.
A dictionary is a type whose variables are built like this:
dict = {"key1": "value1", "key2": "value2", ...}
Keys and values can be whatever you want.
Each key is unique, and you can recall a value like this:
value = dict["key1"]

So, to be sure you don't have any duplicates, you can create a dictionary that has your lines as keys:
dict[line] = None (or whatever you want; we only use it for the keys).

Before creating this new entry, you can test whether the key already exists by using dict.has_key(line).
If this returns True, the key exists, so you have already processed a line like this one.

dict = {}
for line in listLines:
    if not dict.has_key(line):  # test whether the line was already processed
        generateSQLStatement()
        dict[line] = ''   # mark this line as processed

That's all. But anyway, if your solution is OK...

Ok. I used a dictionary and it ran in 13 minutes. Huge difference.
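For what it's worth, in current Python the whole pass could be sketched roughly like this. The table name and statement shape are placeholders, only the NPANXX slice is shown, and a set stands in for the dictionary-of-keys trick:

```python
import os
import tempfile

def convert(in_path, out_path):
    # stream the input line by line and write each unique statement once
    seen = set()
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip()
            npanxx = line[:6]  # fixed-width slice, as in the original script
            stmt = "insert into t values ('%s');\n" % npanxx
            if stmt not in seen:   # hashed lookup instead of sqlstring.find()
                seen.add(stmt)
                dst.write(stmt)

# tiny demo on a throwaway input file
d = tempfile.mkdtemp()
src_path = os.path.join(d, "in.dat")
dst_path = os.path.join(d, "out.sql")
with open(src_path, "w") as f:
    f.write("123456aaa\n123456bbb\n654321ccc\n")
convert(src_path, dst_path)
```

This combines the three speedups from the thread: iterate the file directly, test membership in a hashed collection, and open the output file once.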

Thanks for everyone's help on this. I've learned some about Python and that's always a good thing. :)

Now remember that dictionary keys will not necessarily be in the same order you put them in when you created the dictionary.

In this case, this is not a problem because the dictionary is used only for duplicate checking. The order of the SQL statements matches the order of the lines in the input file...
