Hello,

Programming/Python newb here looking for some help with arrays. I am trying to write a parsing program that takes a comma delimited csv file, compares entries, and outputs the comparisons in a particular format.

The csv file has two columns: col[0] contains article identifiers and col[1] contains assigned keywords. It looks something like this:

00002944423, antibiotics
00002944423, resistance
00002953423, mortality

I wrote the following code, which compares the entry in row i, col 0, with the entry in row i+1, col 0. If the article IDs are equal, then I need to print out the two keywords on the same line.

import csv

edgelist1=[] #initialize the first edge list
listmax = 0 #use to calculate the largest index value of the array

#import the article_kwd file
reader = csv.reader(open('C:/scientometric/kw.csv', 'r'), delimiter = ',', quoting=csv.QUOTE_NONE)
for row in reader:
   edgelist1.append (row)
   listmax = listmax+1


#use to determine whether or not two keywords are connected
#by checking to see if they are both assigned to the same article
for i in range (0,listmax):
    nextrow=i+1
    while edgelist1[i][0]==edgelist1[nextrow][0]:
       print (edgelist1[i][1],edgelist1[nextrow][1])
       nextrow=nextrow+1

This works really well. It produces an output like this:
ANTIMICROBIAL RESISTANCE CARRIAGE
ANTIMICROBIAL RESISTANCE EXPERIENCE
ANTIMICROBIAL RESISTANCE INFECTION
ANTIMICROBIAL RESISTANCE PNEUMOCOCCAL-INTERVENTION-PROJECT
ANTIMICROBIAL RESISTANCE SEROTYPES
ANTIMICROBIAL RESISTANCE SPAIN
ANTIMICROBIAL RESISTANCE STRAINS

However, I need to take this one step further. I need to replace the keywords with their key values. The output needs to look something like this:

1 2
1 3
1 4
2 3

What I have been trying to do is write the output into an interim array. However, no matter what I try, I keep getting an error message that the list index is out of range.

Any advice on how to get past this?

The error comes from indexing with nextrow, as in this line
print (edgelist1[i][1],edgelist1[nextrow][1])
When you get to the last record, nextrow points beyond the end of the list, so you want to use
for i in range (0,listmax-1):
since there is nothing to compare the last record to. Alternatively, you can use
for i in range (1, listmax):
and compare each record to the previous record. Also, be sure that the records are in numerical order; otherwise sort them first.
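
To make those two points concrete, here is a minimal sketch that sorts the rows by article ID and then compares each record with the previous one. The function name `adjacent_pairs` and the sample rows are made up for illustration, and it assumes the IDs are zero-padded strings of equal length, so that sorting them as text gives numerical order.

```python
def adjacent_pairs(rows):
    """Return (keyword, keyword) for neighbouring rows that share an article ID."""
    # sorted() is stable, so keywords keep their original order within one ID
    ordered = sorted(rows, key=lambda row: row[0])
    pairs = []
    # start at 1 so that ordered[x - 1] always exists
    for x in range(1, len(ordered)):
        if ordered[x][0] == ordered[x - 1][0]:
            pairs.append((ordered[x - 1][1], ordered[x][1]))
    return pairs


# hypothetical sample rows in the same shape as the csv file, deliberately unsorted
sample = [
    ["00002953423", "mortality"],
    ["00002944423", "antibiotics"],
    ["00002944423", "resistance"],
]
print(adjacent_pairs(sample))
```

Because the loop starts at 1 and only ever looks backwards, the index can never run past the end of the list.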

Second, you do not need the while loop, as the for loop already goes through all of the records, and with both you could get duplicate output, one from the while loop and one from the for loop. You can then print the index value from the for loop, which is also the record number (or index+1 if you want the first record to be one instead of zero). Finally, it is bad practice to use "i", "l", or "o" as single-letter variable names, as they can look like numbers.

for x in range (0,listmax-1):
    nextrow=x+1
    if edgelist1[x][0]==edgelist1[nextrow][0]:
       ##   how this print statement should be formatted depends on
       ##   the Python version you are using
       print ("rec # %d  %s  %s" % (x+1, edgelist1[x][1], edgelist1[nextrow][1]))

I tried the code, and there was a slight problem with it: it did not extract all the possible combinations for any given article, only each keyword paired with the one immediately after it. Here is an example (I ran 10 iterations of the main loop):

The output of your code:
rec # 1 ANTIMICROBIAL RESISTANCE CARRIAGE
rec # 2 CARRIAGE EXPERIENCE
rec # 3 EXPERIENCE INFECTION
rec # 4 INFECTION PNEUMOCOCCAL-INTERVENTION-PROJECT
rec # 5 PNEUMOCOCCAL-INTERVENTION-PROJECT SEROTYPES
rec # 6 SEROTYPES SPAIN
rec # 7 SPAIN STRAINS
rec # 9 BILE-DUCT EXPERIENCE

Output of the original code:
ANTIMICROBIAL RESISTANCE CARRIAGE
ANTIMICROBIAL RESISTANCE EXPERIENCE
ANTIMICROBIAL RESISTANCE INFECTION
ANTIMICROBIAL RESISTANCE PNEUMOCOCCAL-INTERVENTION-PROJECT
ANTIMICROBIAL RESISTANCE SEROTYPES
ANTIMICROBIAL RESISTANCE SPAIN
ANTIMICROBIAL RESISTANCE STRAINS
CARRIAGE EXPERIENCE
CARRIAGE INFECTION
CARRIAGE PNEUMOCOCCAL-INTERVENTION-PROJECT
CARRIAGE SEROTYPES
CARRIAGE SPAIN
CARRIAGE STRAINS
EXPERIENCE INFECTION
EXPERIENCE PNEUMOCOCCAL-INTERVENTION-PROJECT
EXPERIENCE SEROTYPES
EXPERIENCE SPAIN
EXPERIENCE STRAINS
INFECTION PNEUMOCOCCAL-INTERVENTION-PROJECT
INFECTION SEROTYPES
INFECTION SPAIN
INFECTION STRAINS
PNEUMOCOCCAL-INTERVENTION-PROJECT SEROTYPES
PNEUMOCOCCAL-INTERVENTION-PROJECT SPAIN
PNEUMOCOCCAL-INTERVENTION-PROJECT STRAINS
SEROTYPES SPAIN
SEROTYPES STRAINS
SPAIN STRAINS
BILE-DUCT EXPERIENCE

Essentially, the original code was comparing the first record with the second, third, fourth, etc., then the 2nd record with the 3rd, 4th, etc., then the 3rd with the 4th, and so on. Almost like making a matrix.

Does this make any sense?

Your original code will give results until it finds a record that does not match, and not necessarily all of the matches. It is also redundant. You print
ANTIMICROBIAL RESISTANCE CARRIAGE
ANTIMICROBIAL RESISTANCE EXPERIENCE
because the article IDs are all equal, then you print
CARRIAGE EXPERIENCE
which is a duplication of the first printing. I don't know if this is what you want or not.

The problem with the original code is that it has to process the file many, many times. That is, each record is (potentially) compared with every other record. If the program doesn't take too long to run, then this is OK. Otherwise you want to process the file in one pass. The best way to do this is probably to place the records in a dictionary, where the key is the article number and the value is a list of entries, each entry containing whatever you want, the description and the record number perhaps. You then extract any key that has more than one entry. Post the first 10 or so records if you want any more help, as the code snippet that I posted should work for the 3-record example in your first post.
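
A rough sketch of that dictionary approach, written for Python 3. The function name `pairs_by_article` and the sample rows are invented for illustration, and pairing the grouped keywords with `itertools.combinations` is one possible way to do the extraction step, not something stated earlier in the thread.

```python
from collections import defaultdict
from itertools import combinations


def pairs_by_article(rows):
    """Group keywords by article ID in one pass, then pair them up."""
    keywords = defaultdict(list)          # article ID -> list of keywords
    for article_id, keyword in rows:
        keywords[article_id].append(keyword)

    pairs = []
    for article_id, words in keywords.items():
        if len(words) > 1:                # only IDs with more than one entry
            # combinations() yields each unordered pair exactly once
            pairs.extend(combinations(words, 2))
    return pairs


# hypothetical sample rows in the same shape as the csv file
sample = [
    ["00002944423", "antibiotics"],
    ["00002944423", "resistance"],
    ["00002944423", "carriage"],
    ["00002953423", "mortality"],
]
for first, second in pairs_by_article(sample):
    print(first, second)
```

Since the grouping is by key, the rows do not have to be sorted first, and the list is only walked once.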

If you want to use the while() statement, then you must also test for the end of the list:

while (nextrow < listmax) and \
      (edgelist1[i][0]==edgelist1[nextrow][0]):
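
Put into a complete loop, that bounds check could look like the sketch below. It keeps the all-pairs behaviour of the original code and assumes rows with the same article ID sit next to each other; `all_pairs` and the sample rows are made-up names for illustration.

```python
def all_pairs(rows):
    """Pair every keyword with each later keyword of the same article ID."""
    pairs = []
    listmax = len(rows)
    for x in range(listmax):
        nextrow = x + 1
        # testing nextrow < listmax first stops the index running off the end
        while nextrow < listmax and rows[x][0] == rows[nextrow][0]:
            pairs.append((rows[x][1], rows[nextrow][1]))
            nextrow += 1
    return pairs


# hypothetical sample rows, already grouped by article ID
sample = [
    ["00002944423", "antibiotics"],
    ["00002944423", "resistance"],
    ["00002944423", "carriage"],
    ["00002953423", "mortality"],
]
print(all_pairs(sample))
```

The order of the two tests matters: Python stops evaluating `and` as soon as the first part is false, so `rows[nextrow]` is never looked up with an index that is too large.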

The following should print the matches once only, but it is untested.

nextrow = 0
thisrow = 0
prev_number = ""
while thisrow < listmax-1:
    first_row = thisrow      ##  index of the record being compared
    nextrow = thisrow + 1
    this_record = edgelist1[first_row]
    while (nextrow < listmax) and \
            (this_record[0]==edgelist1[nextrow][0]):

        ##  print new number
        if this_record[0] != prev_number:
            print ("\n-----", this_record[0])
            prev_number = this_record[0]

        ##  first_row (not thisrow) so the index matches this_record
        print (first_row, this_record[1], nextrow, edgelist1[nextrow][1])

        ##   skip over these records
        thisrow = nextrow
        nextrow=nextrow+1
    thisrow += 1

In this case

ANTIMICROBIAL RESISTANCE CARRIAGE != CARRIAGE EXPERIENCE

I managed to come up with a solution to the problem; instead of trying to read the output to an array, I printed it to an interim file in csv format, and from there, read the results back into another array.
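
For anyone who would rather skip the interim file, the keyword-to-number step can also be done in memory with a dictionary. This is only a sketch, written for Python 3: the thread never says where the key values come from, so it assumes each keyword simply gets the next integer the first time it is seen, and `number_pairs` is an invented name.

```python
def number_pairs(pairs):
    """Replace each keyword in (keyword, keyword) pairs with an integer key."""
    key_of = {}                                  # keyword -> integer key
    numbered = []
    for first, second in pairs:
        for word in (first, second):
            if word not in key_of:
                key_of[word] = len(key_of) + 1   # assumed scheme: keys start at 1
        numbered.append((key_of[first], key_of[second]))
    return numbered, key_of


# a few keyword pairs taken from the output shown earlier in the thread
pairs = [
    ("ANTIMICROBIAL RESISTANCE", "CARRIAGE"),
    ("ANTIMICROBIAL RESISTANCE", "EXPERIENCE"),
    ("ANTIMICROBIAL RESISTANCE", "INFECTION"),
    ("CARRIAGE", "EXPERIENCE"),
]
numbered, key_of = number_pairs(pairs)
for a, b in numbered:
    print(a, b)
```

With these four pairs the printed lines are 1 2, 1 3, 1 4 and 2 3, and `key_of` can be kept around to translate the numbers back into keywords.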

Thank you for your help.
