Hey,

I've just joined up here hoping somebody might be able to help me with a project I've got on at work at the moment.

I've been learning Python using the "let's just do it and see what happens" method. I keep running into problems and am now 100% stuck on where to go next.

Basically, I've got a CSV with 4 columns in it:

Domain = string
Page = string
Linking = string
Size = integer

I need to perform several operations on these that seemed basic to me at first but soon got really complicated.

I'm converting my CSV to a GraphML file (XML-based) that will run in yEd.

I need to be able to get a list of all the 'nodes':
Every unique item within the 'Domain' column is a node
Every unique item within the 'Linking' column is a node
Every item within the 'Page' column is a node. However, this is where it gets complicated and I'm struggling to put it in plain text: every unique combination of the 'Domain' and 'Page' columns needs to be listed, i.e. if "Page1" was listed twice but the 'Domain' column was different for these occurrences, "Page1" would need to be listed twice (I decided to do this with MD5 hashes).

That is the first stage of this project anyway, there is another bit after (connecting all the nodes up) but I can't get onto that until I solve this :(

This is the code that I currently use:

#Import needed packages
import csv, array, md5, decimal
from useful_funcs import collections

#Import CSV (or Database in future)
inputFile = open("C:\\Users\\RobH\\Desktop\\xml.csv", "r")
reader = csv.reader(inputFile)

#Declare memoryTable
memoryTable = []

#Store CSV (or DB) in memoryTable
for row in reader:
    memoryTable.append(row) 

#Declare hashTable
hashTable = []

#Declare hash2Table
hash2Table = []

#Hash columns 0 and 1
n=0
for r in memoryTable:
    i = 0
    string2Hash = ''
    while i < len(r)-2:
       string2Hash += r[i]
       i+=1

#Get MD5 of hash
    string2Hashmd5 = md5.new(string2Hash)
    string2Hashmd51 = string2Hashmd5.hexdigest()

#Append hash to memoryTable
    memoryTable[n].append(string2Hashmd51)
    n+=1
    
#Hash2 columns 0, 1 and 2
n=0
for r in memoryTable:
    i = 0
    string4Hash = ''
    while i < len(r)-2:
       string4Hash += r[i]
       i+=1
       
#Get MD5 of hash2
    string4Hashmd5 = md5.new(string4Hash)
    string4Hashmd51 = string4Hashmd5.hexdigest()
   
#Append hash2 to memoryTable
    memoryTable[n].append(string4Hashmd51)
    n+=1

#Sort memoryTable
from operator import itemgetter
memoryTable.sort(key=itemgetter(4))

#Copy memoryTable to hashTable and hash2Table
for row in memoryTable:
    hashTable.append(row)
    hash2Table.append(row) 

#Remove all hash duplicates from hashTable
hashTable2 = collections.removeduplicates(hashTable,4)
#collections.printy(hashTable2)

#Search memoryTable for hash duplicates and add up all values for first edges

#Append added up hash values to hashTable

#Remove all hash2 duplicates from hash2Table (nodes)
hash2Table2 = collections.removeduplicates(hash2Table,5)
#collections.printy(hash2Table2)

#Search memoryTable for hash2 duplicates and add up all values for second edges
HashSize = []
roww = 0
for r in hash2Table2:
    Col3 = [hash2Table2[roww][5]]
    HashSize.append(Col3)
    roww+= 1
#collections.printy(HashSize)

#collections.printy(memoryTable)
#Append added up hash2 values to hash2Table

hash2Table2_2 = list(hash2Table2)
i = 0
while i < len(HashSize)+1:
    x = 0
    templist = []
    for r in HashSize:
        if r[0] == memoryTable[x][5]:
            templist.append(memoryTable[x][3])
        x+= 1
    y = 0
    templist1 = []
    while y < len(templist):
        numberr = decimal.Decimal(templist[y]) * 100
        templist1.append(numberr / 100)
        y+=1
    templist2 = sum(templist1)
    #print templist2
    hash2Table2_2.append(templist2)
    i+= 1
    #collections.printy(hash2Table2_2)

#something isn't working right... not sure what




#value = int(templist[0])
#print value
#listy = sum(r[0] for r in templist)
#print listy
#collections.printy(hash2Table2_2)

and the collections package is:

def printy(hashTable):
    ret = ''
    for r in hashTable:
        print r

def removeduplicates(hashTable,column):
    ret = ''
    listOfHashTable = list(hashTable)
    col = column
    prev = 0
    i = 0
    z = 1
    while i != z:
        z = len(listOfHashTable)
        for r in listOfHashTable:
            if r[col] == prev:
                rownumb = listOfHashTable.index(r)
                listOfHashTable.pop(rownumb)
            prev = r[col]
        i = len(listOfHashTable)
    return tuple(listOfHashTable)

If nobody wants to help me that's ok - I'm sure I'll solve it at some point but at the moment it's really REALLY annoying me :(

Thanks a lot,
Rob

All 3 Replies

Sounds like you need a class to store all this data. Take this example csv file:

my_csv.csv

Name,Age
Bill,43
Eric,20

Okay, call this function on the csv file to get the data out.

import csv

def read_csv(csv_file):
    # Read every row into a list of dicts keyed by the header row.
    # The with-block closes the file even if reading fails.
    with open(csv_file) as ss:
        reader = csv.DictReader(ss)
        data = list(reader)
    return data

data = read_csv('my_csv.csv')

Using the very useful DictReader class, data (the return value) would contain this (note that every value comes back as a string):

[{'Name': 'Bill', 'Age': '43'}, {'Name': 'Eric', 'Age': '20'}]
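
To make the string point concrete, here is a small sketch. It assumes Python 3 (the thread itself is Python 2) and reads the same two-row sample from an in-memory string rather than a file; the `io.StringIO` wrapper and the names `sample`, `rows` and `ages` are only for this illustration. It shows that a numeric column such as Age needs an explicit `int()` conversion before you can do arithmetic on it.

```python
import csv
import io

# Same content as my_csv.csv, held in memory for the example
sample = "Name,Age\nBill,43\nEric,20\n"

# DictReader uses the first line as the keys for every following row
reader = csv.DictReader(io.StringIO(sample))
rows = [dict(row) for row in reader]

# Every value is a string, so convert numeric columns explicitly
ages = [int(row["Age"]) for row in rows]

print(rows)
print(ages)
```

The same conversion would apply to the Size column in the original question.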

Sweet, now loop through the list and for each dict create an instance of a class like the one below:

class MyClass:
    def __init__(self, output_dict):
        # Copy the CSV columns onto attributes
        self.name = output_dict['Name']
        self.age = output_dict['Age']

Put all those instances in another list. Now you can loop through the list and compare various columns against the other columns like this:

# Build one MyClass instance per CSV row
class_list = [MyClass(row) for row in data]

for x in class_list:
    print(x.name)
    print(x.age)

Anyway, hope this helps. If I wanted a list of all nodes I could say:

# Start with empty lists, then fill them from the instances
name_list = []
age_list = []
for x in class_list:
    name_list.append(x.name)
    age_list.append(x.age)

node_list = name_list + age_list
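
Applied to the four columns in the original question, the same idea might look like the sketch below. It rests on a few assumptions: Python 3.7 or later (for ordered dicts), a header row named Domain, Page, Linking, Size (the real file may not have one), and made-up sample rows. Instead of MD5 hashes it uses a `(domain, page)` tuple as the key, which keeps the same page under two different domains separate. `dict.fromkeys` is used here only as a way to drop duplicates while keeping first-seen order.

```python
import csv
import io

# Made-up sample rows; the header names are an assumption
sample = (
    "Domain,Page,Linking,Size\n"
    "a.com,Page1,x.com,3\n"
    "b.com,Page1,x.com,2\n"
    "a.com,Page1,y.com,5\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))

# dict.fromkeys drops duplicates and keeps the first-seen order
domains = list(dict.fromkeys(row["Domain"] for row in rows))
links = list(dict.fromkeys(row["Linking"] for row in rows))

# A page is only unique together with its domain, so key on the pair
pages = list(dict.fromkeys((row["Domain"], row["Page"]) for row in rows))

print(domains)
print(pages)
print(links)
```

With this sample, "Page1" ends up listed twice, once for each domain, which is the behaviour the question asks for.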

Cool! Thanks a lot, Zac. I didn't use everything you said because I couldn't get my head around the classes (I've got quite a few different projects confusing me at the moment), but I used a big chunk of your code, totally rebuilt my current module, and it now works!

Now I'm going to go on to part 2, which is where things are going to get even trickier for me, but we'll see what happens.

Thanks again!

If you are interested this is what I've used:

#Import needed packages
import csv, array, md5, decimal, pdb
from useful_funcs import collections

#Debugger
#pdb.set_trace()

#Import CSV (or Database in future)
inputFile = open("C:\\Users\\RobH\\Desktop\\xml.csv", "r")
reader = csv.reader(inputFile)

#Declare Table
Table = []

#Store CSV (or DB) in Table
for row in reader:
    Table.append(row)

#Count row numbers
row_number = len(Table)

#add MD5 of domain and category to each row
for row in Table:
    #Join every column except the last two, i.e. columns 0 and 1 of a 4-column row
    string2Hash = ''.join(row[:-2])
    row.append(md5.new(string2Hash).hexdigest())

from operator import itemgetter
Table.sort(key=itemgetter(4))

Domain_list = []
r = 0
while r < row_number:
    Domain_list.append(Table[r][0])
    r+=1
#print Domain_list

Link_list = []
r = 0
while r < row_number:
    Link_list.append(Table[r][2])
    r+=1
#print Link_list

tables = []
tables = collections.removeduplicates(Table,4)

row_number = 0
for row in tables:
    row_number+=1

Category_list = []
r = 0
while r < row_number:
    Category_list.append(tables[r][1])
    r+=1
    
#PRINT LIST OF NODES DONEEEEEE!!!!!!
#collections.printy(Domain_list + Category_list + Link_list)

Nice man. Good luck. Come on back if you run into more problems.
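
For the "add up all values" step that the first script left unfinished, one possible approach is to total the Size column per (Domain, Page) pair in a dictionary rather than searching the table again for each hash. This is only a sketch under some assumptions: Python 3, rows in Domain, Page, Linking, Size order with no header, and made-up sample values. `total_sizes` is a hypothetical helper name, and `Decimal` is used only because the original script imported `decimal`.

```python
from collections import defaultdict
from decimal import Decimal

def total_sizes(rows):
    """Sum the Size column for each (Domain, Page) pair."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal('0')
    for domain, page, _linking, size in rows:
        totals[(domain, page)] += Decimal(size)
    return dict(totals)

# Made-up rows in Domain, Page, Linking, Size order
sample_rows = [
    ["a.com", "Page1", "x.com", "3"],
    ["b.com", "Page1", "x.com", "2"],
    ["a.com", "Page1", "y.com", "5"],
]

totals = total_sizes(sample_rows)
print(totals)
```

Because the dictionary is keyed on the pair itself, no sorting or duplicate removal is needed before summing.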
