finding partial duplicate data in csv

wolf29 0 Newbie Poster

11 Years Ago

I have a program that takes csv values, parses the content and inserts it in an sqlite3 database.
I am using sqlite3 to save on space and to make the py scripts a little more portable.
The next refactoring may well need to have postgresql or mysql to deal with the long-term inventory-detail.

This is a Qualys security report munger, but it is basically the same as a inventory/sales app.

We have customers (the hosts and their IP address and OS) -- MHosts
        inventory (Qualys QID for each vulnerability)     -- QIDs
A)      invoices (The curent test from Qualys)            -- Title_Block
        invoice-line-item (each vulnerability found)      -- Events

I was breaking the data into Title_Block and Vulnerabilities. The Vulnerabilities table was the entire MHost+QUD+Event for each event detected (all row information from all rows).

It seemed to make more sense to break the data into tables as noted in section A above, but the programming is still wasting a bunch of space.
For instance, each MHost entry from the qualys report is going into the table

SELECT count(*) FROM mhosts;
4500
SELECT count(DISTINCT IP) FROM mhosts;
147

The Events table handles how many times a given IP enters the picture, as IP as a field in that table too.
I have the same issue in the QIDs table. QIDs appear <= per host.

This example of what I am doing comes from https://github.com/wolf29/csvEdit/blob/master/test.py

def load_mhost(f, outdir, d):
    filename = f
    current_db = outdir+'/'+d
    with open(outdir+'/'+filename, 'rb') as mycsv:
        con = lite.connect(outdir+'/'+d)
        cur = con.cursor()
        cur = con.commit()
        reader = csv.reader(mycsv)
        counter = 0
        counter2 = 0
        for counter,row in enumerate(reader):
            host = (row[0], row[1], row[2])

            try:
                con = lite.connect(outdir+'/'+d)

                with con:
                    cur = con.cursor()

                    cur.execute("CREATE TABLE IF NOT EXISTS mhosts(Id INTEGER PRIMARY KEY, IP TEXT, NetBIOS TEXT, OS TEXT)")
                    cur.execute("INSERT INTO mhosts(IP, NetBIOS, OS) VALUES(?, ?, ?)", host)
                    con.commit()
                    counter2 = counter2 + 1

            except lite.Error, e:
                if con:
                    con.rollback()
                    print "Error %s:" % e.args[0]
                    sys.exit(1)

            finally:
                if con:
                    con.close()

I think I could add something to the cur.execute("INSERT INTO mhosts(IP, NetBIOS, OS) VALUES(?, ?, ?)", host) like cur.execute("INSERT INTO mhosts(IP IF NOT EXIST, NetBIOS, OS) VALUES(?, ?, ?)", host)
Since the current script only looks at one test at a time, there is no chance that an IP address might be reused as some other machine name or OS. I don't know if this is the best way to handle this.

Then I just have to sort out how to find most events per mhost, most common QID, mhost with highest-CVSS-value exploitable QIDs. That is probably going to be another script.

1 Contributor
0 Replies
176 Views

Be the first to reply

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.