Hi, I am new to Python and to programming in general.
I have a script that compares two files by their first column and prints the lines they have in common. I need it to compare the first three columns instead.
I have an idea of changing the if to something like:

f1line = f1.readlines()
for i in f1line:
    if i.split()[1] == f2line[x].split()[1] and i.split()[2] == f2line[x].split()[2]:

I am just not sure whether that is right, or whether it will work when the files don't have the same number of lines.

import sys

f1=file(sys.argv[1])
f2=file(sys.argv[2])

d1 ={}
for l in f1:
  c=l.split()[0]
  d1[c]=l

d2 ={}
for l in f2:
  c=l.split()[0]
  d2[c]=l

for i in d1.keys():
  cc=len(i)
  if d2.has_key(i):
     print i, d1[i][cc+1:].strip(),d2[i][cc+1:].strip()

You can use a tuple of the first 3 columns as a dictionary key.

import sys

f_list = open(sys.argv[1]).readlines()

## you can use a set or dictionary
unique_dic = {}
for rec in f_list:
    sub_str = rec.split()
    key = (sub_str[0], sub_str[1], sub_str[2])
    if key not in unique_dic:
        unique_dic[key] = 1

for rec in open(sys.argv[2]):
    sub_str = rec.split()
    key = (sub_str[0], sub_str[1], sub_str[2])
    if key in unique_dic:
        print rec
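
To make the tuple-key idea concrete, here is a minimal Python 3 sketch that works on lists of lines instead of files named on the command line. The names `key_of` and `common_lines` and the sample data are made up for illustration and do not come from the thread. Because the lookup is by key and not by line position, the two inputs do not need the same number of lines.

```python
def key_of(line, ncols=3):
    """Return the first `ncols` whitespace-separated fields as a tuple."""
    # A line with fewer than `ncols` fields yields a shorter tuple, so it
    # can only match another line that is short in the same way.
    return tuple(line.split()[:ncols])


def common_lines(first_lines, second_lines, ncols=3):
    """Yield the lines of `second_lines` whose first `ncols` columns
    also occur in `first_lines`, with the trailing newline removed."""
    seen = {key_of(line, ncols) for line in first_lines}
    for line in second_lines:
        if key_of(line, ncols) in seen:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two input files.
    first = ["chr1 10 20 geneA\n", "chr1 30 40 geneB\n", "chr2 10 20 geneC\n"]
    second = ["chr1 10 20 0.5\n", "chr2 99 99 0.1\n"]
    for line in common_lines(first, second):
        print(line)
```

With real files, the open file objects can be passed in place of the lists, since both arguments only need to be iterable line by line.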

So I did it this way. The problem I have now is that it prints i and j on separate lines, and I need each pair of matching lines printed on one line of the output.

I also tried the way that was posted, and it has much the same problem in the output file.

#! /usr/bin/python

import sys

f1 = file(sys.argv[1])
f2 = file(sys.argv[2])

f1line = f1.readlines()
f2line = f2.readlines()

f1.close()
f2.close()


for i in f1line:
  for j in f2line:
    if (i.split()[0] == j.split()[0]) and (i.split()[1] == j.split()[1] and i.split()[2]==j.split()[2]):
      print i, j

"i" and "j" (terrible names for variables, "i" can look like a one, and neither is descriptive) still have the newline character from the file, so you want to strip them.

print i.strip(), j.strip()
##
##   also, eliminate the multiple splits
for i in f1line:
    split_i = i.split()
    for j in f2line:
        split_j = j.split()
        if (split_i[0] == split_j[0] and split_i[1] == split_j[1] and
                split_i[2] == split_j[2]):
            print i.strip(), j.strip()
#
#     or you can use a loop, which may be easier to read
#
for i in f1line:
    split_i = i.split()
    for j in f2line:
        split_j = j.split()
        match = True
        for k in range(0, 3):
            if split_i[k] != split_j[k]:
                match = False
        if match:
            print i.strip(), j.strip()
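
The three-column test can also be written as one comparison of list slices, since split() returns a list and lists compare element by element. The following is a small Python 3 sketch on made-up sample lines; the function name `joined_matches` is hypothetical. It is still a nested loop, so it tidies the code without changing the running time.

```python
def joined_matches(lines_a, lines_b, ncols=3):
    """Return each matching pair of lines as one output line."""
    out = []
    for a in lines_a:
        key_a = a.split()[:ncols]            # split once per outer line
        for b in lines_b:
            if key_a == b.split()[:ncols]:   # slices compare element by element
                # strip() removes the newline read from the file, so the
                # pair ends up on a single line
                out.append(a.strip() + " " + b.strip())
    return out


if __name__ == "__main__":
    lines_a = ["x 1 2 left\n", "y 1 2 left\n"]
    lines_b = ["x 1 2 right\n", "x 1 3 right\n"]
    print("\n".join(joined_matches(lines_a, lines_b)))
```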

Thanks!!! That was very helpful.
I still have a problem, though. The program runs fine, but when I tried it on my real files (instead of the test files) it takes forever to run, because they are very long (about 600000 lines).
Any ideas on how I can make it more efficient?

Thanks again. I am learning a lot and enjoying it :)

Ok, I have an idea. I just don't know how to do it well.
I can use the original program that compares just the first column of each line. I just need to concatenate the first 3 columns into one string to compare.
I would have to use something like zfill() to pad with zeros, so that the keys don't overlap or pick the wrong lines. And I have to concatenate the strings with +. Do I have to make the columns go from int to str?

Cheers!!!

Do I have to make the columns go from int to str?

No. Use a tuple, as in my previous post; a tuple can contain a mix of strings and integers. Comparing integers is usually preferable to comparing strings, so if record[1] holds an integer, for example, the tuple would be built with
key1_tuple = (record[0], int(record[1]), record[2])
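
One practical difference between the two kinds of key: as strings, "007" and "7" are different, while as integers they are equal. The Python 3 sketch below illustrates this with made-up records; `make_key` is a hypothetical helper, and it assumes column 1 always holds a valid integer.

```python
def make_key(record):
    """Build a lookup key from a split record, reading column 1 as an integer."""
    # Assumption: column 1 always holds a valid integer; int() raises
    # ValueError otherwise.
    return (record[0], int(record[1]), record[2])


rec_a = "chr1 007 x left".split()
rec_b = "chr1 7 x right".split()

print(tuple(rec_a[:3]) == tuple(rec_b[:3]))  # columns compared as strings
print(make_key(rec_a) == make_key(rec_b))    # column 1 compared as an integer
```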

it takes the program forever to run. Any ideas on how can I make it more efficient?

Comparing tuples should be faster, but disk I/O is usually the time waster. Take a look at Parallel Python if you are on a multi-core machine. It seems pretty straight-forward.
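
Separately from disk I/O, the nested-loop version compares every line of one file with every line of the other, so two files of 600000 lines need about 3.6 × 10^11 comparisons. A dictionary keyed on the tuple replaces the inner loop with a single lookup, so each file is read once. The following is a minimal Python 3 sketch of that idea, not code from the thread: `join_on_columns` is a hypothetical name, io.StringIO objects stand in for the two files, and keys are assumed to be unique within the first file.

```python
import io


def join_on_columns(first, second, ncols=3):
    """Pair lines from two line iterables that share their first `ncols`
    whitespace-separated columns. Each input is read exactly once."""
    index = {}
    for line in first:
        fields = line.split()
        if len(fields) >= ncols:
            # Assumption: keys are unique in `first`; a repeated key
            # overwrites the earlier line.
            index[tuple(fields[:ncols])] = line.strip()

    joined = []
    for line in second:
        fields = line.split()
        if len(fields) < ncols:
            continue  # skip blank or short lines
        match = index.get(tuple(fields[:ncols]))
        if match is not None:
            joined.append(match + " " + line.strip())
    return joined


if __name__ == "__main__":
    # In-memory stand-ins for the two input files.
    first = io.StringIO("a 1 2 left\nb 3 4 left\n")
    second = io.StringIO("b 3 4 right\nc 5 6 right\n")
    for row in join_on_columns(first, second):
        print(row)
```

With real data, the two arguments would be the open file objects, for example the results of open(sys.argv[1]) and open(sys.argv[2]).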

So, this is what I ended up doing. It works and is really efficient :)
I hope it helps somebody later on.
Cheers, and thanks for all the help.

#! /usr/bin/python

import sys

f1 = file(sys.argv[1])
f2 = file(sys.argv[2])

d1 = {}
for l1 in f1:
    k = l1.split()
    key = k[0] + k[1].zfill(4) + k[2].zfill(4)
    d1[key] = l1

d2 = {}
for l2 in f2:
    k2 = l2.split()
    key2 = k2[0] + k2[1].zfill(4) + k2[2].zfill(4)
    d2[key2] = l2

for i in d1.keys():
    if d2.has_key(i):
        print d1[i].strip()
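
A caveat on concatenated keys: zfill(4) pads short values but leaves longer ones unchanged, so the key is only unambiguous as long as columns 1 and 2 never exceed four characters. The Python 3 sketch below uses two made-up rows to show a collision and contrasts it with a tuple key; `concat_key` and `tuple_key` are hypothetical helper names.

```python
def concat_key(fields):
    """Key built as in the post above: column 0 plus zero-padded columns 1 and 2."""
    return fields[0] + fields[1].zfill(4) + fields[2].zfill(4)


def tuple_key(fields):
    """Key built from the same three columns, kept as separate elements."""
    return tuple(fields[:3])


# Hypothetical rows whose column 1 differs in length.
row_a = ["a", "12345", "6"]
row_b = ["a1", "2345", "6"]

print(concat_key(row_a), concat_key(row_b))   # both give the same string
print(tuple_key(row_a) == tuple_key(row_b))   # the tuples stay distinct
```

Whether this matters depends on the data. If the values always fit in four characters, the concatenated key behaves as intended.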

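Or you might do something like this (a with statement that opens several files at once, like the dict and set comprehensions, needs Python 2.7 or later; on older versions, nest the with statements and build the dict and set with loops)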

#! /usr/bin/python
from __future__ import print_function
import sys
import time

def produce_key(line):
    return '%s%04s%04s' % tuple(line.split(None, 3)[:3])

t0 = time.time()

with open(sys.argv[1]) as firstfile, open(sys.argv[2]) as secondfile:
    hashedfile = {produce_key(line): line for line in firstfile}
    second = {produce_key(line) for line in secondfile}

print(''.join(hashedfile[key] for key in hashedfile if key in second))
print(1000 * (time.time() - t0), 'ms')
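
One detail worth knowing about produce_key: with the %s conversion, the 0 flag does not pad with zeros, so '%04s' right-aligns the value with spaces. The keys are still built consistently for both files, but they differ from the zfill(4) keys in the earlier post. A small Python 3 check follows; the variant `produce_key_zfill` is a hypothetical rewrite and not part of the original reply.

```python
padded_with_percent = '%04s' % '7'   # 0 flag is ignored for %s: space padding
padded_with_zfill = '7'.zfill(4)     # explicit zero padding

print(repr(padded_with_percent))
print(repr(padded_with_zfill))


def produce_key_zfill(line):
    """Hypothetical variant of produce_key that pads with zeros."""
    # Assumption: the line has at least three whitespace-separated columns.
    first, second, third = line.split(None, 3)[:3]
    return first + second.zfill(4) + third.zfill(4)


print(produce_key_zfill("chr1 7 42 rest of line\n"))
```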