Hi all,

I have two tab-delimited files. I want to compare the first column of testfile1 with the first column of testfile2 to find which items in file2 are also in file1, and write them to a new file.

I have the following code, but it is not working!! :(

f1 = open('testfile1.txt')
#f2 = open('testfile2.txt')

for line in f1:
    a = line.split()
    list1 = a[0].split()
    print list1
print "now printing list 2"
    f2 = open('testfile2.txt')
    for line in f2:
        b = line.split()
        list2 = b[0].split()
        print list2
    
    for i,e1 in enumerate(list1):
        for e2 in (list2):
            if e2 in e1:
                print ("line %d : %s" % (i,e1))
            else:
                print"No matching entries"
             
f1.close()
f2.close()

Any help?

Try this code, which uses a generator expression and a list comprehension:

from pprint import pprint

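def records(filename):
    """Generate (word, line) pairs from the file, where word is the first column."""
    # Assumes every line contains a tab; otherwise find() returns -1
    # and the slice would drop the last character of the line instead.
    return ((line[:line.find('\t')], line) for line in open(filename))

L1 = list(records('testfile1.txt'))
D1 = dict(L1)  # maps first-column word -> full line

assert len(L1) == len(D1)  # check that keys are unique in the first file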

pprint(D1)

result = [(word, line) for (word, line) in records('testfile2.txt') if word in D1]

pprint(result)
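
The assert above works because a dict keeps only one value per key, so any repeated first-column value makes the dict shorter than the list. Here is a tiny sketch with made-up rows (the names rows and table are only for illustration, they are not from the code above):

```python
# Hypothetical rows as (first_column, full_line) pairs; "b" appears twice.
rows = [("a", "a\t1\n"), ("b", "b\t2\n"), ("b", "b\t3\n")]

table = dict(rows)   # later pairs overwrite earlier ones with the same key

has_duplicates = len(rows) != len(table)
print(has_duplicates)   # True, because the two "b" rows collapsed into one
print(repr(table["b"])) # the last line seen for key "b"
```

With duplicate keys in testfile1.txt, the assert would fail instead of silently dropping lines.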

Thanks a lot for the code! It works very well.

Is there any other, simpler way to do it? As a beginner, it seems a bit complicated to rewrite the code on my own.

Any help is greatly appreciated!

Yes, you could write it this way:

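def records(filename):
    """Yield (word, line) pairs, where word is the first column."""
    for line in open(filename):
        index = line.find('\t')  # position of the first tab
        word = line[:index]      # text before it, i.e. the first column
        yield (word, line)

D1 = dict(records('testfile1.txt'))

for word, line in records('testfile2.txt'):
    if word in D1:
        pass  # etc... do something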

You should learn about the yield statement if you don't know it yet. It is very powerful!
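
In case yield is new to you, here is a small sketch of what it does. The function name first_columns and the sample lines are made up for illustration; the point is only that the function body runs lazily, producing one item each time the caller asks for the next one.

```python
def first_columns(lines):
    """Yield the first tab-separated field of each line, one at a time."""
    for line in lines:
        # Execution pauses at yield and resumes here on the next request.
        yield line.split('\t', 1)[0]

# Hypothetical sample data standing in for lines read from a file.
sample = ["gene1\t10\n", "gene2\t20\n", "gene3\t30\n"]

gen = first_columns(sample)
first = next(gen)   # runs the body up to the first yield -> "gene1"
rest = list(gen)    # consumes whatever is left -> ["gene2", "gene3"]
```

Nothing inside first_columns runs until next() or a loop asks for a value, which is why such a function can walk through a large file without loading it all at once.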

You can split each record at the tab and put the first fields of the first file into a set, then go through the second file and put every record whose key field is in that set into the result. Gribouillis did well to check that the keys are unique, since a text file is not a database and does not enforce uniqueness.

It is good practice to use generators and list comprehensions, as they are a pythonic and efficient way of doing things. If you do not like them, you can change them to normal loops easily enough.

result = (line.split()
          for  key in set(line.split('\t',1)[0] for line in open('testfile1.txt'))
          for  line in open('testfile2.txt')
          if line.startswith(key+'\t')
          )
for same in sorted(result):
    print(same)
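
For anyone who prefers normal loops, the same idea might be written roughly as below. This is only a sketch under a few assumptions: the function names matching_lines and write_matches are invented here, the matching lines are kept whole and in file order (not split and sorted as in the generator above), and a line without any tab is treated as consisting of a key only.

```python
def matching_lines(path1, path2):
    """Return the lines of path2 whose first column also occurs in path1."""
    keys = set()
    with open(path1) as f1:
        for line in f1:
            # First tab-separated field of each line in the first file.
            keys.add(line.rstrip('\n').split('\t', 1)[0])

    matches = []
    with open(path2) as f2:
        for line in f2:
            key = line.rstrip('\n').split('\t', 1)[0]
            if key in keys:          # set lookup, no need to rescan file 1
                matches.append(line)
    return matches


def write_matches(path1, path2, out_path):
    """Write the matching lines to out_path and return how many were written."""
    matches = matching_lines(path1, path2)
    with open(out_path, 'w') as out:
        out.writelines(matches)
    return len(matches)
```

Calling write_matches('testfile1.txt', 'testfile2.txt', 'matches.txt') would then create the new file asked for in the first post; 'matches.txt' is just a placeholder name.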

Fantastic! Easy, and it worked so well. Perfect!

Thank you all!

Note, though, that after the loop the result generator is exhausted. If you need the values more than once, change result from a generator expression to a list comprehension by replacing the outer () with [].
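
The difference is easy to see in a small sketch. The names and numbers here are made up and have nothing to do with the test files; only the () versus [] matters.

```python
squares_gen = (n * n for n in range(4))    # generator expression
first_pass = list(squares_gen)             # [0, 1, 4, 9]
second_pass = list(squares_gen)            # [] because the generator is used up

squares_list = [n * n for n in range(4)]   # list comprehension
again_1 = list(squares_list)               # [0, 1, 4, 9]
again_2 = list(squares_list)               # [0, 1, 4, 9] again, the list is kept
```

The list version holds all values in memory, so for very large files the generator may still be the better choice if one pass is enough.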