Comparing two files line by line

Question

chavanak 0 Newbie Poster

15 Years Ago

Hi,
I am comparing 2000 files with one other file. I want the program to go through each line in both files and compare them. If a line is present in both, it has to be written to another file. What I tried was to open both files and use readlines() to read them into lists. Then I used a for loop like this:

# Python 2 code (uses print >>, izip and string.strip)
import os
import string
from itertools import izip

chain_sep = []
complex_file = open("1complex.txt", "r")
complex_lines = complex_file.readlines()
complex_lines = map(string.strip, complex_lines)
splitter = [s.split('\t') for s in complex_lines]
complex_file.close()

# Collect the names of all .pd files in the current directory.
for file in os.listdir("."):
    basename = os.path.basename(file)
    if basename.endswith(".pd"):
        chain_sep.append(basename)

for (i, s) in izip(chain_sep, splitter):
    fhandle_6 = open(i, "r")
    from_pd = fhandle_6.readlines()
    from_pd = map(string.strip, from_pd)
    fhandle_6.close()
    fhandle_13 = open(s[0] + ".cr", 'r')
    fhandle_13_l = fhandle_13.readlines()
    fhandle_13_l = map(string.strip, fhandle_13_l)
    fhandle_13.close()
    fopen_7 = open(i + "r.pdb", "w")
    fopen_8 = open(i + "l.pdb", "w")
    for (a, y) in izip(from_pd, fhandle_13_l):  # from_pd and fhandle_13_l are not of the same length :(
        if a[0:4] == "ATOM":
            if a[21] == "R":
                print >>fopen_7, a
            else:
                if a[7:13] == y[7:13]:
                    print >>fopen_8, a
    fopen_7.close()
    fopen_8.close()

The above code is only a chunk, by the way. My problem is that the two files are not the same size, so I feel that using zip or izip is not ideal in this situation. A part of the files I have to deal with is below:

file-1
ATOM   2197  [b]CB  CYS I  51[/b]      38.091 -13.002   6.320  1.00 20.12
ATOM   2198  [b]SG  CYS I  51[/b]      39.781 -12.827   5.691  1.00 26.67
ATOM   2199  [b]N   MET I  52[/b]      37.845 -15.766   5.722  1.00 33.08
ATOM   2200  [b]CA  MET I  52[/b]      38.312 -17.144   5.674  1.00 33.08

file-2
ATOM   2197  [b]O   ASP L  50[/b]      18.653  89.329  84.802  1.00  0.00
ATOM   2198  [b]CB  ASP L  50[/b]      16.004  87.278  84.523  1.00  0.00
ATOM   2199  [b]CG  ASP L  50[/b]      15.349  86.109  85.277  1.00  0.00
ATOM   2200  [b]OD1 ASP L  50[/b]      15.347  85.935  86.514  1.00  0.00

The only part that is common to both files is the one in bold (the above is just a chunk of the data). So ideally I am supposed to compare the bold data from file 1 and, if it exists in file 2, retain it and remove the remaining data.
For e.g.:

[b]CB  CYS I  51[/b]
[b]CB  CYS I  51[/b]

If the above entry is there in both files, then I have to retain it in file-2 and remove all other entries. I tried to add the required list position to the sample code you gave me, but I failed to get the results. Please let me know whether I can pick out the above data and, if so, how. I tried the same in Perl and was able to do it very easily, but in Python it is proving harder, as I am very new to Python (I have been learning it for the past week or so).
Cheers,
Chav
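
The concern about izip can be illustrated with a short sketch (Python 3, where the built-in zip behaves like Python 2's izip). The two lists are made-up stand-ins for the key fields of the two files, not data from the thread. zip stops at the shorter input and only compares items at the same position, whereas a set lookup depends on neither position nor length.

```python
# Hypothetical stand-ins for the key fields read from two files of different length.
file1_keys = ["CB CYS I 51", "SG CYS I 51", "N MET I 52"]
file2_keys = ["N MET I 52", "CB CYS I 51"]

# zip pairs items by position and stops at the shorter list,
# so the third entry of file1_keys is never examined.
pairs = list(zip(file1_keys, file2_keys))
positional_matches = [a for a, b in pairs if a == b]

# A set lookup ignores position and length entirely.
lookup = set(file1_keys)
membership_matches = [k for k in file2_keys if k in lookup]

print(len(pairs))            # number of pairs zip produced
print(positional_matches)    # matches found by position
print(membership_matches)    # matches found by membership
```

Here the positional comparison finds nothing even though both entries of the shorter list occur in the longer one, which is the situation described in the question.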

python


Gribouillis 1,391 Programming Explorer

15 Years Ago

I think you could try something like this

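def key(line):
    # Fields 2..5 of the line: atom name, residue name, chain, residue number.
    return tuple(line.strip().split()[2:6])

def make_key_set(file_path):
    # Set of the keys of all lines in the file.
    with open(file_path) as f:
        return set(key(line) for line in f)


def filtered_lines(file_path1, file_path2):
    # Yield the lines of file_path1 whose key also occurs in file_path2.
    key_set = make_key_set(file_path2)
    with open(file_path1) as f:
        for line in f:
            if key(line) in key_set:
                yield line

if __name__ == "__main__":
    file3 = open("file3", "w")
    for line in filtered_lines("file1", "file2"):
        file3.write(line)
    file3.close()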

(I used "file1", "file2" for the source files and "file3" for the destination file).
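
To see the idea at work without touching any files, here is a minimal sketch in Python 3 that applies the same key-and-set approach to lists of lines held in memory. The reference lines are copied from the file-1 sample in the question (without the bold markers); the candidate lines are invented for the example, and the function signature taking lists instead of paths is an assumption made for illustration.

```python
def key(line):
    # Fields 2..5 after whitespace splitting: atom name, residue, chain, residue number.
    return tuple(line.split()[2:6])

def filtered_lines(lines_to_filter, reference_lines):
    # Keep the lines whose key also occurs among the reference lines.
    key_set = {key(line) for line in reference_lines}
    return [line for line in lines_to_filter if key(line) in key_set]

# Reference lines: copied from the file-1 sample in the question.
reference = [
    "ATOM   2197  CB  CYS I  51      38.091 -13.002   6.320  1.00 20.12\n",
    "ATOM   2198  SG  CYS I  51      39.781 -12.827   5.691  1.00 26.67\n",
]
# Lines to filter: invented for this example (different serials and coordinates).
candidates = [
    "ATOM   1801  O   ASP L  50      18.653  89.329  84.802  1.00  0.00\n",
    "ATOM   1802  CB  CYS I  51      16.004  87.278  84.523  1.00  0.00\n",
]

kept = filtered_lines(candidates, reference)
print(kept)
```

Only the second candidate survives: its serial number and coordinates differ from the reference, but its key fields match.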


Gribouillis 1,391 Programming Explorer

15 Years Ago

Here is a script which should work for 1 .pdb file and many .pd files. It creates new versions of the .pd files (with fewer lines) in a separate output directory. See the usage message at the end of the script. Let me know if it works!

#!/usr/bin/env python

import os
from os.path import join as pjoin, isdir, isfile

class DataFile(object):
    def __init__(self, path):
        self.path = path

    def lines(self):
        for line in open(self.path):
            yield line 

    @staticmethod
    def key(line):
        return tuple(line.strip().split()[2:6])

class Pdb(DataFile):

    def __init__(self, path):
        DataFile.__init__(self, path)
        self._keys = None

    def keys(self):
        if self._keys is None:
            self._keys = set(self.key(line) for line in open(self.path))
        return self._keys

class Pd(DataFile):
    def __init__(self, path):
        DataFile.__init__(self, path)

    def filtered_lines(self, pdb):
        keys = pdb.keys()
        for line in self.lines():
            if self.key(line) in keys:
                yield line

    def dump(self, output, pdb):
        for line in self.filtered_lines(pdb):
            output.write(line)
        output.close()

class Output(DataFile):
    def __init__(self, path):
        DataFile.__init__(self, path)
        self.ofile = open(path, "w")

    def write(self, *args):
        return self.ofile.write(*args)

    def close(self):
        self.ofile.close()

class Directory(object):
    def __init__(self, path):
        self.path = path

    def names(self, extension=None):
        for name in os.listdir(self.path):
            if extension is None or name.endswith(extension):
                yield name

def process(pdb_file, pd_dir, output_dir):
    if isdir(output_dir):
        raise ValueError("Output dir must not exist")
    else:
        os.mkdir(output_dir)
    pdb = Pdb(pdb_file)
    pd_dir = Directory(pd_dir)
    for name in pd_dir.names(".pd"):
        pd = Pd(pjoin(pd_dir.path, name))
        output = Output(pjoin(output_dir, name))
        pd.dump(output, pdb)

def main():
    from sys import argv
    usage = """Usage:
<executable> pdb_file pd_dir output_dir
where
  * pdb_file: an existing .pdb file
  * pd_dir: an existing directory containing .pd files
  * output_dir: a NON existing directory
"""
    # Expect exactly three arguments after the script name.
    if len(argv) != 4:
        raise RuntimeError(usage)
    pdb_file, pd_dir, output_dir = argv[1:]
    # Validate the arguments before doing any work.
    if not (pdb_file.endswith(".pdb") and isfile(pdb_file)
            and isdir(pd_dir) and not isdir(output_dir)):
        raise RuntimeError(usage)
    process(pdb_file, pd_dir, output_dir)

if __name__ == "__main__":
    main()
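
For readers on Python 3, the same pipeline can be sketched more compactly with pathlib. This is a port under assumptions, not the script above: the function name filter_pd_files and the file contents in the trial run are made up for illustration.

```python
import tempfile
from pathlib import Path

def key(line):
    # Atom name, residue name, chain and residue number (whitespace-separated fields 2..5).
    return tuple(line.split()[2:6])

def filter_pd_files(pdb_file, pd_dir, output_dir):
    """Write, for each .pd file in pd_dir, a copy in output_dir that keeps
    only the lines whose key also occurs in pdb_file."""
    output_dir = Path(output_dir)
    output_dir.mkdir()  # raises FileExistsError if the directory already exists
    with open(pdb_file) as f:
        keys = {key(line) for line in f}
    for pd_path in sorted(Path(pd_dir).glob("*.pd")):
        with open(pd_path) as src, open(output_dir / pd_path.name, "w") as dst:
            dst.writelines(line for line in src if key(line) in keys)

# Small self-contained trial run in a temporary directory (file contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "ref.pdb").write_text(
        "ATOM   1802  N   PRO L   2      40.689 -17.278  -4.343  1.00 32.48\n"
    )
    pd_dir = tmp / "pd"
    pd_dir.mkdir()
    (pd_dir / "xyz.pd").write_text(
        "ATOM   1800  N   ARG L   1      29.039  96.534  82.816  1.00  0.00\n"
        "ATOM   1801  N   PRO L   2      28.016  95.599  82.279  1.00  0.00\n"
    )
    filter_pd_files(tmp / "ref.pdb", pd_dir, tmp / "out")
    result = (tmp / "out" / "xyz.pd").read_text()

print(result)
```

As in the script, the reference keys are read once and reused for every .pd file, so the cost of the 2000 comparisons stays proportional to the total number of lines.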

chavanak commented: One person whom I was able to rely on!! +0


chavanak 0 Newbie Poster · Answer 1 · 2009-11-04T18:51:56+00:00

Hi,
Thanks for the help, but what does file_path mean? Should I set it to the file name? Sorry for sounding like an idiot, but I am very new to Python. Also, I have to run this over 10 vs. 2000 files: there are 10 file1s and 2000 file2s, so would a for loop help? Or is there another way?
Again, sorry if I sound ignorant, but I am a newbie to Python.
Cheers

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 2 · 2009-11-04T20:36:01+00:00

If you have 10 file1s and 2000 file2s, that makes 20000 file3s (output files). You should explain clearly what you want:
* Are the file1s in a separate directory? How do you recognize that a file is a file1?
* Same questions for the file2s.
* Do you really want 20000 output files? What should their names be? Should they be in separate directories?
Also, I think I swapped file1 and file2 in the code. What should the output file contain:
* the lines from file1 which don't appear in file2, OR
* the lines from file2 which don't appear in file1?
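
The count above can be checked with a short Python 3 sketch that enumerates every (file1, file2) pair. The file names and the output naming scheme are purely hypothetical, since the thread does not say how the outputs should be named.

```python
from itertools import product
from pathlib import Path

def output_name(file1, file2):
    # Hypothetical naming scheme: "<file1 stem>__<file2 stem>.out".
    return f"{Path(file1).stem}__{Path(file2).stem}.out"

# Made-up file names standing in for the 10 file1s and 2000 file2s.
file1_names = [f"ref{i}.pdb" for i in range(10)]
file2_names = [f"model{j}.pd" for j in range(2000)]

# Every (file1, file2) combination yields one output file.
pairs = list(product(file1_names, file2_names))
names = [output_name(a, b) for a, b in pairs]

print(len(pairs))
print(names[0])
```

Building the name from both stems keeps the outputs distinct, so they could all share one directory if that is what is wanted.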

chavanak 0 Newbie Poster · Answer 3 · 2009-11-05T04:08:48+00:00

Hi,
Thanks for the reply, and I am really sorry for not making things clear. Please bear with me:
I am supposed to compare one file, named 2pka.pdb, whose content is

ATOM   1802  [B]N   PRO L   2[/B]      40.689 -17.278  -4.343  1.00 32.48 
ATOM   1803  [B]CA  PRO L   2[/B]      40.760 -16.172  -5.353  1.00 32.48 
ATOM   1804  [B]C   PRO L   2[/B]      42.037 -15.324  -5.293  1.00 32.48 
ATOM   1805  [B]O   PRO L   2[/B]      42.846 -15.431  -4.323  1.00 32.48 
ATOM   1806  [B]CB  PRO L   2[/B]      39.472 -15.307  -5.238  1.00 32.48 
ATOM   1807  [B]CG  PRO L   2[/B]      38.617 -15.946  -4.122  1.00 41.73 
ATOM   1808  [B]CD  PRO L   2[/B]      39.405 -17.165  -3.579  1.00 41.73 
ATOM   1809  [B]N   ASP L   3[/B]      42.334 -14.759  -6.453  1.00 39.27 
ATOM   1810  [B]CA  ASP L   3[/B]      43.625 -14.117  -6.708  1.00 39.27

with 2000 files, created automatically by another program and named like "xyz.pd", which contain:

ATOM   1800  [B]N   ARG L   1[/B]      29.039  96.534  82.816  1.00  0.00 
ATOM   1801  [B]CA  ARG L   1[/B]      28.016  95.599  82.279  1.00  0.00 
ATOM   1802  [B]C   ARG L   1[/B]      28.732  94.465  81.540  1.00  0.00 
ATOM   1803  [B]O   ARG L   1[/B]      29.697  94.738  80.746  1.00  0.00 
ATOM   1804  [B]CB  ARG L   1[/B]      27.034  96.335  81.376  1.00  0.00 
ATOM   1805  [B]CG  ARG L   1[/B]      26.607  97.675  81.981  1.00  0.00 
ATOM   1806  [B]CD  ARG L   1[/B]      25.840  97.442  83.279  1.00  0.00

All the 2000 files that are created by another program follow the same format shown above. Now each of these 2000 files has to be compared with the first file. The problem here is I can only use the bolded values to compare both the files. I cannot use other terms as they surely will differ. So my program has to open both the files and compare only the bolded elements and if the comparison is false, it has to remove the line from the second file. Hope you got what I am trying to explain. Waiting for your reply
Cheers,
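
Because only the bolded fields may be compared, one option is to take them by column position instead of splitting on whitespace. The Python 3 sketch below assumes the files follow the standard fixed-column PDB ATOM layout, which the sample lines above appear to do; column_key is a hypothetical helper name. The first two lines are taken from the samples above (bold markers removed), and the third is invented to show a case where whitespace splitting breaks.

```python
def column_key(line):
    # 0-based slice 12:26 covers atom name, alternate-location flag,
    # residue name, chain ID and residue number of a PDB ATOM record.
    return line[12:26]

pdb_line = "ATOM   1802  N   PRO L   2      40.689 -17.278  -4.343  1.00 32.48"
pd_line = "ATOM   1800  N   ARG L   1      29.039  96.534  82.816  1.00  0.00"

# Invented line: a four-digit residue number touches the chain ID,
# so whitespace splitting fuses the two fields into "A1000".
fused_line = "ATOM   9999  CA  ALA A1000      10.000  10.000  10.000  1.00  0.00"

print(repr(column_key(pdb_line)))
print(column_key(pdb_line) == column_key(pd_line))
print(fused_line.split()[2:6])
print(repr(column_key(fused_line)))
```

With fixed columns the key for the invented line still contains exactly the four fields, while the split-based key picks up a coordinate in place of the residue number.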

chavanak 0 Newbie Poster · Answer 4 · 2009-11-06T01:50:32+00:00

Thanks a ton!!! It worked though I had to change a few minor things!!!
Thanks again mate :)