Comparing two files line by line
Hi,
I am comparing 2000 files with one other file. I want the program to go through each line in both files and compare them. If a line is present in both, it has to be written to another file. What I tried was to open both files and use readlines() to read them into lists. Then I used a for loop like this:

```python
# Python 2 code; the imports below are implied by the names used in the chunk.
import os
import string
from itertools import izip

chain_sep = []
complex_file = open("1complex.txt", "r")
complex_lines = complex_file.readlines()
complex_lines = map(string.strip, complex_lines)
splitter = [s.split('\t') for s in complex_lines]
complex_file.close()

for file in os.listdir("."):
    basename = os.path.basename(file)
    if basename.endswith(".pd"):
        chain_sep.append(basename)

for (i, s) in izip(chain_sep, splitter):
    fhandle_6 = open(i, "r")
    from_pd = fhandle_6.readlines()
    from_pd = map(string.strip, from_pd)
    fhandle_6.close()
    fhandle_13 = open(s[0] + ".cr", 'r')
    fhandle_13_l = fhandle_13.readlines()
    fhandle_13_l = map(string.strip, fhandle_13_l)
    fhandle_13.close()
    fopen_7 = open(i + "r.pdb", "w")
    fopen_8 = open(i + "l.pdb", "w")
    for (a, y) in izip(from_pd, fhandle_13_l):  # from_pd and fhandle_13_l are not of the same length :(
        if a[0:4] == "ATOM":
            if a[21] == "R":
                print >>fopen_7, a
            else:
                if a[7:13] == y[7:13]:
                    print >>fopen_8, a
    fopen_7.close()
    fopen_8.close()
```

The above code is only a chunk, by the way. My problem is that the two files are not the same size, so I feel that using zip or izip is not ideal in this situation. A part of the files I have to deal with is below:

file-1
```
ATOM 2197 CB CYS I 51 38.091 -13.002 6.320 1.00 20.12
ATOM 2198 SG CYS I 51 39.781 -12.827 5.691 1.00 26.67
ATOM 2199 N MET I 52 37.845 -15.766 5.722 1.00 33.08
ATOM 2200 CA MET I 52 38.312 -17.144 5.674 1.00 33.08
```

file-2
```
ATOM 2197 O ASP L 50 18.653 89.329 84.802 1.00 0.00
ATOM 2198 CB ASP L 50 16.004 87.278 84.523 1.00 0.00
ATOM 2199 CG ASP L 50 15.349 86.109 85.277 1.00 0.00
ATOM 2200 OD1 ASP L 50 15.347 85.935 86.514 1.00 0.00
```

The only part that can be common to both files is the one in bold (the atom name, residue name, chain ID and residue number columns; the above is just a chunk of the data). So ideally I am supposed to take the bold data from file 1 and, if it exists in file 2, retain it and remove the remaining data.
For example:

```
CB CYS I 51
```

If the above entry is there in both files, then I have to retain it in file-2 and remove all other entries. I tried to add the required list position to the sample code you gave me, but I failed to get the results. Please let me know whether I can pick out the above data and, if so, how I can do it. I tried the same in Perl and was able to do it very easily, but in Python it is proving tougher for me as I am very new to Python (I have been learning it for the past week or so).
Cheers,
Chav
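A side note on the zip/izip concern raised above: positional pairing stops at the shorter sequence and only ever compares line i with line i, so a match sitting at a different position is missed. The sketch below illustrates this with made-up identifier lists (not data from the actual files). It is written for Python 3, where the built-in `zip` behaves like Python 2's `izip`.

```python
# Hypothetical sample data: 4-field identifiers from two files of different length.
file1_keys = ["CB CYS I 51", "SG CYS I 51", "N MET I 52"]
file2_keys = ["N MET I 52", "CB CYS I 51"]

# Positional pairing: yields only 2 pairs and compares line i with line i.
positional_matches = [a for a, b in zip(file1_keys, file2_keys) if a == b]

# Membership test against a set: order and length no longer matter.
file2_set = set(file2_keys)
membership_matches = [a for a in file1_keys if a in file2_set]

print(positional_matches)   # []
print(membership_matches)   # ['CB CYS I 51', 'N MET I 52']
```

This is why a set lookup is probably a better fit here than any variant of zip.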
#2 19 Days Ago
I think you could try something like this
(I used "file1", "file2" for the source files and "file3" for the destination file).
```python
def key(line):
    # Whitespace-separated fields 2..5 of a record:
    # atom name, residue name, chain ID, residue number.
    return tuple(line.strip().split()[2:6])

def make_key_set(file_path):
    # Collect the keys of every line in the file.
    with open(file_path) as f:
        return set(key(line) for line in f)

def filtered_lines(file_path1, file_path2):
    # Yield the lines of file_path1 whose key also occurs in file_path2.
    key_set = make_key_set(file_path2)
    with open(file_path1) as f:
        for line in f:
            if key(line) in key_set:
                yield line

if __name__ == "__main__":
    with open("file3", "w") as file3:
        for line in filtered_lines("file1", "file2"):
            file3.write(line)
```
Last edited by Gribouillis; 19 Days Ago at 8:28 am.
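To see what the `key` function in the reply above extracts, here is a small sketch applied to two records copied from the first post. It assumes the fields are separated by whitespace, which holds for the samples in this thread but is not guaranteed for every PDB file, since the format is column-based and adjacent fields can run together.

```python
def key(line):
    # Same idea as in the reply: whitespace-split fields 2..5
    # (atom name, residue name, chain ID, residue number).
    return tuple(line.strip().split()[2:6])

# Records copied from the file-1 and file-2 samples in the first post.
rec1 = "ATOM 2197 CB CYS I 51 38.091 -13.002 6.320 1.00 20.12"
rec2 = "ATOM 2198 CB ASP L 50 16.004 87.278 84.523 1.00 0.00"

print(key(rec1))               # ('CB', 'CYS', 'I', '51')
print(key(rec2))               # ('CB', 'ASP', 'L', '50')
print(key(rec1) == key(rec2))  # False: residue, chain and number differ
```

Because the keys are tuples, they are hashable and can be stored in a set for fast lookup.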
#3 19 Days Ago
Hi,
Thanks for the help, but what does file_path mean? Should I set it to the file name? Sorry for sounding like an idiot, but I am too new to Python. Also, I have to run this over 10 vs. 2000 files. I mean there are 10 file1s and 2000 file2s, so will using a for loop help? Or is there another way?
Again, sorry if I sound ignorant, but I am a newbie to Python.
Cheers
#4 19 Days Ago
If you have 10 file1s and 2000 file2s, that makes 20000 file3s (output files). You should explain clearly what you want:
* Are the file1s in a separate directory? How do you recognize that a file is a file1?
* Same questions for the file2s.
* Do you really want 20000 output files? What should their names be? Should they be in separate directories?
Also, I think I swapped file1 and file2 in the code. What should the output file contain:
* the lines from file1 which don't appear in file2, OR
* the lines from file2 which don't appear in file1?
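The alternatives in that last question differ only in which file is iterated, which one supplies the lookup set, and whether matching or non-matching lines are kept. A rough sketch on in-memory lists follows; the `select` helper and the sample records are made up for illustration and do not come from the thread.

```python
def key(line):
    # Whitespace-split fields 2..5, as in the earlier reply.
    return tuple(line.split()[2:6])

def select(lines, reference_lines, keep_matching=True):
    """Return the lines whose key is (or is not) present in reference_lines."""
    ref_keys = set(key(l) for l in reference_lines)
    return [l for l in lines if (key(l) in ref_keys) == keep_matching]

# Hypothetical records: only "CB CYS I 51" occurs in both lists.
file1 = ["ATOM 1 CB CYS I 51 0.0 0.0 0.0", "ATOM 2 SG CYS I 51 0.0 0.0 0.0"]
file2 = ["ATOM 9 CB CYS I 51 1.0 1.0 1.0", "ATOM 10 N MET I 52 1.0 1.0 1.0"]

only_in_file1 = select(file1, file2, keep_matching=False)
only_in_file2 = select(file2, file1, keep_matching=False)
file2_common = select(file2, file1, keep_matching=True)

print(only_in_file1)  # the SG CYS I 51 line
print(only_in_file2)  # the N MET I 52 line
print(file2_common)   # the CB CYS I 51 line, as it appears in file2
```

Which variant is wanted depends on the answers to the questions above.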
#5 19 Days Ago
Hi,
Thanks for the reply, and I am really sorry for not making things clear. Please bear with me:
I am supposed to compare one file named 2pka.pdb, with content

```
ATOM 1802 N PRO L 2 40.689 -17.278 -4.343 1.00 32.48
ATOM 1803 CA PRO L 2 40.760 -16.172 -5.353 1.00 32.48
ATOM 1804 C PRO L 2 42.037 -15.324 -5.293 1.00 32.48
ATOM 1805 O PRO L 2 42.846 -15.431 -4.323 1.00 32.48
ATOM 1806 CB PRO L 2 39.472 -15.307 -5.238 1.00 32.48
ATOM 1807 CG PRO L 2 38.617 -15.946 -4.122 1.00 41.73
ATOM 1808 CD PRO L 2 39.405 -17.165 -3.579 1.00 41.73
ATOM 1809 N ASP L 3 42.334 -14.759 -6.453 1.00 39.27
ATOM 1810 CA ASP L 3 43.625 -14.117 -6.708 1.00 39.27
```

with 2000 files created automatically by another program, named "xyz.pd", containing:

```
ATOM 1800 N ARG L 1 29.039 96.534 82.816 1.00 0.00
ATOM 1801 CA ARG L 1 28.016 95.599 82.279 1.00 0.00
ATOM 1802 C ARG L 1 28.732 94.465 81.540 1.00 0.00
ATOM 1803 O ARG L 1 29.697 94.738 80.746 1.00 0.00
ATOM 1804 CB ARG L 1 27.034 96.335 81.376 1.00 0.00
ATOM 1805 CG ARG L 1 26.607 97.675 81.981 1.00 0.00
ATOM 1806 CD ARG L 1 25.840 97.442 83.279 1.00 0.00
```

All 2000 files created by the other program follow the format shown above. Each of these 2000 files has to be compared with the first file. The problem is that I can only use the bolded values to compare the files; I cannot use the other fields, as they will surely differ. So my program has to open both files, compare only the bolded elements and, if the comparison is false, remove the line from the second file. I hope this explains what I am trying to do. Waiting for your reply.
Cheers,
#6 19 Days Ago
Here is a script which should work for one .pdb file and many .pd files. It creates new versions of the .pd files (with fewer lines) in a separate output directory. See the usage at the end of the script. Let me know if it works!
```python
#!/usr/bin/env python
import os
from os.path import join as pjoin, isdir, isfile


class DataFile(object):
    def __init__(self, path):
        self.path = path

    def lines(self):
        # Iterate over the lines of the file, closing it when done.
        with open(self.path) as f:
            for line in f:
                yield line

    @staticmethod
    def key(line):
        # Fields 2..5: atom name, residue name, chain ID, residue number.
        return tuple(line.strip().split()[2:6])


class Pdb(DataFile):
    def __init__(self, path):
        DataFile.__init__(self, path)
        self._keys = None

    def keys(self):
        # Build the key set once and cache it for all .pd files.
        if self._keys is None:
            self._keys = set(self.key(line) for line in self.lines())
        return self._keys


class Pd(DataFile):
    def __init__(self, path):
        DataFile.__init__(self, path)

    def filtered_lines(self, pdb):
        # Keep only the lines whose key also occurs in the .pdb file.
        keys = pdb.keys()
        for line in self.lines():
            if self.key(line) in keys:
                yield line

    def dump(self, output, pdb):
        for line in self.filtered_lines(pdb):
            output.write(line)
        output.close()


class Output(DataFile):
    def __init__(self, path):
        DataFile.__init__(self, path)
        self.ofile = open(path, "w")

    def write(self, *args):
        return self.ofile.write(*args)

    def close(self):
        self.ofile.close()


class Directory(object):
    def __init__(self, path):
        self.path = path

    def names(self, extension=None):
        for name in os.listdir(self.path):
            if extension is None or name.endswith(extension):
                yield name


def process(pdb_file, pd_dir, output_dir):
    if isdir(output_dir):
        raise ValueError("Output dir must not exist")
    else:
        os.mkdir(output_dir)
    pdb = Pdb(pdb_file)
    pd_dir = Directory(pd_dir)
    for name in pd_dir.names(".pd"):
        pd = Pd(pjoin(pd_dir.path, name))
        output = Output(pjoin(output_dir, name))
        pd.dump(output, pdb)


def main():
    from sys import argv
    args = argv[1:]
    # Exactly three arguments are expected; anything else prints the usage.
    ok = (len(args) == 3 and args[0].endswith(".pdb") and isfile(args[0])
          and isdir(args[1]) and not isdir(args[2]))
    if not ok:
        raise RuntimeError("""Usage: <executable> pdb_file pd_dir output_dir
where
* pdb_file: an existing .pdb file
* pd_dir: an existing directory containing .pd files
* output_dir: a NON existing directory
""")
    process(*args)


if __name__ == "__main__":
    main()
```
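For readers who want to try the idea without the class scaffolding, here is a compact sketch of the same workflow (one reference file, many `.pd` files, filtered copies written to a new directory), exercised on temporary files. The function name `filter_directory`, the file names and the records are made up for illustration. This is an approximation of what the script above does, not a replacement for it.

```python
import os
import tempfile

def key(line):
    # Whitespace-split fields 2..5 (atom name, residue name, chain, residue number).
    return tuple(line.split()[2:6])

def filter_directory(pdb_file, pd_dir, output_dir):
    """Write, for each .pd file, only the lines whose key occurs in pdb_file."""
    with open(pdb_file) as f:
        keys = set(key(line) for line in f)
    os.mkdir(output_dir)  # raises an error if the directory already exists
    for name in sorted(os.listdir(pd_dir)):
        if not name.endswith(".pd"):
            continue
        with open(os.path.join(pd_dir, name)) as src, \
             open(os.path.join(output_dir, name), "w") as dst:
            dst.writelines(line for line in src if key(line) in keys)

# Tiny demonstration on temporary files (hypothetical records).
root = tempfile.mkdtemp()
pdb_path = os.path.join(root, "ref.pdb")
pd_dir = os.path.join(root, "pd")
out_dir = os.path.join(root, "out")
os.mkdir(pd_dir)
with open(pdb_path, "w") as f:
    f.write("ATOM 1 N PRO L 2 0.0 0.0 0.0\nATOM 2 CA PRO L 2 0.0 0.0 0.0\n")
with open(os.path.join(pd_dir, "a.pd"), "w") as f:
    f.write("ATOM 7 CA PRO L 2 1.0 1.0 1.0\nATOM 8 N ARG L 1 1.0 1.0 1.0\n")

filter_directory(pdb_path, pd_dir, out_dir)
with open(os.path.join(out_dir, "a.pd")) as f:
    result = f.read()
print(result)  # only the "CA PRO L 2" line survives
```

Writing to a fresh output directory, as the script above also does, leaves the original `.pd` files untouched, which makes it easy to rerun after a mistake.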