I am new to Python programming and struggling with a problem I would like help with. I have multiple text files that I would like to join, using the first column in each file as the key to align the files. Each file could be several hundred lines long. The files SHOULD have the same number of lines. The first line in each file can be omitted in the output file. There may be extra text (Possible superstructure of XX) in one or both files, which can be omitted. The first character in the first column can be dropped as well.

File 1 looks like
>SU
>PD-98059 PD-98059 Tanimoto from SU = 0.129213
>BML-265 BML-265 Tanimoto from SU = 0.163743
>BML-257 BML-257 Tanimoto from SU = 0.156627
>SU 4312 SU 4312 Tanimoto from SU = 1
Possible superstructure of SU
>AG-370 AG-370 Tanimoto from SU = 0.264286
>AG-490 AG-490 Tanimoto from SU = 0.347826

File 2 looks like
>GF
>PD-98059 PD-98059 Tanimoto from GF = 0.118483
>BML-265 BML-265 Tanimoto from GF = 0.164179
>BML-257 BML-257 Tanimoto from GF = 0.213904
>SU 4312 SU 4312 Tanimoto from GF = 0.436364
>AG-370 AG-370 Tanimoto from GF = 0.284848
>AG-490 AG-490 Tanimoto from GF = 0.307692

The output file including headers would look like

ID SU GF
PD-98059 0.129213 0.118483
BML-265 0.163743 0.164179
BML-257 0.156627 0.213904
SU 4312 1 0.436364
AG-370 0.264286 0.284848
AG-490 0.347826 0.307692

At this point I would like to join this output file and add a third column and header. I will need to repeat this process many times building a large text file with the number of columns equal to the number of lines. I am trying to build a distance matrix for another application. I hope someone can find this a challenge and offer a solution. Any help will be appreciated.
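In other words, each input line needs to be reduced to a (name, value) pair and the files joined on name. Here is a rough sketch of what I mean (I am new to Python, so I am only guessing at the code; the parsing rules — drop the leading '>', skip the "Possible superstructure" lines, the name appears twice before "Tanimoto from" — are the ones described above):

```python
def parse(lines):
    """Parse one file's lines into (column_id, {name: value})."""
    col_id = lines[0].lstrip(">").strip()               # header line, e.g. '>SU' -> 'SU'
    values = {}
    for line in lines[1:]:
        line = line.strip()
        if not line.startswith(">"):                    # skip 'Possible superstructure of ...'
            continue
        head = line[1:].rsplit(" Tanimoto from", 1)[0]  # e.g. 'PD-98059 PD-98059'
        name = head[:len(head) // 2].rstrip()           # the name is printed twice
        values[name] = line.rsplit("=", 1)[-1].strip()
    return col_id, values

file_1 = [">SU",
          ">PD-98059 PD-98059 Tanimoto from SU = 0.129213",
          ">SU 4312 SU 4312 Tanimoto from SU = 1",
          "Possible superstructure of SU"]
file_2 = [">GF",
          ">PD-98059 PD-98059 Tanimoto from GF = 0.118483",
          ">SU 4312 SU 4312 Tanimoto from GF = 0.436364"]

id_1, d1 = parse(file_1)
id_2, d2 = parse(file_2)
print("ID %s %s" % (id_1, id_2))
for name in d1:                        # the files should share the same names
    print("%s %s %s" % (name, d1[name], d2.get(name, "NA")))
```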

Hundreds of records is not much in today's world, so you can read each file into a dictionary and go from there. A simple example to associate the two files, because I am too tired to do more today. You can omit some of the unnecessary records from the dictionary, or use it as-is and filter before writing to the third file.

## simulate 2 files read into lists using readlines()
file_1 = ['SU',
'PD-98059 PD-98059 Tanimoto from SU = 0.129213',
'BML-265 BML-265 Tanimoto from SU = 0.163743',
'BML-257 BML-257 Tanimoto from SU = 0.156627',
'SU 4312 SU 4312 Tanimoto from SU = 1',
'AG-370 AG-370 Tanimoto from SU = 0.264286',
'AG-490 AG-490 Tanimoto from SU = 0.347826',
'PD-98060 PD-98059 Tanimoto from SU = 0.129213',
'BML-265 BML-265 Tanimoto from SU = 0.163743',
'BML-257 BML-257 Tanimoto from SU = 0.156627',
'SU 4312 SU 4312 Tanimoto from SU = 1',
'AG-370 AG-370 Tanimoto from SU = 0.264286',
'AG-490 AG-490 Tanimoto from SU = 0.347826',
'PD-98061 PD-98060 Tanimoto from SU = 0.129213',
'BML-265 BML-265 Tanimoto from SU = 0.163743',
'BML-257 BML-257 Tanimoto from SU = 0.156627',
'SU 4312 SU 4312 Tanimoto from SU = 1',
'AG-370 AG-370 Tanimoto from SU = 0.264286',
'AG-490 AG-490 Tanimoto from SU = 0.347826']


file_2 = ['GF',
'PD-98059 PD-98059 Tanimoto from GF = 0.118483',
'BML-265 BML-265 Tanimoto from GF = 0.164179',
'BML-257 BML-257 Tanimoto from GF = 0.213904',
'SU 4312 SU 4312 Tanimoto from GF = 0.436364',
'AG-370 AG-370 Tanimoto from GF = 0.284848',
'AG-490 AG-490 Tanimoto from GF = 0.307692',
'PD-98061 PD-98059 Tanimoto from GF = 0.118483',
'BML-265 BML-265 Tanimoto from GF = 0.164179',
'BML-257 BML-257 Tanimoto from GF = 0.213904',
'SU 4312 SU 4312 Tanimoto from GF = 0.436364',
'AG-370 AG-370 Tanimoto from GF = 0.284848',
'AG-490 AG-490 Tanimoto from GF = 0.307692']
   
def groups(list_in):
    """ break the file into groups of records from "PD" to
        the next "PD"
    """
    return_dict = {}
    group_list = []
    for rec in list_in:
        rec = rec.strip()
        if rec.startswith("PD") and len(group_list):     ## new group
            dict_in = to_dict(group_list, return_dict)
            group_list = []
        group_list.append(rec)

    ## process the final group
    dict_in = to_dict(group_list, return_dict)

    return return_dict

def to_dict(group_list, dict_in):
    """ add to the dictionary
        key = "PD"+number
        values = list of lists = all records associated with this key
    """
    ## the first record contains the "PD" key
    substrs = group_list[0].split()
    key = substrs[0]
    if key in dict_in:
        print "DUPLICATE record", group_list[0]
    else:
        dict_in[key] = []
        ## add all of the records to the dictionary
        ## including the "PD" record
        for rec in group_list:
            dict_in[key].append(rec)

    return dict_in

ID = file_1[0].strip()     ## "SU"
file_1_dict = groups(file_1[1:])

ID += " " + file_2[0].strip()     ## "GF"
file_2_dict = groups(file_2[1:])

print "ID =", ID
## not printed in any particular order
for key in file_1_dict:
    print key
    for rec in file_1_dict[key]:
        print "  ", rec
    if key in file_2_dict:
        for rec in file_2_dict[key]:
            print "     ", rec     ## additional indent
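To carry it through to the joined output the original post asks for, the last step is to pull the number after the '=' out of each record and write one row per shared key. A sketch of just that step — note the flat name-to-record dictionaries here are a simplification of the list-of-records dictionaries built above, and the 'SU'/'GF' headers come from the sample files:

```python
def tanimoto(record):
    """Pull the numeric value out of a '... Tanimoto from X = 0.129213' record."""
    return record.rsplit("=", 1)[-1].strip()

# simplified join: each dictionary maps name -> its full record in one file
dict_1 = {"PD-98059": "PD-98059 PD-98059 Tanimoto from SU = 0.129213",
          "AG-370":   "AG-370 AG-370 Tanimoto from SU = 0.264286"}
dict_2 = {"PD-98059": "PD-98059 PD-98059 Tanimoto from GF = 0.118483",
          "AG-370":   "AG-370 AG-370 Tanimoto from GF = 0.284848"}

rows = ["ID SU GF"]
for key in sorted(dict_1):
    if key in dict_2:                  # keep only keys present in both files
        rows.append("%s %s %s" % (key, tanimoto(dict_1[key]), tanimoto(dict_2[key])))
print("\n".join(rows))                 # or write to the third file
```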

Here is my start also, but it sorts the lines into alphabetical order. Input files are supposed to start with the same letters (here 'file0') and end with '.txt'.

import os
import itertools
ids = []
lines = []
for fn in (f for f in os.listdir(os.curdir) if f.startswith('file0') and f.endswith('.txt')):
    with open(fn) as infile:
        id=next(infile)[1:].rstrip()
        ids.append(id)
        for line in ((line[1:4]+line[4:].split(None, 1)[0], line.rsplit('=',1)[-1].rstrip()) for line in infile):
            lines.append(line)
        if line[0]==id:
            break # ignore superstructure
lines.sort()
# process or write to file
print 'ID',' '.join(ids)
for group,line in itertools.groupby(lines, key=lambda x: x[0]):
    print group, ' '.join(data for _,data in reversed(list(line)))
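A note on the lines.sort() before the groupby: itertools.groupby only groups consecutive items with equal keys, so without the sort one compound would be split across several groups. A tiny illustration (made-up data):

```python
import itertools

data = ["SU", "GF", "SU"]
# without sorting, the two 'SU' items land in separate groups
unsorted_groups = [k for k, g in itertools.groupby(data)]
sorted_groups = [k for k, g in itertools.groupby(sorted(data))]
print(unsorted_groups)  # ['SU', 'GF', 'SU']
print(sorted_groups)    # ['GF', 'SU']
```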

Thanks woooee and tonyjv. I will try these and report back as soon as I can. Be patient with me as I come up to speed.

Post a link to a sample file that we can use for testing, when/if you want to take this thread any further.

I think the preferred way here on DaniWeb is to go to the Advanced view and attach a file. If the file type does not fit, zip it and attach.

I have uploaded two files. One file is the desired output format, and the zip file contains 6 typical files of 1,000-plus lines. Ultimately I would like to merge multiple files. They would start with a common name of mostsim_*.txt. I hope this helps in testing your code. Thank you.

Looks OK to me:

import os
import itertools
ids = []
lines = []
for fn in (f for f in os.listdir(os.curdir) if f.startswith('MostSim') and f.endswith('.txt')):
    with open(fn) as infile:
        name = next(infile)[1:].rstrip()
        ids.append(name)
        for line in ((line[1:].split(None, 1)[0],
                        line.rsplit('= ',1)[-1].rstrip())
                            for line in infile if line.startswith('>')):
                lines.append((name,line))
        if line[0] == name:
            break # ignore superstructure
lines.sort(key = lambda x: x[1][0])
lines = [(a, list(b)) for a, b in itertools.groupby(lines, key=lambda x: x[1][0])]
#print 'Lines begin',lines[0] # debug
# process or write to file
with open('outp_mostsim.txt','w') as outp:
  outp.write('ID\t'+'\t\t'.join(sorted(ids))+'\n')
  for group,line in lines:
      outp.write('%s\t%s' % (group,'\t'.join(b[1] for a,b in sorted(list(line)))+'\n'))

Thank you! This works VERY well and just what I needed. What is the significance of the "extra" indentations on line 10? I will work with this and study it. I appreciate your time and help.

It is a one-line list comprehension which is divided into multiple lines for clarity. You can continue lines, when the expression is in parentheses or square brackets, without the line continuation sign \.

I should not post early in the morning: I meant it is a generator expression, split over multiple lines the way the IDLE environment likes to do it automatically. It is to make the expression more readable, as wide lines are nasty to read and understand.
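To illustrate with a tiny standalone example (made-up data): inside parentheses a generator expression may span several lines with no backslash, which is exactly what the code above does.

```python
# the parentheses of sum() let the generator expression
# continue over three lines without a \ continuation sign
total = sum(len(word)
            for word in ["su", "gf"]
            if word)
print(total)  # 4
```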