I am stuck with large text files that I have to merge before I can continue working with my model.

I tried to follow a previous thread on text merging.

Here is the script which I used:

one = open("one.txt",'r')
two = open("two.txt",'r')

ofh = open("out.txt",'w')

# read in the first file and create a dict: 

d = {}

for line in one:
    # we remove the newline and split the line 
    k = line.strip().split('\t')
    d = k[0]


for line in two:
    result = ""

    h= line.strip().split('\t')[0]
    # for every character in the line ...
    for i in h:
        # check if there is an entry for it in the dict
        if i in d:
            # if yes, add the value to the result-string
            result += d[i]
        else: pass
        # print the result
        print>>ofh, result



But I keep getting an error message and cannot get any further. I think I have missed the main point of the whole process, so I am lost.

file one looks:
3 Germany WW
3 Germany BR
3 Germany DR
4 France PR
4 France ST
> it has over 2 million rows of different countries

file two looks:
1 UK 2.3 3.1 5.3
2 US 3.3 3.4 2.3
3 Germany 1.3 5.1 4.1
4 France 2.3 3.1 3.3
> file two has about two thousand entries

The two files are to be combined based on the first column!

The output file should look like:
1 UK IR 2.3 3.1 5.3
1 UK SC 2.3 3.1 5.3
2 US WS 3.3 3.4 2.3
2 US CL 3.3 3.4 2.3
2 US ND 3.3 3.4 2.3
2 US TX 3.3 3.4 2.3
2 US NB 3.3 3.4 2.3
2 US SC 3.3 3.4 2.3
3 Germany WW 1.3 5.1 4.1
3 Germany BR 1.3 5.1 4.1
3 Germany DR 1.3 5.1 4.1
4 France PR 2.3 3.1 3.3
4 France ST 2.3 3.1 3.3

The files are large grid files converted into text and have to be combined. It does not matter whether all of the content of file two ends up joined with file one.

Any suggestions are greatly appreciated. I am stuck on this and cannot go further with my work.


Well, first, in both files each line starts with a number which seems to identify the country, then the country name, then the rest. First, write a function to parse a line:

import re
pattern = re.compile(r"(\d)+\s+(\w+)")

def parseLine(line):
    "returns a triple (country_number, country_name, rest)"
    match = pattern.match(line)
    return match.group(1), match.group(2), line[match.end():].strip()

Now, since file 2 has only 2000 lines, its content can be loaded into memory. We create a mapping country_number --> entry from the second file (we could also take the country name as the key, but let's assume that the number is unique for each country):

def createMapping(pathTwo):
    source = open(pathTwo)
    dic = dict()
    for line in source:
        dic[number] = Entry(line)
    return dic

class Entry(object):
    def __init__(self, line):
        self.number, self.name, self.data = parseLine(line)

Now we read the other file and, for each line read, complete the line using our dictionary and write it to an output file:

def merge(pathOne, pathTwo, pathOut):
    mapping = createMapping(pathTwo)
    source = open(pathOne)
    out = open(pathOut, "w")
    for line in source:
        number, country, rest = parseLine(line)
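        if number not in mapping:
            raise Exception("unknown country number '%s'" % line.strip())
        # append the data columns from file two to the line from file one
        out.write("%s %s\n" % (line.strip(), mapping[number].data))
    out.close()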

if __name__ == "__main__":
    merge("one.txt", "two.txt", "out.txt")

Thank you, Gribouillis, for your quick reply.

I get the following message and don't know how to correct it :(

Traceback (most recent call last):
  File "E:\Python_project\Python2\merge_file.py", line 49, in <module>
    merge("one.txt", "two.txt", "out.txt")
  File "E:\Python_project\Python2\merge_file.py", line 36, in merge
    mapping = createMapping(pathTwo)
  File "E:\Python_project\Python2\merge_file.py", line 28, in createMapping
    dic[number] = Entry(line)
NameError: global name 'number' is not defined


Sorry, change the function to:

def createMapping(pathTwo):
    source = open(pathTwo)
    dic = dict()
    for line in source:
        e = Entry(line)
        dic[e.number] = e
    return dic
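
For comparison, here is a minimal regex-free sketch of the same in-memory join. It is only an illustration, not the code above: it assumes the columns are whitespace-separated, it silently skips lines of file one that have no match (instead of raising), and the names parse_line, build_mapping and merge_lines are hypothetical. The sample rows are taken from the question and fed in through io.StringIO so the sketch runs without any files.

```python
import io

def parse_line(line):
    """Split a line into (id, name, rest); rest may be empty."""
    parts = line.split(None, 2)
    rest = parts[2].strip() if len(parts) > 2 else ""
    return parts[0], parts[1], rest

def build_mapping(lines_two):
    """Map each ID in file two to the data columns that follow the name."""
    mapping = {}
    for line in lines_two:
        if line.strip():
            number, _name, data = parse_line(line)
            mapping[number] = data
    return mapping

def merge_lines(lines_one, lines_two):
    """Yield each line of file one followed by the matching data of file two."""
    mapping = build_mapping(lines_two)
    for line in lines_one:
        if not line.strip():
            continue  # ignore blank lines
        number, _name, _rest = parse_line(line)
        if number in mapping:  # assumption: unmatched lines are skipped
            yield "%s %s" % (line.strip(), mapping[number])

# Sample rows from the question, held in memory for the demonstration.
one = io.StringIO(u"3 Germany WW\n3 Germany BR\n4 France PR\n")
two = io.StringIO(u"3 Germany 1.3 5.1 4.1\n4 France 2.3 3.1 3.3\n")
merged = list(merge_lines(one, two))
print("\n".join(merged))
```

Because merge_lines is a generator, file one is never held in memory as a whole; only the small mapping built from file two is.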

It works for very small files!

When the file size increases, say to 200 rows, it starts to merge rows with non-matching IDs.

Can it be improved?

There are several questions:
1/ does each country have a unique ID number ?
2/ does each ID number apply to a unique country ?
3/ does each country appear only once in file two ?
4/ does each country from file one have a record in file two ?
A possible modification is to use the country's name as the dictionary key (you replace dic[e.number] = e by dic[e.name] = e and mapping[number] by mapping[country]). Depending on the answers to the above questions, other things could be modified.

The answer is yes to the last three questions.

1. No. The unique ID is for a place in a country, not for the country itself, so a country doesn't have a unique ID. So we can't use the country as a dictionary key.

There are a large number of grid points (consider them small plots of land) in a country, and each has one unique ID.

2. Each ID applies to a unique place (grid point), so it is unique to that point.

3. Yes. The unique ID appears only once in file two.
4. Yes. All points in file one have a corresponding record in file two.

One record in file two can correspond to up to 36 IDs in file one.

Depending on the case, file two may have up to 700 columns.

I see the mistake I made. I didn't consider the 3rd item: for example, in 3 Germany WW, I forgot to take the WW into account. I'll modify the program. However, if file two can have up to 700 columns, we could try not to load its whole content into memory. For this, the question is:
are the files sorted ?

Here is a new version. I tested it with the files that you gave before and it works. It doesn't load the files into memory. Try it with more substantial input :)

import re
pattern = re.compile(r"(\d+)\s+(\w+)\s+")

class FileTwo(object):
  def __init__(self, pathTwo):
    self.path = pathTwo
    self.ifile = open(pathTwo)
    self.pos = dict()
    # read the file and create a dictionary  ID -> position in file
    curpos = self.ifile.tell()
    while True:
      line = self.ifile.readline()
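      if not line:
        break  # end of file
      if not line.strip():
        # skip blank lines, but keep the recorded position in step
        curpos = self.ifile.tell()
        continue
      index = line.index(" ")
      ID = int(line[:index])
      if ID in self.pos:
        raise Exception("ID %d appears twice in file '%s'" % (ID, self.path))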
      self.pos[ID] = curpos
      curpos = self.ifile.tell()
  def __getitem__(self, ID):
    # self[ID] returns the line corresponding to the given ID
    if ID not in self.pos:
      raise Exception("ID %d doesn't exist in file '%s'" % (ID, self.path))
    # jump to the recorded position before reading the line
    self.ifile.seek(self.pos[ID])
    return self.ifile.readline().strip()

def merge(pathOne, pathTwo, pathOut):
  file2 = FileTwo(pathTwo)
  out = open(pathOut, "w")
  for line in open(pathOne):
    index = line.index(" ")
    ID = int(line[:index])
    line2 = file2[ID]
    match = pattern.match(line2)
    rest = line2[match.end():]
    out.write("%s %s\n" % (line.strip(), rest))

if __name__ == "__main__":
  merge("one.txt", "two.txt", "out.txt")
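
The idea behind FileTwo (remember where each line starts, then seek back to it on demand) can be tried in isolation. The following is a minimal sketch under a few assumptions: the file is opened in binary mode so that the offsets are plain byte positions, the data is ASCII, and the helper names build_offsets and read_record as well as the temporary sample file are hypothetical.

```python
import os
import tempfile

def build_offsets(path):
    """Return a dict mapping each integer ID to the byte offset of its line."""
    offsets = {}
    with open(path, "rb") as f:
        while True:
            pos = f.tell()        # offset of the line about to be read
            line = f.readline()
            if not line:
                break             # end of file
            if line.strip():      # blank lines get no entry
                offsets[int(line.split(None, 1)[0])] = pos
    return offsets

def read_record(f, offsets, record_id):
    """Seek to the stored offset and return that line as stripped text."""
    f.seek(offsets[record_id])
    return f.readline().decode("ascii").strip()

# Hypothetical sample data written to a temporary file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"1 UK 2.3 3.1 5.3\n\n12 US 3.3 3.4 2.3\n3 Germany 1.3 5.1 4.1\n")

offsets = build_offsets(path)
with open(path, "rb") as f:
    last = read_record(f, offsets, 3)     # records can be read in any order
    middle = read_record(f, offsets, 12)
os.remove(path)
print(last)
print(middle)
```

Only one integer per record is kept in memory, so the width of each row (700 columns or more) does not matter.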

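I found the error: the (\d)+ in the pattern should have been (\d+). I corrected it in the previous post! A repeated group such as (\d)+ keeps only its last repetition, so group(1) held just the final digit of the ID and the code failed for IDs with more than one digit.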
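
A quick way to see the difference between the two patterns, as a small sketch (the sample line and the variable names are made up for the illustration):

```python
import re

buggy = re.compile(r"(\d)+\s+(\w+)")   # group 1 is a single digit, repeated
fixed = re.compile(r"(\d+)\s+(\w+)")   # group 1 is the whole run of digits

line = "123 Germany WW"
buggy_id = buggy.match(line).group(1)  # only the last repetition is kept
fixed_id = fixed.match(line).group(1)
print(buggy_id)  # 3
print(fixed_id)  # 123
```

With the buggy pattern, IDs such as 3, 13 and 123 all end up under the same dictionary key, which would explain why rows with non-matching IDs were merged once the files grew.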

Thanks a lot!

I will check it and let you know!!