Hi All

Hope everyone's doing well.

I have two tab-delimited files with 3 columns each; file1 contains 600050 rows and file2 contains 11221133 rows.

I am comparing file2 with file1 to match common entries in the first two columns: if file1's first two columns appear in file2, write those two columns plus file2's third column; otherwise write file1's first two columns plus a default value (my code below uses 0).

I did this using two dictionaries, with the first two columns as the key and the third column as the value, but loading file2 into a dictionary gives a memory error:

dict2[item1,item2] = item3
MemoryError

I also tried a list, but that gives a memory error too. A set works, but there are huge numbers of duplicates and it is unordered; I want to preserve the same order as file1.

Find the test files attached.

f1 = open(file1)
f2 = open(file2)   # very very large file
f3 = open(file3, 'w')

dict1 = {}
dict2 = {}
for line in f1:
    stripped = line.strip('\n')
    item1, item2, item3 = stripped.split()
    dict1[item1, item2] = item3

for line in f2:
    stripped = line.strip('\n')
    item1, item2, item3 = stripped.split()
    dict2[item1, item2] = item3        # here it gives the memory error

for item in dict1.keys():
    if item in dict2:
        data = item[0] + '\t' + item[1] + '\t' + dict2[item] + '\n'
        f3.write(data)
    else:
        data = item[0] + '\t' + item[1] + '\t' + str(0) + '\n'
        f3.write(data)
f1.close()
f2.close()
f3.close()
Attachments

992	3096	1
992	225	1
992	1780	1
992	1523	1
991	786	1
991	760	1
991	3095	1
991	1712	1
991	1528	1
988	646	1
988	2279	1
...
107	1282	0
107	1281	0
107	1280	0
107	128	0
107	1279	0
...

A second dictionary is not required. Create a tuple containing item1 and item2 and look it up in the dictionary. If found, you have a match. If you want to keep the keys in the same order as the file, use an ordered dictionary; for Python 2.7 it is in "collections".
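
If I'm reading that suggestion right, a minimal sketch of it could look like the following, assuming the default for unmatched rows is 0 (as in the code above) and reusing the file1/file2/file3 names from the original post:

from collections import OrderedDict   # the ordered dictionary is in collections for Python 2.7

# Sketch only: file1 (the small file) is the only thing held in memory;
# file2 is streamed line by line, so no second dictionary is built.
d1 = OrderedDict()
with open(file1) as f1:
    for line in f1:
        item1, item2, item3 = line.split()
        d1[(item1, item2)] = '0'            # default for unmatched rows, as in the else branch above

with open(file2) as f2:
    for line in f2:
        item1, item2, item3 = line.split()
        if (item1, item2) in d1:            # tuple lookup in the existing dictionary
            d1[(item1, item2)] = item3      # keep file2's third column for the match

with open(file3, 'w') as f3:
    for (item1, item2), value in d1.items():   # written back in file1's original order
        f3.write(item1 + '\t' + item2 + '\t' + value + '\n')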

Umm. Too bad that the data you need to keep is in the shorter file. Still, if there is enough room to make a set of pairs from the large file, you can do it like this:

openfile2 = open(file2, 'r')
matchset = set(tuple(x.split()[:2]) for x in openfile2)   # only the column-pair keys, no values
openfile1 = open(file1, 'r')
openout = open(outfile, 'w')
for row in openfile1:
    items = row.split()
    if tuple(items[:2]) in matchset:
        openout.write(" ".join(items[:3]) + '\n')
    else:
        openout.write(" ".join([items[0], items[1], '0']) + '\n')   # 0 for unmatched rows, as in the original post
openout.close()
openfile1.close()
openfile2.close()

If that still causes a memory error, then you will have to do it multiple times. Split file2 into some smaller parts and do this:

  • Treat each sub-file of file 2 as the entire file2 in the algorithm above
  • Instead of writing the final result on a non-match, write the whole line
  • On the next pass, use the modified outfile as the infile:
    • Already modified lines will either be recreated or untouched: OK
    • Some unmodified lines will be matched and modified: OK
  • On the last pass, use the algorithm above, with file1 as the (by now mostly modified) input file and the last part of file2 supplying the set of pairs.

Note that this only works because you want adjacent columns if matched. In other cases, you will have to write the whole line with a first character/column flag for matches (don't look at those lines in later passes). Then make a final pass through the file, rewriting it according to the marks.
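
Purely as an illustration of that note, a flagged multi-pass version could be sketched like this (the chunk file names are hypothetical, for example produced beforehand with the Unix split command; the file1 and outfile names are reused from the code above, and unmatched rows default to 0 as in the original post):

import glob

chunks = sorted(glob.glob('file2_chunk_*'))   # hypothetical pre-split pieces of file2
current = file1                               # the first pass reads the small file

for n, chunk in enumerate(chunks):
    with open(chunk) as c:
        pairs = set(tuple(x.split()[:2]) for x in c)   # only this chunk's pairs are in memory
    nextfile = 'pass%d.tmp' % n
    with open(current) as fin, open(nextfile, 'w') as fout:
        for row in fin:
            if row.startswith('#'):           # already flagged as matched in an earlier pass
                fout.write(row)
            elif tuple(row.split()[:2]) in pairs:
                fout.write('#' + row)         # flag the match, resolve it in the final pass
            else:
                fout.write(row)               # keep the whole line for later passes
    current = nextfile

# Final pass: rewrite every row according to its mark.
with open(current) as fin, open(outfile, 'w') as fout:
    for row in fin:
        matched = row.startswith('#')
        items = row.lstrip('#').split()
        third = items[2] if matched else '0'  # matched rows keep their third column, as in the code above
        fout.write('\t'.join([items[0], items[1], third]) + '\n')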

It's still giving a memory error :-(.

I tried using an ordered dict, but it still gives the memory error. I tried it on a small set, where it works, but the order is not preserved.

My code is like:

for row in openfile2:
    line = row.split()
    if tuple(line[:2]) in od.keys():
        print line
    else:
        pass   # here I want to print the key/value pair from od for which there is no entry in file2

Can anyone help with this?

I tried using the ordered dict, but it's still giving the memory error

That is too vague to be of any value. Post the actual error message. Also, this line
if tuple(line[:2]) in od.keys():
should just be
if tuple(line[:2]) in od:

".keys" returns a list of the keys which means there is double the amount of memory for the number of keys. If you can not do this with the memory available, then you want to use an SQLite database on disk instead.

I can only suggest again that you split the work into more manageable pieces. I tried to read in that many triples from a file (generated with file.write("%s\t%s\t%s\n" % (random.random(), random.random(), random.random()))) and was unable to get much past 8 million lines. It did proceed very slowly after that, but there was obviously a lot of thrashing going on: it took anything from 0.3 seconds to 850 seconds to read the next 1000 rows. Yes: nearly 15 minutes! I'm running on OS X with 4 GB of memory.

While I was doing this work, it occurred to me that you might have duplicate keys in your long file. What should happen in that case? There are three options:

  1. First key wins
  2. Last key wins
  3. Value for a randomly chosen key is used

(Up to 11 million rows, my random data had no duplicate keys. I killed the program at that point since it was apparent it would not finish in a reasonable time.)
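
To make the first two options concrete, here is a toy sketch (the duplicate pair is invented):

rows = [('98', '903', '1'), ('98', '903', '7')]   # a hypothetical duplicated key

first_wins = {}
last_wins = {}
for a, b, c in rows:
    first_wins.setdefault((a, b), c)   # option 1: later duplicates are ignored
    last_wins[(a, b)] = c              # option 2: later duplicates overwrite

print first_wins[('98', '903')], last_wins[('98', '903')]   # prints: 1 7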

For reference, here's my code for reading from the file:

#!/usr/bin/env python

from random import random
import time

data = {}
doublecounter = 0

def mumble(i, lp, s, ss, e):
    # Report progress: time for the last chunk and total elapsed time.
    looptime = e - ss
    totaltime = e - s
    print "%8d: count: %d, (lp:%d) lptime: %2.2f, ttime: %2.2f" % (i, doublecounter, lp, looptime, totaltime)

def doit(f):
    when = 1000000                     # report every `when` lines, shrinking as reading slows down
    global data, doublecounter
    start = time.time()
    count = 0
    lpstart = start
    end = start
    for line in f:
        if 0 == count % when:
            lpstart = end
            end = time.time()
            mumble(count, when, start, lpstart, end)
            if count == 7000000: when /= 10
            elif count == 8800000: when /= 10
            elif count == 11000000: when /= 10
        s = line.split()
        k = tuple(s[:2])               # first two columns form the key
        v = s[2]
        data.setdefault(k, [])
        data[k].append(v)              # keep every value so duplicate keys can be counted
        if len(data[k]) > 1:
            doublecounter += 1
        count += 1

with open('f2', 'r') as f:
    doit(f)

and here are the first several rows of my f2 file:

0.681726943412	0.317524601127	0.774220362723
0.960827529946	0.884868924006	0.805958559062
0.948431957255	0.654394548708	0.261958105771
0.790787661492	0.588754682813	0.784801700146
0.91496579649	0.65679730019	0.643389604304
0.410742283212	0.266691538578	0.251305611073
0.452187326938	0.537941526934	0.162800839411
0.298231566648	0.287904077361	0.553563473187
0.892003052642	0.483519506157	0.605940960314
0.118257450942	0.51597182572	0.868219791638

(the whitespace separators are tabs)
