Hi, I'm trying to count the number of unique IP addresses present in an Apache log file. I am using a text file to store the unique IP addresses because this needs to scale to a large number of IP addresses (upwards of 100 million), so a dictionary or any other data structure held in memory would make the machine run out of RAM.

import re
import os
import fileinput

def ipCounter(myline):
	if ippattern.search(myline):
		#search for all existing ips and store in list
		iplist = ippattern.findall(myline)
		#Open the IP file in read mode
		ip_file = open("unique_ips.txt",'r')
		for eachip in iplist:
			for line in ip_file:
				if line.find(eachip) > 0:
					break
				else:
					print "adding to file..."
					#Close file and open in append mode
					ip_file.close()
					ip_file = open("unique_ips.txt",'a')
					#Now write the new IP address found to the file
					ip_file.write(eachip)
					#Close file again
					ip_file.close()

ippattern = re.compile(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b")

#Iterate through all files in directory
for fname in os.listdir(os.getcwd()):
	if fname.find('LOG' or 'log')> 0:
		print "Processing....."+fname
		inputlog = fname
		#Open the log and start going through it line by line
		for line in fileinput.input(inputlog):
			ipCounter(line)

fileinput.close()

My final output file doesn't seem to get populated with any addresses. I'm not sure about the part in ipCounter() where I check whether the IP address is already present and, if not, append it to the file. I would be grateful for any help.

Thanks,
Adi
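
Looking at ipCounter() as posted, a few things stand in the way. If unique_ips.txt starts out empty, the body of the inner loop over ip_file never runs, so nothing is ever appended. The test line.find(eachip) > 0 treats a match at the very start of a line (index 0) as a miss. The else belongs to the if, so the append fires on the first line that does not match instead of after the whole file has been checked. The address is also written without a trailing newline. Below is a minimal sketch of the check-then-append step only, assuming one address per line in the output file; the name add_if_new is made up for this illustration.

```python
import os

def add_if_new(ip, path="unique_ips.txt"):
    """Append ip to the file at path unless it is already listed there.

    Returns True if the address was added, False if it was already present.
    """
    # Scan the existing file first; a missing file simply means "no IPs yet".
    if os.path.exists(path):
        with open(path) as ip_file:
            for line in ip_file:
                # Compare whole lines so 1.2.3.4 does not match 11.2.3.45.
                if line.strip() == ip:
                    return False
    # Only after the whole file has been checked do we append, with a newline.
    with open(path, "a") as ip_file:
        ip_file.write(ip + "\n")
    return True
```

This rescans the whole file for every address, so it will be very slow long before 100 million entries; it only shows how the membership check and the append fit together.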

Can you give us "unique_ips.txt"?
You can make a new, smaller file with something like 100 IP addresses.

Then it is easier to test out a solution.
And please use 4 spaces for indentation (PEP 8).

Hi, I've attached unique_ips.txt with a few randomly generated IP addresses (with some repetitions in it).

Please remember that unique_ips.txt is the output file I'm generating and not the input file (which is an Apache log).

You can look at this and see if it helps you out.
See if this is on the right track; it should be fast and not use too much memory.
But you will have to try it out on a larger scale.

#This is after ippattern.findall has found the IPs in the log file
'''-->ip_list.txt
100.11.23.45
100.11.23.46
100.11.23.45
100.11.23.48
100.11.23.49
100.11.23.50
100.11.23.51
100.11.23.52
100.11.23.53
100.11.23.52
100.11.23.55
100.11.23.56
100.11.23.55
100.11.23.58
'''
#ip_list.txt has 11 unique IPs out of 14 total

#First we collect the unique IPs; stripping the newline means a last
#line without a trailing newline is not counted as a separate entry
with open('ip_list.txt') as f:
    unique_ips = set(line.strip() for line in f if line.strip())

#Then we count them
count_unique_IP = len(unique_ips)
print(count_unique_IP)

#Then we sort the unique IPs (as strings) and write them, one per line
with open("unique_ips.txt", 'w') as ip_file:
    for ip in sorted(unique_ips):
        print(ip)  #Test print
        ip_file.write(ip + '\n')

'''
11
-->unique_ips.txt
100.11.23.45
100.11.23.46
100.11.23.48
100.11.23.49
100.11.23.50
100.11.23.51
100.11.23.52
100.11.23.53
100.11.23.55
100.11.23.56
100.11.23.58   
'''

How about doing
1) Do a statistics run with a normal in-memory set to find which leading numbers occur, then sort by re-reading the file once for each leading number.
2) Or write each IP xx.yy.zz.aa to a file ipsxx.txt, where xx, yy, zz, aa are two-digit hex representations of the four IP numbers. Maybe even splitting into 16 files by the very first hex digit would be enough: you would open sixteen files and write each IP to the file matching the first digit of its first number. Then load the files one at a time into a list, skipping any address already read from that file (check against a set of the ones seen so far), and sort each file's addresses by putting them in a set, making a list of it and sorting. Remember to use an equal-length representation in string form, or prepare a list of lists of four numbers for sorting, using partition or split('.') and int on every part. Of course it is also possible to use decimal numbers, but for string sorting each part then needs three digits.

For filtering, you can adapt my scanning routine for valid date strings (an IP has just one more part, and only a dot is accepted as separator). If you split the text at the next dot and check that the end of the first piece is a valid first IP number (0..255), the rest has the same form as the date in my date routine, except that each part has a value in 0..255.
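
The sixteen-file idea above could be sketched roughly as follows. This is only an illustration under assumptions: the function names and the bucket file names ips0.txt to ipsf.txt are made up here, the validity check is a plain split on dots (not the date-scanning routine mentioned above), and each bucket is assumed to fit in memory on its own.

```python
import os

def ip_to_hex(ip):
    """Return the IP as 8 hex digits, or None if it is not a valid IPv4 string."""
    parts = ip.strip().split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        return None
    try:
        numbers = [int(part) for part in parts]
    except ValueError:
        return None
    if any(n > 255 for n in numbers):
        return None
    # Fixed-width hex strings sort in the same order as the numeric addresses.
    return "".join("%02x" % n for n in numbers)

def hex_to_ip(hex_ip):
    """Turn 8 hex digits back into dotted decimal notation."""
    return ".".join(str(int(hex_ip[i:i + 2], 16)) for i in range(0, 8, 2))

def unique_sorted_ips(ips, workdir):
    """Yield each valid IP once, in numeric order, using 16 bucket files."""
    names = [os.path.join(workdir, "ips%x.txt" % i) for i in range(16)]
    buckets = [open(name, "w") for name in names]
    try:
        for ip in ips:
            hex_ip = ip_to_hex(ip)
            if hex_ip is not None:
                # The first hex digit (0-f) selects the bucket file.
                buckets[int(hex_ip[0], 16)].write(hex_ip + "\n")
    finally:
        for bucket in buckets:
            bucket.close()
    for name in names:
        # Only one bucket is held in memory at a time.
        with open(name) as bucket:
            unique = sorted(set(line.strip() for line in bucket))
        for hex_ip in unique:
            yield hex_to_ip(hex_ip)
```

Because the buckets are read in order 0 to f and each one is sorted, the overall output is sorted too. Real logs are not evenly spread over the first hex digit, so some buckets may be much larger than others.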

How much RAM does the machine have? When I tested processing such an IP file, there was no problem processing it in memory, although of course it is not so super fast any more for big amounts.

Here are screenshots of the test output for the case of a million random IPs. The machine has 1.5 GB of memory and an old single-core Sempron processor.

And no one has mentioned an SQLite file on disk because? You could also use dbm or anydbm if you don't want to try SQLite, but speed of these for that many records may or may not be a drawback. Also with these, each key has to be unique, but if you just want to store/count unique addresses that should not be a problem.
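
A minimal sketch of the SQLite route, assuming the addresses have already been extracted from the log. The table name ips, the default file name unique_ips.db and the function name are made up for this example. The PRIMARY KEY enforces uniqueness on disk, and INSERT OR IGNORE skips addresses that are already stored.

```python
import sqlite3

def count_unique_ips(ips, db_path="unique_ips.db"):
    """Store each IP once in an SQLite file and return the number of unique IPs."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS ips (ip TEXT PRIMARY KEY)")
        # INSERT OR IGNORE silently skips rows that would violate the key.
        conn.executemany(
            "INSERT OR IGNORE INTO ips (ip) VALUES (?)",
            ((ip,) for ip in ips),
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM ips").fetchone()[0]
    finally:
        conn.close()
```

Passing a generator keeps the input out of memory; whether the insert speed is acceptable for 100 million rows would have to be measured.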

I wrote conversion routines between dotted IP notation and integers. Using those, I got the routine (read the numbers into a set, convert to a list, sort, write to a file) to the point where it handles 10 million numbers (random, a totally unrealistic case, but heavier for the routine than reality with many duplicates).

9820000 251.175.175.166
9830000 251.239.242.164
9840000 252.48.227.31
9850000 252.115.162.250
9860000 252.181.196.27
9870000 252.248.67.194
9880000 253.57.28.211
9890000 253.122.191.2
9900000 253.189.86.252
9910000 253.255.102.118
9920000 254.64.203.18
9930000 254.132.9.249
9940000 254.196.90.223
9950000 255.5.230.221
9960000 255.71.96.186
9970000 255.135.189.144
9980000 255.201.68.112
1 min 40.639 s
ready, result stored in "unique_ip.txt". Push enter to quit.
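
The routines behind that run were not posted, so the following is only a sketch of what a conversion between dotted notation and integers plus the set, sort and write steps might look like. All function names are assumptions; the output file name is the one shown in the run above.

```python
def ip_to_int(ip):
    """Pack a dotted IPv4 string such as '10.0.0.1' into a single integer."""
    a, b, c, d = (int(part) for part in ip.strip().split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def int_to_ip(number):
    """Unpack an integer back into dotted IPv4 notation."""
    return ".".join(str((number >> shift) & 255) for shift in (24, 16, 8, 0))

def write_sorted_unique(ips, out_path="unique_ip.txt"):
    """Deduplicate via a set of integers, sort numerically, write one IP per line."""
    unique = sorted(set(ip_to_int(ip) for ip in ips))
    with open(out_path, "w") as out:
        for number in unique:
            out.write(int_to_ip(number) + "\n")
    return len(unique)
```

Storing integers instead of strings keeps the set smaller and makes the sort numeric, so 9.0.0.1 comes before 100.0.0.1.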