Hi, I'm trying to count the number of unique IP addresses present in an Apache log file. I'm storing the unique IP addresses in a text file because this needs to scale to large numbers of addresses (upwards of 100 million), and a dictionary or any other data structure held in memory would make the machine run out of RAM.

import re
import os
import fileinput

def ipCounter(myline):
	if ippattern.search(myline):
		#search for all existing ips and store in list
		iplist = ippattern.findall(myline)
		#Open the IP file in read mode
		ip_file = open("unique_ips.txt",'r')
		for eachip in iplist:
			for line in ip_file:
				if line.find(eachip) > 0:
					break
				else:
					print "adding to file..."
					#Close file and open in append mode
					ip_file.close()
					ip_file = open("unique_ips.txt",'a')
					#Now write the new IP address found to the file
					ip_file.write(eachip)
					#Close file again
					ip_file.close()

ippattern = re.compile(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b")

#Iterate through all files in directory
for fname in os.listdir(os.getcwd()):
	if fname.find('LOG' or 'log')> 0:
		print "Processing....."+fname
		inputlog = fname
		#Open the log and start going through it line by line
		for line in fileinput.input(inputlog):
			ipCounter(line)
			


fileinput.close()

My final output file doesn't seem to get populated with any addresses. I'm not sure about the part in ipCounter() where I check whether an IP address is already present and, if not, append it to the file. I would be grateful for any help.

Thanks,
Adi

Can you give us "unique_ips.txt"? You could make a new, smaller file with something like 100 IP addresses.

Then it is easier to test out a solution.
And please use 4 spaces for indentation (PEP-8).


Can you give us "unique_ips.txt"? You could make a new, smaller file with something like 100 IP addresses.

Hi, I've attached unique_ips.txt with a few randomly generated IP addresses (with some repetitions in it).

Please remember that unique_ips.txt is the output file I'm generating, not the input file (which is an Apache log).

Attachments
100.11.23.45
100.11.23.46
100.11.23.47
100.11.23.48
100.11.23.49
100.11.23.50
100.11.23.51
100.11.23.52
100.11.23.53
100.11.23.54
100.11.23.55
100.11.23.56
100.11.23.57
100.11.23.58
100.11.23.59
100.11.23.60
100.11.23.61
100.11.23.62
100.11.23.63
100.11.23.64
100.11.23.65
100.11.23.66
100.11.23.67
100.11.23.68
100.11.23.69
100.11.23.70
100.11.23.71
100.11.23.72
100.11.23.73
100.11.23.74
100.11.23.75
100.11.23.76
100.11.23.77
100.11.23.78
100.11.23.79
100.11.23.80
100.11.23.81
100.11.23.82
100.11.23.83
100.11.23.84
100.11.23.85
100.11.23.86
100.11.23.87
100.11.23.88
100.11.23.89
100.11.23.90
100.11.23.91
100.11.23.92
100.11.23.93
100.11.23.94
100.11.23.95
100.11.23.96
100.11.23.97
100.11.23.98
100.11.23.99
100.11.23.100
100.11.23.101
100.11.23.102
100.11.23.103
100.11.23.104
100.11.23.105
100.11.23.106
100.11.23.107
100.11.23.108
100.11.23.109
100.11.23.110
100.11.23.111
100.11.23.112
100.11.23.113
100.11.23.114
100.11.23.115
100.11.23.116
100.11.23.117
100.11.23.118
100.11.23.119
100.11.23.120
100.11.23.121
100.11.23.122
100.11.23.123
100.11.23.124
100.11.23.125
100.11.23.126
100.11.23.127
100.11.23.128
100.11.23.129
100.11.23.130
100.11.23.131
100.11.23.132
100.11.23.133
100.11.23.134
100.11.23.135
100.11.23.136
100.11.23.137
100.11.23.138
100.11.23.139
100.11.23.140
100.11.23.141
100.11.23.142
100.11.23.143
100.11.23.144
100.11.23.145
100.11.23.146
100.11.23.147
100.11.23.148
100.11.23.149
100.11.23.150
100.11.23.151
100.11.23.152
100.11.23.153
100.11.23.154
100.11.23.155
100.11.23.156
100.11.23.157
100.11.23.158
100.11.23.159
100.11.23.160
100.11.23.161
100.11.23.162
100.11.23.163
100.11.23.164
100.11.23.165
100.11.23.166
100.11.23.167
100.11.23.168
100.11.23.169
100.11.23.170
100.11.23.171
100.11.23.172
204.54.128.62
204.54.128.63
204.54.128.64
204.54.128.65
204.54.128.66
204.54.128.67
204.54.128.68
204.54.128.69
204.54.128.70
204.54.128.71
204.54.128.72
204.54.128.73
204.54.128.74
100.11.23.47
100.11.23.48
100.11.23.49
204.54.128.75
204.54.128.76
204.54.128.77
204.54.128.78
204.54.128.79
204.54.128.80
204.54.128.81
204.54.128.82
204.54.128.83
204.54.128.84
204.54.128.85
204.54.128.86
204.54.128.87
204.54.128.88
204.54.128.89
204.54.128.90
204.54.128.91
204.54.128.92
204.54.128.93
204.54.128.94
204.54.128.95
204.54.128.96
204.54.128.97
204.54.128.98
204.54.128.99
204.54.128.100
204.54.128.101
204.54.128.178
204.54.128.179
204.54.128.180
204.54.128.181
204.54.128.102
204.54.128.103
204.54.128.104
204.54.128.105
204.54.128.106
204.54.128.107
204.54.128.108
204.54.128.109
204.54.128.110
204.54.128.111
204.54.128.112
204.54.128.113
204.54.128.114
204.54.128.115
204.54.128.116
204.54.128.117
204.54.128.118
204.54.128.119
204.54.128.120
204.54.128.121
204.54.128.122
204.54.128.123
204.54.128.124
204.54.128.125
204.54.128.126
204.54.128.127
204.54.128.128
204.54.128.129
204.54.128.130
204.54.128.131
204.54.128.132
204.54.128.133
204.54.128.134
204.54.128.135
204.54.128.136
204.54.128.137
204.54.128.138
204.54.128.139
204.54.128.140
204.54.128.141
204.54.128.142
204.54.128.143
204.54.128.144
204.54.128.145
204.54.128.146
204.54.128.147
204.54.128.148
204.54.128.149
204.54.128.150
204.54.128.151
204.54.128.152
204.54.128.153
204.54.128.154
204.54.128.155
204.54.128.156
204.54.128.157
204.54.128.158
204.54.128.159
204.54.128.160
204.54.128.161
204.54.128.162
204.54.128.163
204.54.128.164
204.54.128.165
204.54.128.166
204.54.128.167
204.54.128.168
204.54.128.169
204.54.128.170
204.54.128.171
204.54.128.172
204.54.128.173
204.54.128.174
204.54.128.175
204.54.128.176
204.54.128.177
204.54.128.178
204.54.128.179
204.54.128.180
204.54.128.181
204.54.128.182
204.54.128.183
204.54.128.184
204.54.128.185
204.54.128.186
204.54.128.187
204.54.128.188
204.54.128.189

You can look at this and see if it helps you out, and whether it is on the right track; it should be fast and not use too much memory. But it still has to be tried out at a larger scale.

#This is after ippattern.findall has found the IPs in the log file
'''-->ip_list.txt
100.11.23.45
100.11.23.46
100.11.23.45
100.11.23.48
100.11.23.49
100.11.23.50
100.11.23.51
100.11.23.52
100.11.23.53
100.11.23.52
100.11.23.55
100.11.23.56
100.11.23.55
100.11.23.58
'''
#ip_list.txt has 11 unique IPs out of 14 total

#open a file for write 
ip_file = open("unique_ips.txt",'w')

#First we count the unique IPs
count_unique_IP = sum(1 for line in set(open('ip_list.txt')))
print count_unique_IP

#Then we find and sort the unique IPs
with open('ip_list.txt') as f:
    lines = sorted(set(line.strip('\n') for line in f))
for line in lines:
    print line  #Test print
    ip_file.write(line + '\n')
#Close only once, after the loop; note that close() needs parentheses
ip_file.close()

'''
11
-->unique_ips.txt
100.11.23.45
100.11.23.46
100.11.23.48
100.11.23.49
100.11.23.50
100.11.23.51
100.11.23.52
100.11.23.53
100.11.23.55
100.11.23.56
100.11.23.58   
'''


How about doing one of these:
1) A statistics run with a normal in-memory set, then using the starting numbers found in memory to sort, with a repeated read of the file for each starting number.
2) Write each IP xx.yy.zz.aa to a file ipsxx.txt, where xx, yy, zz, aa are two-digit hex representations of the octets. Even dividing 16 ways by the very first hex digit would probably be enough: open sixteen files and write each address to the file picked by the first hex digit of its first octet. Then load the files one by one into a list, checking each address against the set of previously read addresses from that file (skipping it if it has already been read), and sort each bucket by putting it in a set, making a list of it, and sorting. Remember to use an equal-length string representation for sorting, or prepare a list of lists of four numbers (by split('.') and int on each part). Of course it is also possible to use decimal numbers, but for string sorting you then need zero-padded three-digit integers.

For filtering out valid addresses you can adapt my scanning routine for valid date strings (only one more part, and only the point accepted as separator): if you split the line at the next point and check that the end of the first part is a valid first octet (0..255), the rest has the same form as my routine's date, only each part has a value in 0..255.
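A minimal sketch of the 16-way bucketing idea above (all file and function names here are my own, hypothetical choices): route each address to one of sixteen bucket files by the first hex digit of its first octet, then dedupe and sort one bucket at a time, so only one bucket ever has to fit in memory.

```python
# Sketch of the 16-way bucketing approach; bucket file names
# (bucket_0.txt .. bucket_f.txt) are made up for illustration.

def bucket_index(ip):
    # First hex digit of the first octet (0-15) picks the bucket.
    return int(ip.split('.', 1)[0]) // 16

def write_buckets(ips, prefix='bucket_'):
    # Phase 1: stream every address into its bucket file.
    buckets = [open('%s%x.txt' % (prefix, i), 'w') for i in range(16)]
    try:
        for ip in ips:
            buckets[bucket_index(ip)].write(ip + '\n')
    finally:
        for f in buckets:
            f.close()

def sort_key(ip):
    # Compare as tuples of integer octets, not lexically as strings.
    return tuple(int(part) for part in ip.split('.'))

def unique_sorted_bucket(path):
    # Phase 2: dedupe and sort a single bucket; only this bucket
    # is ever held in memory.
    with open(path) as f:
        return sorted(set(line.strip() for line in f), key=sort_key)
```

Sorting by a tuple of integer octets sidesteps the zero-padding trick mentioned above, since tuples compare numerically.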

Hi, I'm trying to count the number of unique IP addresses present in an apache log file. I am using a text file to store the unique IP addresses as this needs to scale to large numbers of IP addresses (upwards of 100 million) so using a dictionary or any other data structure stored in memory would make the machine run out of RAM.
Adi

How much RAM does the machine have? When I tested processing such an IP file, there was no problem handling it in memory, though of course it is not so fast any more for big amounts.

Here are screenshots of the test output for the case of a million random IPs. The machine has 1.5 GB of memory and an old single-core Sempron processor.

Attachments: huge_start.gif (29.68 KB), huge_end.gif (30.66 KB)

And no one has mentioned an SQLite file on disk because...? You could also use dbm or anydbm if you don't want to try SQLite, though the speed of these for that many records may or may not be a drawback. Also, with these, each key has to be unique, but if you just want to store/count unique addresses that should not be a problem.
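For what it's worth, the SQLite route could look roughly like this (the database and table names are made up for illustration): a PRIMARY KEY column plus INSERT OR IGNORE keeps the deduplication on disk instead of in RAM.

```python
import sqlite3

# Hypothetical database/table names; SQLite keeps the data on disk,
# so memory use stays flat no matter how many addresses are stored.
conn = sqlite3.connect('unique_ips.db')
conn.execute('CREATE TABLE IF NOT EXISTS ips (addr TEXT PRIMARY KEY)')

def add_ip(ip):
    # INSERT OR IGNORE silently skips addresses that are already stored.
    conn.execute('INSERT OR IGNORE INTO ips VALUES (?)', (ip,))

for ip in ['100.11.23.45', '100.11.23.46', '100.11.23.45']:
    add_ip(ip)
conn.commit()

# The count of unique addresses is just a COUNT(*) over the table.
unique_count = conn.execute('SELECT COUNT(*) FROM ips').fetchone()[0]
```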

I wrote conversion routines between the IP dotted notation and integers, and using those I got the routine (read the numbers into a set, convert to a list, sort, write to file) to the point where it handles 10 million numbers (random ones, a totally unrealistic case, but heavier for the routine than real data with its many duplicates).
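The conversion routines mentioned might look something like this sketch: each octet packs into 8 bits of a 32-bit integer, so a set of integers is far smaller than a set of strings, and sorting the integers sorts the addresses numerically for free.

```python
def ip_to_int(ip):
    # '100.11.23.45' -> 32-bit integer: each octet occupies 8 bits.
    a, b, c, d = (int(part) for part in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def int_to_ip(n):
    # Inverse: unpack the four octets back into dotted notation.
    return '%d.%d.%d.%d' % (n >> 24 & 255, n >> 16 & 255,
                            n >> 8 & 255, n & 255)
```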

9820000 251.175.175.166
9830000 251.239.242.164
9840000 252.48.227.31
9850000 252.115.162.250
9860000 252.181.196.27
9870000 252.248.67.194
9880000 253.57.28.211
9890000 253.122.191.2
9900000 253.189.86.252
9910000 253.255.102.118
9920000 254.64.203.18
9930000 254.132.9.249
9940000 254.196.90.223
9950000 255.5.230.221
9960000 255.71.96.186
9970000 255.135.189.144
9980000 255.201.68.112
1 min 40.639 s
ready, result stored in "unique_ip.txt". Push enter to quit.