Hi

I have two python lists. list1 is a \n delimited web proxy log file. list2 is a list of domain names. I want to cycle through the list of domain names and for every log file entry check if the domain name exists anywhere in the log entry. If it exists then print out the log entry in full to stdout.

list1 = [20100122 http google.com 200, 20100124 http hushmail.com 200, 20100123 http microsoft.com 404 ]
list2 = [google.com, yahoo.com, msn.com, hotmail.com, gmail.com]

The expected result of the python script would be:

$./parse-logfile.py
Line 1: 20100122 http google.com 200
Line 3: 20100123 http microsoft.com 404
$

Psuedo code would look something like:

read in list1, read in list2
for each entry in list2
for each entry in list1
if entry in list1 exists anywhere in list2 then
print 'Line', linenum
print Log Entry

I've played with strings, lists and sets but can't see to find exactly what I need.

Thanks.

Recommended Answers

All 4 Replies

I can see 2 solutions :

list1 = ["20100122 http google.com 200", "20100124 http hushmail.com 200", "20100123 http microsoft.com 404" ]
list2 = ["google.com", "yahoo.com", "msn.com", "hotmail.com", "gmail.com"]
# 1
for i, e1 in enumerate(list1):
    for e2 in list2:
        if e2 in e1:
            print ("line %d : %s" % (i,e1))

# 2
for i,e in enumerate(list1):
    d=e.split()[2]
    if d in list2:
        print ("line %d : %s" % (i,e))

Note that I don't understand where your microsoft output line comes from.

Thanks for the reply. I'll have a chance to try these solutions later and get back to the list. Yes you're right there is a typo in list2. It should read:

list2 = [google.com, yahoo.com, microsoft.com, hotmail.com, gmail.com]

I can see 2 solutions :

list1 = ["20100122 http google.com 200", "20100124 http hushmail.com 200", "20100123 http microsoft.com 404" ]
list2 = ["google.com", "yahoo.com", "msn.com", "hotmail.com", "gmail.com"]
# 1
for i, e1 in enumerate(list1):
    for e2 in list2:
        if e2 in e1:
            print ("line %d : %s" % (i,e1))

# 2
for i,e in enumerate(list1):
    d=e.split()[2]
    if d in list2:
        print ("line %d : %s" % (i,e))

Note that I don't understand where your microsoft output line comes from.

Thanks again for your reply. I have tried both pieces of code. The logic works fine for each but the comparison in the if statement always fails. The domain name is never found in the log entry.

I add print statements in the for loops and everything looks fine. It's cycling through the two files correctly.

I'm wondering if there are hidden characters that are causing to fail but I'm not sure. I'm reading in the two files as follows:

r=open('filename', r)
r_lines=r.readlines()

apologies for replying to my own post. I think it was trailing white space that caused my issues. The following code is now working:

for i, value1 in enumerate(proxy_lines):
   d=value1.split()[10]
   for value2 in sites_lines:
        if value2.strip() == d:
           print 'Hit found in', ("line %d : %s (i, value1))
Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.