Hi, I am working on a script that filters the unique error traces out of a log file. The log file may contain different types of exception traces (NullPointerException, IllegalArgumentException, etc.), and I want to retrieve each distinct trace and also count its occurrences.

My approach so far is an awk script that fetches each exception trace and writes it to a temporary file as it is read. The trace in the temporary file is then compared against the traces already in a master file and, if it matches none of them, appended to the master file. The script then moves on to the next trace and repeats the procedure until the end of the log file. In the end, the master file contains the unique exception traces.

Here is a snippet of the awk script:

awk '
# The last line of each log trace is :EL
/:EL/ {
    close("temp_file")                   # flush the trace before comparing it
    system("./compare.sh temp_file")
    system("rm temp_file")
    next
}
$0 != "" { print $0 >> "temp_file" }     # collect the non-empty lines of the current trace
' log_file

Note: compare.sh contains the comparison logic. It retrieves each trace from the master file and compares it with temp_file line by line.

The script seems to work fine, but the execution time is too long. I am looking for a new approach, or some help, to bring the execution time down to no more than 10-15 seconds per file. A log file normally contains 20 to 25 exception traces in total. The problem is that the master file keeps growing as each trace is fetched, so the number of comparisons keeps increasing too.

Thanks in advance.

Instead of storing the entire trace and checking against that, why not store some representation of it? For instance, md5sum takes negligible time on my system for a 13K file. You could store the MD5 of each unique trace you find, and for each new one only do something like

$ grep ${MD5} hash_list.txt


You will, of course, still need to store all of the exception output for analysis (in a separate location) but the search over the entries will be quicker.
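A minimal sketch of that lookup, assuming GNU md5sum is available and that each trace has already been written to its own file. The function name check_trace and the HASH_LIST variable are made up for illustration; only hash_list.txt comes from the command above.

```shell
#!/bin/bash
# Hypothetical helper: record a trace file's MD5 in the hash list if it is new.
HASH_LIST=${HASH_LIST:-hash_list.txt}

check_trace() {
    local trace_file=$1 md5
    # md5sum prints "<hash>  -" when reading stdin; keep only the hash field.
    md5=$(md5sum < "$trace_file" | awk '{print $1}')
    touch "$HASH_LIST"                 # make sure the list exists on the first run
    if grep -qF "$md5" "$HASH_LIST"; then
        echo "seen"
    else
        echo "$md5" >> "$HASH_LIST"
        echo "new"
    fi
}
```

Calling check_trace on the same file twice should print "new" and then "seen". The lookup cost depends on the number of hashes stored, not on the size of the traces.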

Thanks for the reply.

The problem with using MD5 is that my application runs in a multi-release environment. If a similar exception occurs in the same method on two different releases, the line numbers printed in the log trace might differ, and so would the MD5 output, even though the exception occurred for the same reason.

at com.authenticate.validatePassword.getPassword(PasswordValidator.java:23) -- Release 1
at com.authenticate.validatePassword.getPassword(PasswordValidator.java:32) -- Release 2

That's why I am comparing the lines from the two files after removing the line number, or in fact comparing only the portion before the brackets. So I have to read two files line by line, remove the bracket portion, and then compare. This process takes a lot of time to execute.
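One way to avoid the line-by-line comparison is to strip the bracket portion from the whole trace in a single pass and hash what remains, so that two releases produce the same value. This is only a sketch: trace_signature is a hypothetical name, GNU md5sum is assumed, and cutting at the first "(" would also truncate exception messages that happen to contain parentheses.

```shell
#!/bin/bash
# Hypothetical helper: print a hash of a trace file that ignores "(File.java:NN)".
trace_signature() {
    # Delete everything from the first "(" to the end of each line, then hash.
    sed 's/(.*$//' "$1" | md5sum | awk '{print $1}'
}
```

Two trace files that differ only in their line numbers then yield the same signature, and comparing a new trace against the known ones becomes a single string comparison.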

I was thinking more along the lines of something like:

$ cat hash.db
 $ cat t.t
at com.authenticate.validatePassword.getPassword(PasswordValidator.java:23) -- Release 1
at com.authenticate.validatePassword.getPassword(PasswordValidator.java:23) -- Release 1

 $ ./same_check.sh t.t
PREFIX: at com.authenticate.validatePassword.getPassword
MD5: 5fda96d64ab793b1afd5c99ecdc23153  -
Unique. Saving
Duplicate entry (at com.authenticate.validatePassword.getPassword)

 $ ./same_check.sh t.t
Duplicate entry (at com.authenticate.validatePassword.getPassword)
Duplicate entry (at com.authenticate.validatePassword.getPassword)

 $ cat t2.t
at com.authenticate.validatePassword.Foobalahub(PasswordValidator.java:23) -- Release 1
at com.authenticate.validatePassword.Whosamawhats(PasswordValidator.java:23) -- Release 1

 $ ./same_check.sh t2.t
PREFIX: at com.authenticate.validatePassword.Foobalahub
MD5: 62b032e311615bc7e80c388fccb4f652  -
Unique. Saving
PREFIX: at com.authenticate.validatePassword.Whosamawhats
MD5: e3397219ad75bcb26cad5ee5e9a80668  -
Unique. Saving

 $ cat hash.db
5fda96d64ab793b1afd5c99ecdc23153  -
62b032e311615bc7e80c388fccb4f652  -
e3397219ad75bcb26cad5ee5e9a80668  -

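The script looks like:

#!/bin/bash
HASH_DB=hash.db
touch "${HASH_DB}"    # make sure the database exists before the first lookup

while read -r line; do
    # Keep only the text before the first "(" so the line number is ignored.
    prefix=`echo "${line}" | awk 'BEGIN{FS="("}{print $1}'`
    MD5=`echo "${prefix}" | md5sum -`
    if grep -qF "${MD5}" "${HASH_DB}"; then
        echo "Duplicate entry (${prefix})"
    else
        echo "PREFIX: ${prefix}"
        echo "MD5: ${MD5}"
        echo "Unique. Saving"
        echo "${MD5}" >> "${HASH_DB}"
    fi
done < "${1}"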

This looks at each entry in the input exactly once and checks it against a database of previously seen entries. If the entry is not found, a new one is created; otherwise a notification is printed. The example run above should explain how it works in practice.

N.B. This does not address how you would store the actual text of the entries yourself. You'd have to do something with ${MD5} and ${prefix} to make that happen but that shouldn't be that hard.
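As a rough illustration of one way to do that, the sketch below keeps the first full text seen for each prefix in a file named after its hash, plus a tally file for the count. The save_entry function, the STORE_DIR variable and the traces directory layout are all assumptions, not part of the script above.

```shell
#!/bin/bash
# Hypothetical storage layout: $STORE_DIR/<md5>.txt holds the text,
# $STORE_DIR/<md5>.count holds one line per occurrence.
STORE_DIR=${STORE_DIR:-traces}

save_entry() {
    local prefix=$1 text=$2 md5
    md5=$(printf '%s\n' "$prefix" | md5sum | awk '{print $1}')
    mkdir -p "$STORE_DIR"
    # First sighting: keep the full text under the hash.
    [ -e "$STORE_DIR/$md5.txt" ] || printf '%s\n' "$text" > "$STORE_DIR/$md5.txt"
    # Every sighting: add one tally line, then print the running count.
    echo x >> "$STORE_DIR/$md5.count"
    wc -l < "$STORE_DIR/$md5.count" | tr -d ' '
}
```

Inside the loop this could be called as save_entry "${prefix}" "${line}", which prints how many times that prefix has been seen so far.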

At this point it might be sufficient to just store the line itself (since the MD5 is superfluous at that point). I'm having trouble understanding the exact context so you'll have to make that decision on your own.
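If the text itself is used as the key, the whole job can probably be done in one awk pass with an associative array and no temporary files. The sketch below assumes, as in the original post, that a line containing :EL ends each trace; count_traces is a hypothetical name, and the bracket portion is stripped only to build the key, so the first full trace of each kind is what gets printed.

```shell
#!/bin/bash
# Hypothetical one-pass version: print each unique trace with its count.
count_traces() {
    awk '
    /:EL/ {                        # end-of-trace marker
        if (key != "") {
            if (!(key in count)) { order[++n] = key; text[key] = trace }
            count[key]++
        }
        key = ""; trace = ""
        next
    }
    NF {                           # skip blank lines
        trace = trace $0 "\n"
        line = $0
        sub(/\(.*$/, "", line)     # drop "(File.java:NN) ..." so releases compare equal
        key = key line "\n"
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%d occurrence(s):\n%s\n", count[order[i]], text[order[i]]
    }' "$1"
}
```

Running count_traces log_file lists the unique traces in order of first appearance. Since each log line is read once and the lookup is an array access, the run time should grow with the size of the log rather than with the number of traces already stored.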

Thanks for the solution. That works great. The execution time has reduced drastically. Thanks once again.