[SOLVED] Filtering out duplicate lines from a find/grep output

thundervolt · 03-22-2010, 04:07 PM

Hi all,
I'm struggling a bit with this.
I have some big log files that contain errors printed by an app.
Most of the time they are relevant, but most of them are similar.
So I figured I could check what happened within a time interval with a find.
I'm using this one:

Code:

find */application/*/app.log -type f -print0 | xargs -0 grep -E " 15:|16:1|16:2"

And I get an output similar to this one.

Code:

server1/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...
server1/application/log/app.log:2010-Mar-22 15:20:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...
server1/application/log/app.log:2010-Mar-22 16:25:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...

Is there a way to condense the output lines to get only one or two, indicating the first and last occurrence of a block?
Or do I need to write a program to do so?

Because right now I get thousands of similar lines, and when I'm scrolling through them I sometimes miss relevant information that I would otherwise have noticed if the output weren't so spammy.

I hope my question is clear and you guys can help me,
Thanks in advance and regards.

chrism01 · 03-22-2010, 07:08 PM

I'm not 100% clear what you think you want to match/see.
However, if you just want lines with 'ERROR ..'

grep ERROR filename

if you want to see a few lines before and/or after such a line

grep ERROR -A3 -B3 filename

http://linux.die.net/man/1/grep (A=after, B=before)

If you want all lines in a time period, say 15:10 - 16:10, you could try logwatch maybe? Otherwise I'd write Perl to do it. The problem (for time periods) is that although e.g. sed can pull out data based on a start-line match and an end-line match, if you can't guarantee the logfile will always(!) have a log record for both given timestamps, you'll need to write your own more intelligent/flexible program.
Writing your own also means you can make it smart enough to return only the records you want to see in that time period.
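
A rough sketch of that "own program" idea, using awk here rather than Perl. It compares the HH:MM part of each timestamp as a string, so it doesn't need a record to exist for the exact start or end time. The function name time_window is made up for illustration, and it assumes each line starts with a date followed by a zero-padded time, as in the sample lines of the first post (without the filename prefix):

```shell
# Hypothetical helper: print log lines whose time of day falls in [start, end).
# Assumes lines look like: 2010-Mar-22 15:16:21,428 [thread] ERROR ...
# $1 = start as HH:MM, $2 = end as HH:MM (end is exclusive).
time_window() {
    awk -v start="$1" -v end="$2" '
        { hhmm = substr($2, 1, 5) }      # e.g. "15:16" taken from the 2nd field
        hhmm >= start && hhmm < end      # no action given, so matching lines are printed
    '
}

# Usage (file name is only an example):
#   time_window 15:10 16:10 < app.log
```

Lines that don't start with a timestamp (continuation lines of a stack trace, for example) would be dropped by this sketch, and it doesn't handle a window that crosses midnight.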

dsmyth · 03-23-2010, 05:28 AM

Hi, perhaps the program "uniq" will do the job.

Or maybe not, just noticed the date... sorry.

grail · 03-23-2010, 05:33 AM

Maybe if we knew more about the errors you 'are' looking for we can help with a better regex?

berbae · 03-23-2010, 09:12 AM

Maybe he can just try to append :

|uniq --count --skip-fields=2

to the command line he gives in the first post.
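
To make it clearer what that does, here is a small sketch with the sample lines from the first post (the function name collapse_adjacent is just for illustration). uniq treats runs of blanks as field separators, so skipping 2 fields skips "server1/application/log/app.log:2010-Mar-22" and the time, and the comparison starts at the thread name:

```shell
# Hypothetical wrapper around the suggested uniq call (GNU long options).
# Adjacent lines that are identical after the first two fields collapse into
# one line, prefixed with the number of lines merged.
collapse_adjacent() {
    uniq --count --skip-fields=2
}

# Example: the first three lines differ only in their time, the fourth differs.
printf '%s\n' \
  'server1/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...' \
  'server1/application/log/app.log:2010-Mar-22 15:20:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...' \
  'server1/application/log/app.log:2010-Mar-22 16:25:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...' \
  'server1/application/log/app.log:2010-Mar-22 16:25:21,514 ERROR! unable to retrieve credentials' |
  collapse_adjacent
```

Two things to keep in mind: uniq only merges lines that are next to each other, and because the file name sits in the first field, it is skipped too, so identical messages from different servers would be merged if they happen to be adjacent.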

thundervolt · 03-24-2010, 01:39 PM

First of all, thanks a lot for your answers, they enlightened me.
And it was a very close approach

@chrism01 I know that I will always have logging every minute, many lines per minute. That's why the

Code:

 grep -E " 15:|16:1|16:2"

works for me.

And the problem with grep alone is that some files are so big that they have to be kept in a tar, and grep can't read those (or I don't know how, but less does the job).
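
One possible way around that, as a sketch only: let tar write the archived files to stdout and pipe that into grep, so nothing has to be unpacked to disk. The function name grep_tar and the archive name in the usage line are made up, and this assumes a plain, uncompressed tar:

```shell
# Hypothetical helper: grep inside a tar archive without unpacking it to disk.
# $1 = path to the archive, $2 = extended regular expression.
# tar -x with -O writes the extracted file contents to stdout instead of files.
grep_tar() {
    tar -xOf "$1" | grep -E "$2"
}

# Usage (archive name is only an example):
#   grep_tar logs.tar " 15:|16:1|16:2"
```

Since everything arrives on stdin, grep can no longer prefix each line with the file it came from. For a gzip-compressed archive, adding -z to the tar options should do the same thing.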

@grail basically the errors are like the ones I put in the original post, but here are some more lines of errors.
Edit: the errors are in app.log, and
"server1/application/log/app.log:" is added to the output by my find/grep command; the only part that is really log is what comes after that. Still, I used 72 for the skip in the uniq because, as I understood it, uniq works on the results of the find/grep, so the prefix is part of each line and has to be counted.

Code:

server1/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...
server1/application/log/app.log:2010-Mar-22 15:20:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...
server1/application/log/app.log:2010-Mar-22 15:20:21,514 ERROR! unable to retrieve credentials
server1/application/log/app.log:2010-Mar-22 15:23:20,310 ERROR! unable to retrieve credentials
server1/application/log/app.log:2010-Mar-22 16:25:21,428 [ExecutorThread-50@Running_app] ERROR Exception while notifying event ...
server2/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-700@Running_app] ERROR Exception while notifying event ...
server2/application/log/app.log:2010-Mar-22 15:18:21,514 ERROR! unable to retrieve credentials
server2/application/log/app.log:2010-Mar-22 15:20:21,428 [ExecutorThread-50@Running_app] ERROR Exception while notifying event ...
server2/application/log/app.log:2010-Mar-22 15:20:21,514 ERROR! unable to retrieve credentials
server2/application/log/app.log:2010-Mar-22 16:25:21,428 [ExecutorThread-700@Running_app] ERROR Exception while notifying event ...

@dsmyth and berbae
That uniq was pretty close. I changed the number of characters that it should skip for the comparison, but it didn't exactly give me what I wanted.
I used this

Code:

 find */application/*/app.log -type f -print0 | xargs -0 grep -E " 15:|16:1|16:2" | uniq --count --skip-fields=72

but got something like this as a result

Code:

59959 server1/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...

What I would like as a result is something more like this.

Code:

59959 server1/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...
500 server1/application/log/app.log:2010-Mar-22 15:20:21,514 ERROR! unable to retrieve credentials
56600 server2/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-700@Running_app] ERROR Exception while notifying event ...
20 server2/application/log/app.log:2010-Mar-22 15:20:21,514 ERROR! unable to retrieve credentials

Thanks in advance,
Regards

thundervolt · 03-24-2010, 02:03 PM

I think I understand the mistake on this line

Code:

find */application/*/app.log -type f -print0 | xargs -0 grep -E " 15:|16:1|16:2" | uniq --count --skip-fields=72

If I tell it to skip 72 chars, it won't compare the servers and apps, and I need those to be compared. Basically the only part I need skipped is the date/hour, because I want the net results for that timeframe and am not really interested in when each one happened.

thundervolt · 03-24-2010, 05:45 PM

I got this one working. I'm sure it can still be improved, but it works fine at the moment.

Code:

find */application/*/app.log -type f -print0 | xargs -0 grep "ERROR" | grep " 15:5" | sed 's/ [^[:space:]]*//' | sed 's/ [^[:space:]]*//' | sort | uniq --count -w 100

The seds delete the time (and the field right after it), so uniq has no trouble grouping similar log lines together.

What do you guys think of this solution?

Edit: I added a sort and it does exactly what I wanted; the only problem is it's slooooow.
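
If the sort over all those lines is the slow part, one thing that might help (not measured, just a sketch) is to count in a single pass with an awk associative array, so only one entry per distinct message is kept in memory and no sort is needed. The function name summarize_errors is made up. It assumes the input is the grep output from the pipeline above, i.e. "path:DATE TIME [thread] message", and it treats the thread name as noise. It also records the first and last time each message was seen:

```shell
# Hypothetical helper: count distinct error messages per log file in one pass.
# Input lines are assumed to look like:
#   server1/application/log/app.log:2010-Mar-22 15:16:21,428 [thread] ERROR ...
summarize_errors() {
    awk '
    {
        when = $2                           # time of day, e.g. 15:16:21,428
        file = $1
        sub(/:[^:]*$/, "", file)            # strip the trailing ":DATE", keep the path
        msg = $0
        sub(/^[^ ]+ [^ ]+ /, "", msg)       # strip "path:DATE TIME "
        sub(/^\[[^]]*\] /, "", msg)         # strip "[thread] " when present
        key = file ": " msg
        if (!(key in count)) {              # first time this message is seen
            first[key] = when
            order[++n] = key                # remember order of first appearance
        }
        count[key]++
        last[key] = when
    }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            printf "%d %s (first %s, last %s)\n", count[k], k, first[k], last[k]
        }
    }'
}

# Usage, replacing the sed/sort/uniq tail of the pipeline:
#   find */application/*/app.log -type f -print0 | xargs -0 grep "ERROR" | grep " 15:5" | summarize_errors
```

The output is in order of first appearance rather than sorted, and messages that differ in any other detail (an ID inside the text, say) still count as different.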

grail · 03-24-2010, 09:49 PM

Hey thundervolt

I was just wondering a few things:

1. Does this log only contain errors?
2. Based on the information you have entered, it appears that the first line required each time contains ",428" and the second line you would like to retrieve always has ",514". Is this the case, or just in the examples you have given?
3. Would it be possible for you to attach maybe 100 or so lines from the log in a file to this thread? (It may help us give you better answers.)

thundervolt · 03-24-2010, 10:11 PM

Hi grail,
Well, no, it doesn't always contain those numbers; it was just an example I was giving.
And no, it doesn't purely contain errors, but I was already grepping only the errors, so I had no problems there. I posted my final solution above your post (the actual implementation has a few more tweaks and seds to filter out info), but that's the solution I found.

grail · 03-25-2010, 03:32 AM

Based on you saying you wanted something more like:

Quote:

59959 server1/application/log/app.log:2010-Mar-22 15:16:21,428 [ExecutorThread-18@Running_app] ERROR Exception while notifying event ...
500 server1/application/log/app.log:2010-Mar-22 15:20:21,514 ERROR! unable to retrieve credentials

How about:

Code:

find -name app.log -exec awk 'BEGIN{f=0;g=0}$0 ~ /15:2.*\[/{k=$0;f=0;g=1}f && $0 ~ /ERROR!/{print k"\n"$0;g=0;f=0}g{f=1}' {} \;

This yielded your results requested above based on the small amount of input data provided.
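
For anyone trying to follow the one-liner, here is a spelled-out sketch of the same idea that reads from stdin. As far as I can tell it behaves the same way, with the two flags f and g folded into one. The function name pair_errors and the variable names are made up for illustration:

```shell
# Hypothetical spelled-out version of the awk logic above.
# Remembers the most recent bracketed (thread) line in the 15:2x range and,
# when an "ERROR!" line follows, prints that remembered line and the ERROR! line.
pair_errors() {
    awk '
    # A 15:2x line with a "[thread]" part: remember it and move on.
    /15:2.*\[/ { saved = $0; have = 1; next }
    # An "ERROR!" line while something is remembered: print the pair once.
    have && /ERROR!/ { print saved; print; have = 0 }
    '
}

# Usage (file name is only an example):
#   pair_errors < app.log
```

Only the first ERROR! line after each remembered line is printed; further ERROR! lines are ignored until another bracketed 15:2x line turns up.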