LinuxQuestions.org


elinenbe 04-21-2009 11:51 AM

grep'ing and sed'ing chunks in bash... need help on speeding up a log parser.
 
I have a log file that is 20 to 80+ MB in size.

It logs one of our processes, which is multi-threaded, so the log file is kind of a mess.

Each line has the format "DATE TIME - THREAD ID - Details", and a new log file is created for each day. Here's an example:
Quote:

20090409 000122 - BD0 - Order 123 starting session
20090409 000122 - BD0 - Processing 1
20090409 000122 - BD0 - More Processing
20090409 000123 - EF0 - Order 234 starting session
20090409 000124 - EF0 - Processing
20090409 000124 - BD0 - Processing 2
20090409 000125 - BD0 - More Processing
20090409 000125 - EF0 - Processing
20090409 000125 - DD1 - Cancel 345 starting session
20090409 000125 - DD1 - Processing
20090409 000126 - DD1 - Processing 2
20090409 000126 - BD0 - Order 123 shutting down
20090409 000127 - 11F - Query 543 starting session
20090409 000127 - 11F - Processing
..
..
20090409 000135 - 11F - Query 543 shutting down
..
20090409 000140 - EF0 - Order 234 shutting down
..
..
..
20090409 000143 - DD1 - Cancel 345 shutting down
Now, here's where it gets to be a pain... I need to pull out the lines from "starting session" to "shutting down" for each thread ID, and dump these to separate files. HOWEVER, a thread ID CAN be reused over the course of a day -- though usually not again for many hours.

A session can last from 30 seconds to 4 minutes or so (~1200 lines) in the logfile, and there can be up to 20 concurrent sessions.

Now, I have something that works -- although quite slowly. I end up grepping and sedding the file over and over. When the file gets large, it takes a MASSIVE amount of time. I am hoping that someone here can help me optimize this. If possible, I'd like to use bash.

Thanks,
Eric

Here is the code I have that works, but is _slow_

Code:

        if [[ -e "$log_file" ]]
        then
                echo "parsing: $log_file"
                # find every "starting session" line, then pull that session back out of the log
                grep "starting session" "$log_file" | while read -r line
                do
                        thread=$(echo "$line" | cut -d' ' -f4)
                        sessiontype=$(echo "$line" | cut -d' ' -f6)
                        sessionnumber=$(echo "$line" | cut -d' ' -f7)

                        echo "  first line of session: ${line:0:25}..."
                        # build the matching "shutting down" line for this session
                        line2="- $thread - $sessiontype $sessionnumber shutting down"
                        echo "  last line of session: ${line2:0:25}..."
                        # re-scan the whole log for just this one session (this is the slow part)
                        sed -n "/$line/,/$line2/p" "$log_file" | grep " - $thread - " > "session.$thread.$sessiontype.$sessionnumber"
                done
        ....

This gives me a number of files; using the example log above, they would be created as shown below:
Quote:

file: session.BD0.Order.123
20090409 000122 - BD0 - Order 123 starting session
20090409 000122 - BD0 - Processing 1
20090409 000122 - BD0 - More Processing
20090409 000124 - BD0 - Processing 2
20090409 000125 - BD0 - More Processing
20090409 000126 - BD0 - Order 123 shutting down

file: session.DD1.Cancel.345
20090409 000125 - DD1 - Cancel 345 starting session
20090409 000125 - DD1 - Processing
20090409 000126 - DD1 - Processing 2
..
..
..
20090409 000143 - DD1 - Cancel 345 shutting down

file: session.11F.Query.543
20090409 000127 - 11F - Query 543 starting session
20090409 000127 - 11F - Processing
..
..
20090409 000135 - 11F - Query 543 shutting down

file: session.EF0.Order.234
20090409 000123 - EF0 - Order 234 starting session
20090409 000124 - EF0 - Processing
20090409 000125 - EF0 - Processing
20090409 000140 - EF0 - Order 234 shutting down

rriggs 04-21-2009 01:03 PM

You are using the wrong tools. This can be done much more efficiently with a real programming language such as Python or Ruby.

H_TeXMeX_H 04-21-2009 03:28 PM

I would try using awk for this, if you don't know perl ;)

For awk see the tutorial here:
http://www.grymoire.com/Unix/Awk.html

I would write the script for you, but I'm not the best awk programmer, and I don't have the time.

Either way, parsing a large file the way you are doing it -- re-reading the whole log once per session -- is bound to be very slow.
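
For what it's worth, a rough single-pass sketch of that idea in awk might look like this (untested -- it assumes the thread ID is always field 4 and the session type and number are fields 6 and 7, as in your sample, and it writes the same session.THREAD.TYPE.NUMBER files your script does):
Code:

awk '
/starting session/ {
        # field 4 is the thread ID, field 6 the type, field 7 the number
        file[$4] = "session." $4 "." $6 "." $7
}
$4 in file {
        # the line belongs to an active session: append it to that session file
        print > file[$4]
}
/shutting down/ && ($4 in file) {
        # session finished: close the output file and forget this thread ID
        close(file[$4])
        delete file[$4]
}' "$log_file"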

ntubski 04-21-2009 03:51 PM

Is this fast enough?
Code:

# group each thread ID's lines together (the stable sort keeps them in time order),
# then cut a new piece at every "starting session" line
sort --stable --field-separator=- --key=2,2 "$log_file" | \
    csplit --quiet --prefix=session. --elide-empty-files - '/starting session$/' '{*}'

# rename each piece after the thread ID, type and number found on its first line
for s in session.* ; do
    PRETTY_NAME=$(sed -n '1s/^.*- \([[:alnum:]]\+\) - \([[:alnum:]]\+\) \([[:digit:]]\+\).*$/session.\1.\2.\3/p' "$s")
    mv "$s" "$PRETTY_NAME"
done


elinenbe 04-22-2009 11:17 AM

ntubski,

Wow. That's fast -- thanks for all the help!

