LinuxQuestions.org
Programming forum
Old 04-21-2009, 10:51 AM   #1
elinenbe
LQ Newbie
 
Registered: Oct 2007
Posts: 23

Rep: Reputation: 15
grep'ing and sed'ing chunks in bash... need help speeding up a log parser


I have a log file that is 20-80+ MB in size. It records one of our processes, and that process is multi-threaded, so lines from concurrent sessions are interleaved and the file is kind of a mess. Here's an example.

Each line looks like "DATE TIME - THREAD ID - Details", and a new file is created for each day:
Quote:
20090409 000122 - BD0 - Order 123 starting session
20090409 000122 - BD0 - Processing 1
20090409 000122 - BD0 - More Processing
20090409 000123 - EF0 - Order 234 starting session
20090409 000124 - EF0 - Processing
20090409 000124 - BD0 - Processing 2
20090409 000125 - BD0 - More Processing
20090409 000125 - EF0 - Processing
20090409 000125 - DD1 - Cancel 345 starting session
20090409 000125 - DD1 - Processing
20090409 000126 - DD1 - Processing 2
20090409 000126 - BD0 - Order 123 shutting down
20090409 000127 - 11F - Query 543 starting session
20090409 000127 - 11F - Processing
..
..
20090409 000135 - 11F - Query 543 shutting down
..
20090409 000140 - EF0 - Order 234 shutting down
..
..
..
20090409 000143 - DD1 - Cancel 345 shutting down
Now, here's where it gets to be a pain... I need to pull out the lines from "starting session" to "shutting down" for each thread ID, and dump these to separate files. HOWEVER, a thread ID CAN be reused over the course of a day -- though usually not for many hours.

A session can last from 30 seconds to 4 minutes or so (~1200 lines) in the logfile, and there can be up to 20 concurrent sessions.

Now, I have something that works, although quite slowly: I end up grepping and sedding the file over and over. When the file gets large, it takes a MASSIVE amount of time. I'm hoping someone here can help me optimize this. If possible, I'd like to stick with bash.

Thanks,
Eric

Here is the code I have; it works, but it is _slow_:

Code:
	if [[ -e "$log_file" ]]
	then
		echo "parsing: $log_file"
		grep "starting session" "$log_file" | while read -r line
		do
			thread=$(echo "$line" | cut -d' ' -f4)
			sessiontype=$(echo "$line" | cut -d' ' -f6)
			sessionnumber=$(echo "$line" | cut -d' ' -f7)

			echo "  first line of session: ${line:0:25}..."
			line2="- $thread - $sessiontype $sessionnumber shutting down"
			echo "  last line of session: ${line2:0:25}..."
			sed -n "/$line/,/$line2/p" "$log_file" | grep " - $thread - " > "session.$thread.$sessiontype.$sessionnumber"
		done
	....
This gives me a number of files; using the example log above, they would be created as shown below:
Quote:
file: session.BD0.Order.123
20090409 000122 - BD0 - Order 123 starting session
20090409 000122 - BD0 - Processing 1
20090409 000122 - BD0 - More Processing
20090409 000124 - BD0 - Processing 2
20090409 000125 - BD0 - More Processing
20090409 000126 - BD0 - Order 123 shutting down

file: session.DD1.Cancel.345
20090409 000125 - DD1 - Cancel 345 starting session
20090409 000125 - DD1 - Processing
20090409 000126 - DD1 - Processing 2
..
..
..
20090409 000143 - DD1 - Cancel 345 shutting down

file: session.11F.Query.543
20090409 000127 - 11F - Query 543 starting session
20090409 000127 - 11F - Processing
..
..
20090409 000135 - 11F - Query 543 shutting down

file: session.EF0.Order.234
20090409 000123 - EF0 - Order 234 starting session
20090409 000124 - EF0 - Processing
20090409 000125 - EF0 - Processing
20090409 000140 - EF0 - Order 234 shutting down
 
Old 04-21-2009, 12:03 PM   #2
rriggs
Member
 
Registered: Mar 2009
Location: Colorado, US
Distribution: Fedora 13, Fedora 14, RHEL6 Beta
Posts: 46

Rep: Reputation: 17
You are using the wrong tools. This can be done much more efficiently with a real programming language such as Python or Ruby.
 
Old 04-21-2009, 02:28 PM   #3
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
I would try using awk for this if you don't know Perl.

For awk see the tutorial here:
http://www.grymoire.com/Unix/Awk.html

I would write the script for you, but I'm not the best awk programmer, and I don't have the time.

Either way, parsing large files using your method is bound to be very slow.
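A single-pass awk version along those lines might look like this. This is only a sketch, not the author's method: it assumes each thread ID has at most one session open at a time, that the "Details" field never contains a " - " separator of its own, and the `split_sessions` wrapper name is purely illustrative.

```shell
# Sketch: one pass over the log, writing each session to its own file.
# Assumes a thread ID has at most one open session at a time and that
# the details field contains no " - " separator of its own.
split_sessions() {
    awk -F' - ' '
        $3 ~ /starting session$/ {
            split($3, a, " ")                          # a[1]=type, a[2]=number
            out[$2] = "session." $2 "." a[1] "." a[2]  # e.g. session.BD0.Order.123
        }
        $2 in out {
            print >> out[$2]
            if ($3 ~ /shutting down$/) {               # session ended:
                close(out[$2])                         # flush the file and
                delete out[$2]                         # forget this thread
            }
        }
    ' "$1"
}
```

Because the file is read only once, the runtime stays roughly linear in the log size, instead of one grep-plus-sed pass over the whole file per session.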
 
Old 04-21-2009, 02:51 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,396

Rep: Reputation: 814
Is this fast enough?
Code:
sort --stable --field-separator=- --key=2,2 "$log_file" | \
    csplit --quiet --prefix=session. --elide-empty-files - '/starting session$/' '{*}'

for s in session.* ; do
    PRETTY_NAME=$(sed -n '1s/^.*- \([[:alnum:]]\+\) - \([[:alnum:]]\+\) \([[:digit:]]\+\).*$/session.\1.\2.\3/p' "$s")
    mv "$s" "$PRETTY_NAME"
done
 
Old 04-22-2009, 10:17 AM   #5
elinenbe
LQ Newbie
 
Registered: Oct 2007
Posts: 23

Original Poster
Rep: Reputation: 15
ntubski,

Wow. That's fast -- thanks for all the help!
 
  

