LinuxQuestions.org > Forums > Linux Forums > Linux - General
Old 08-11-2010, 11:05 AM   #1
logicalfuzz
Member
 
Registered: Aug 2005
Distribution: Arch Linux
Posts: 291

Rep: Reputation: 39
the cut delimiter dilemma


Here's a sample of the log that I am trying to parse:
Code:
Aug 10 12:02:59 alpha beta gamma
Now I want to extract the field 'alpha' here, for which I use
Code:
cut -d" " -f4
This works like a charm...

However, there's a small bug: if the date is, say, Aug 9, the line looks something like this:
Code:
Aug  9 12:02:59 alpha beta gamma
So the above cut command in this case returns "12:02:59"! I figure this is because there are two whitespace characters before the second field, '9'.

Is there a better way of defining the delimiter in the cut command, which would let me use this in a script without any such bugs? If not, are there any other tools that I can use which will take care of this bug? (maybe perl, sed?)
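Since the question mentions sed, one possible sed sketch that tolerates runs of spaces is shown below. The regex is just one way to write it, and is not from the thread; `-E` enables extended regexes and may be `-r` on older GNU sed versions.

```shell
# match three space-separated fields (allowing runs of spaces), capture the 4th
echo "Aug  9 12:02:59 alpha beta gamma" |
  sed -E 's/^[^ ]+ +[^ ]+ +[^ ]+ +([^ ]+).*/\1/'
```

This prints alpha for both the one-space and two-space variants of the date.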
 
Old 08-11-2010, 11:16 AM   #2
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Assuming you want the "alpha" to be returned:
Code:
echo "Aug  9 12:02:59 alpha beta gamma" | awk '{print $4}'
 
1 member found this post helpful.
Old 08-11-2010, 12:05 PM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
Yes, awk is the best tool here, since it's specifically designed for manipulating fields.

But another option would be to filter the string through tr and condense multiple spaces into one (or replace them with a different separator).

Code:
echo "Aug  9 12:02:59 alpha beta gamma" | tr -s " " | cut -d " " -f 4
 
Old 08-12-2010, 05:29 AM   #4
logicalfuzz
Member
 
Registered: Aug 2005
Distribution: Arch Linux
Posts: 291

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by GrapefruiTgirl View Post
Assuming you want the "alpha" to be returned:
Code:
echo "Aug  9 12:02:59 alpha beta gamma" | awk '{print $4}'
Thanks GrapefruiTgirl! Works like a charm.
 
Old 08-12-2010, 05:33 AM   #5
logicalfuzz
Member
 
Registered: Aug 2005
Distribution: Arch Linux
Posts: 291

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by David the H. View Post
Yes, awk is the best tool here, since it's specifically designed for manipulating fields.

But another option would be to filter the string through tr and condense multiple spaces into one (or replace them with a different separator).

Code:
echo "Aug  9 12:02:59 alpha beta gamma" | tr -s " " | cut -d " " -f 4
Thanks David. I would refrain from using tr in the pipe, because the file I am trying to grep is at least 27-30 GB in size. I am guessing more pipes would mean more time processing it, no?
 
Old 08-12-2010, 06:24 AM   #6
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Yes, more pipes and more tools in the pipeline mean more processing and more time.

Note too that awk can open files on its own. Say you have a gigantic file containing line after line of data similar to your example; you could use:
Code:
awk '{print $4}' filename
and awk will print the 4th field from every line of filename
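The same applies to the tr/cut pipeline: redirecting the file in avoids an extra echo or cat process. A sketch, with filename as a placeholder for the real log path:

```shell
# read the log directly; tr -s squeezes runs of spaces so field 4 stays stable
tr -s ' ' < filename | cut -d ' ' -f 4
```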

Have fun!
 
Old 08-12-2010, 12:15 PM   #7
Peufelon
Member
 
Registered: Jul 2005
Posts: 164
Blog Entries: 1

Rep: Reputation: Disabled
Results of a minor "scientific study" show...

Quote:
Originally Posted by GrapefruiTgirl View Post
Yes, more pipes, and more tools in the pipeline, means more processing and more time.
But that effect could be dominated by the fact that awk is bigger and slower than tr, I thought. So I experimented, and you appear to be correct: at least on my system, in one trial, with a toy problem,
Code:
time echo "Aug  9 12:02:59 alpha beta gamma" | tr -s " " | cut -d " " -f 4
alpha

real    0m0.010s
user    0m0.004s
sys     0m0.004s
is slightly slower than
Code:
time echo "Aug  9 12:02:59 alpha beta gamma" | awk '{print $4}'
alpha

real    0m0.008s
user    0m0.000s
sys     0m0.004s
I guess that in a real-world application processing some really big log files, it might become clearer which is fastest.
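One way to get closer to that real-world case is to time both pipelines on a large generated file. A rough sketch (the line count and the /tmp path are arbitrary choices, not from the thread):

```shell
# build a throwaway file of repeated sample lines
yes 'Aug  9 12:02:59 alpha beta gamma' | head -n 1000000 > /tmp/big.log

# time each pipeline on identical input; discard output so only processing is measured
time tr -s ' ' < /tmp/big.log | cut -d ' ' -f 4 > /dev/null
time awk '{print $4}' /tmp/big.log > /dev/null
```

Both pipelines should print the same fields; only the timings differ.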
 
Old 08-12-2010, 12:39 PM   #8
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
I did a similar "scientific" study some time ago, wherein I made a loop of 10,000 iterations and compared the speed of awk vs. sed vs. tr vs. cut at grabbing or producing a particular chunk of a string. I found that for a comparable task, `tr` and `cut` were pretty close to each other in speed and were the fastest individual tools; sed was next fastest at the same job, and awk was the slowest of all.

Where possible, doing some jobs with the shell alone may be faster than all of the above methods, since there are no pipelines and no external processes; but some jobs can be cumbersome to construct in shell code alone. I didn't include a shell comparison in my tests.
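For the record, a pure-shell sketch of this particular field extraction (the variable names and filename are placeholders): read splits each line on runs of IFS whitespace, so the doubled space before a one-digit day is handled automatically.

```shell
# no external processes: read collapses runs of IFS whitespace between fields
while read -r month day clock field4 rest; do
  printf '%s\n' "$field4"
done < filename
```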

Over the 10,000 iterations in my test, the speed advantage of any one method was not enough to warrant selecting one tool over the others, in my opinion; but I didn't observe CPU load, and didn't account for the overall binary size of the tools, so maybe those would be factors to consider in some environments (embedded, laptop, etc.).

Purely scientific study
 
Old 08-12-2010, 01:04 PM   #9
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
Well, I was only pointing out an alternate route, not recommending it. And even if this isn't a good use for it, you might find it useful down the line.

Sure, awk/gawk may be "bigger" than tr, but it's a tried-and-true, highly optimized application. Actually, it's not even a "program" per se; it's an interpreter for the awk scripting language, so its performance depends mostly on the nature of the script you feed into it. Since you're not giving it much to do here, it should be very efficient.

So yeah, the thing causing the most slowdown is surely the opening up of new pipes and subprocesses.

In fact, to really slim things down, why don't we try eliminating external processes altogether?

Code:
$ time { arr=( Aug  9 12:02:59 alpha beta gamma ) && echo ${arr[3]} ; }
alpha

real    0m0.000s
user    0m0.000s
sys     0m0.000s
Bash arrays fer teh win!!

Last edited by David the H.; 08-12-2010 at 01:35 PM. Reason: D'oh! Replaced (..) with {..}, since the first creates a subshell too.
 
Old 08-12-2010, 02:05 PM   #10
Peufelon
Member
 
Registered: Jul 2005
Posts: 164
Blog Entries: 1

Rep: Reputation: Disabled
Quote:
Originally Posted by David the H. View Post
Sure, awk/gawk may be "bigger" than tr, but it's a tried-and-true, highly optimized application. Actually, it's not even a "program" per se; it's an interpreter for the awk scripting language, so its performance depends mostly on the nature of the script you feed into it. Since you're not giving it much to do here, it should be very efficient.

So yeah, the thing causing the most slowdown is surely the opening up of new pipes and subprocesses.
OK, that makes sense. And thanks for the arrays tip. And GrapefruiTgirl, your study is obviously much more meaningful than my sample of one.
 
  

