LinuxQuestions.org

arifd86 01-27-2021 01:54 AM

When both sed and awk will do, which one to choose?
 
My inclination is that sed is the lighter-weight of the two and thus should be used.

But my sed command uses a wildcard, whereas the awk one doesn't need to. (However, I don't know whether behind the scenes they're both doing regex matching, in which case the search pattern itself would be equally fast.)

Here's what I'm doing (I want to return only the number):
Code:

echo '    index: 56' | sed 's/.* index: //'
Code:

echo '    index: 56' | awk '/index:/{print $2}'
Edit: prepending each command with time, awk seems to consistently beat sed by 0.001s.

Turbocapitalist 01-27-2021 02:17 AM

I was just going to mention the speed and CPU load aspect. Most of the time the two are about the same, but your pattern is more problematic in sed this time. Here are the times for more repetitions on a very slow piece of hardware:

Code:

$ time for i in $(seq 1 10000); do echo '    index: 56' | sed 's/.* index: //';  done

...

real    5m19.441s
user    1m37.236s
sys    2m35.147s

versus

Code:

$ time for i in $(seq 1 10000); do echo '    index: 56' | awk '/index:/{print $2}';  done

...

real    4m14.199s
user    1m5.048s
sys    2m8.869s

versus

Code:

$ time for i in $(seq 1 10000); do echo '    index: 56' | awk '$1=="index:"{print $2}';  done

...

real    4m14.731s
user    1m4.882s
sys    2m8.165s

Edit: the above was with mawk on a Raspberry Pi Zero W. The slow processor amplifies the differences.

Code:

$ realpath $(which awk)
/usr/bin/mawk


syg00 01-27-2021 03:09 AM

Poorly constructed regex causing excessive backtracking is gunna have a cost, no doubt about it. I prefer to search for what I want rather than what I don't want - especially when you can use anchors.
But more generally, if it's field-based data, use awk. KISS.
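
As a sketch of the "search for what you want, with anchors" advice: both helpers below match the `index:` label directly instead of relying on a leading `.*`. The patterns and the `extract_index_*` function names are illustrative assumptions, not something posted in the thread.

```shell
# Hypothetical helpers; the names are for illustration only.

# sed: anchor at line start, allow only whitespace before the label,
# and print only the lines where the substitution actually happened.
extract_index_sed() {
    sed -n 's/^[[:space:]]*index:[[:space:]]*//p'
}

# awk: treat the line as fields and compare the first field exactly.
extract_index_awk() {
    awk '$1 == "index:" { print $2 }'
}

echo '    index: 56' | extract_index_sed    # 56
echo '    index: 56' | extract_index_awk    # 56
```

Lines that do not carry the label produce no output from either helper, which the unanchored `s/.* index: //` does not guarantee.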

arifd86 01-27-2021 03:41 AM

Right, sed feels simpler though, thus I thought I was KISSING ;)

shruggy 01-27-2021 04:01 AM

Depends on awk flavor, actually. mawk, which is the default awk in everything Debian-based, is very fast and generally on par with sed. Heck, it even beats grep -o '\w*$' in this case!

OTOH, gawk (default awk in Fedora-based distros) not so much.

syg00 01-27-2021 05:13 AM

Don't use "*" in regex. Ever.
Wellll - maybe once I found a valid use. It introduces zero-length matches (and backtracking) you really don't want unless you really do need it. But you'd better be able to justify it.

Steps off hobby-horse ...
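
To illustrate the zero-length match point with a small sketch (the sample string is made up): `[0-9]*` is satisfied by the empty string at the very start of the line, so the substitution fires there rather than on the digits.

```shell
# "*" can match nothing at all, so the first (empty) match wins.
star=$(echo 'abc 123' | sed 's/[0-9]*/N/')
echo "$star"    # Nabc 123

# Requiring at least one digit finds the actual number.
plus=$(echo 'abc 123' | sed 's/[0-9]\{1,\}/N/')
echo "$plus"    # abc N
```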

boughtonp 01-27-2021 07:42 AM

Quote:

Originally Posted by arifd86 (Post 6212448)
When both sed and awk will do, which one to choose?

Whichever one you prefer!

If you have performance critical code, use real world application behaviour to profile both options and make your decision that way.
(I'd be surprised if choosing between awk and sed was your biggest issue.)
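
One way to set up such a comparison, sketched under assumptions: the loops earlier in the thread start a fresh process for every line, so much of what they time is likely process start-up rather than pattern matching. Running each tool once over a larger file is closer to how a script would usually call it. The 100000-line sample and the `run_sed`/`run_awk` names are hypothetical.

```shell
# Build a throwaway sample file (hypothetical data, one "index:" line per row).
sample=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100000; i++) printf "    index: %d\n", i }' > "$sample"

# Each function runs its tool exactly once over the whole file.
run_sed() { sed 's/.* index: //' "$1"; }
run_awk() { awk '/index:/ { print $2 }' "$1"; }

# Sanity check first: both approaches should produce identical output.
sed_sum=$(run_sed "$sample" | cksum)
awk_sum=$(run_awk "$sample" | cksum)
[ "$sed_sum" = "$awk_sum" ] && echo "outputs match"

# The sample file is left in place for timing; remove "$sample" when done.
```

With the file in place, something like `time run_sed "$sample" > /dev/null` and the awk equivalent in an interactive bash session gives a per-tool figure. Real data from the actual application would be a better input than this synthetic file.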



For specifically obtaining the number from that string, the simplest matching regex is "\d+", but it can't be expressed that cleanly in many command line tools - compare the readability of:
Code:

grep -oP '\d+'
grep -oE '[0-9]+'
grep -o '[0-9]\{1,\}'
sed 's/[^0-9]\{1,\}//'

Of course, to get just the last field from a string, "awk '{print $NF}'" might be the simplest option, if the undescribed data format allows that.
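
A quick sketch of how those alternatives behave, with a made-up second sample line to show where they diverge. `grep -o` is assumed to be available; it is a common extension rather than POSIX.

```shell
line='    index: 56'
printf '%s\n' "$line" | grep -oE '[0-9]+'          # 56
printf '%s\n' "$line" | sed 's/[^0-9]\{1,\}//'     # 56
printf '%s\n' "$line" | awk '{print $NF}'          # 56

# With more than one number on the line the three are no longer equivalent.
multi='abc 12 def 34'
grep_multi=$(printf '%s\n' "$multi" | grep -oE '[0-9]+')       # 12 and 34, one per line
sed_multi=$(printf '%s\n' "$multi" | sed 's/[^0-9]\{1,\}//')   # 12 def 34
awk_multi=$(printf '%s\n' "$multi" | awk '{print $NF}')        # 34
```

So which one is "simplest" depends on what else can appear on the line: grep returns every run of digits, the sed form only strips the first non-digit run, and awk returns the last field whatever it contains.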


boughtonp 01-27-2021 07:46 AM

Quote:

Originally Posted by shruggy (Post 6212486)
Depends on awk flavor, actually. mawk which is default awk in everything Debian-based...

My Debian uses gawk, which I don't recall ever configuring, and entering "awk" at https://manpages.debian.org/ brings up the gawk manpage.


shruggy 01-27-2021 08:21 AM

@boughtonp. If you installed gawk from the repo, it would become the new default, because it has higher priority in update-alternatives. But what gets installed during installation of Debian is mawk.
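
A small sketch for checking which implementation `awk` actually resolves to on a given machine. The `resolve_cmd` helper name is hypothetical, `readlink -f` is assumed to be available (GNU coreutils and most current systems), and the `update-alternatives` query only applies to Debian-based systems, so it is guarded.

```shell
# Hypothetical helper: follow symlinks from a command name to the real binary.
resolve_cmd() {
    readlink -f "$(command -v "$1")"
}

resolve_cmd awk || echo "awk not found"    # e.g. /usr/bin/mawk or /usr/bin/gawk

# On Debian-based systems, list the registered alternatives and their priorities.
if command -v update-alternatives >/dev/null 2>&1; then
    update-alternatives --display awk || true
fi
```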

arifd86 01-27-2021 09:44 AM

Thanks for all the insight everyone. This has come in especially handy because, for reasons I don't understand, the string is now this:
Code:

echo '  * index: 3'
And I have no idea why
Code:

echo '  * index: 3' | awk '/index:/{print $2}' # doesn't work
# use {print $3} if you want it to work.

but
Code:

echo '  * index: 3' | sed 's/.* index: //' # works
So I'm going to use `awk '{print $NF}'`.
(For those who are curious, it is the output of PulseAudio's `pacmd list-sinks` and `pacmd list-sink-inputs`.)

shruggy 01-27-2021 09:56 AM

With awk '{print $2}' you would get index: because the leading * is now the first field, which makes index: the second and the number the third.
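
To make the field numbering visible, a sketch (the `index_value` helper is a hypothetical name): awk splits on runs of whitespace by default, so the leading `*` becomes `$1` and everything else shifts by one. Looking up the field that follows `index:` works for both forms of the line.

```shell
# Show how awk numbers the fields of the new line.
echo '  * index: 3' | awk '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }'
# $1 = *
# $2 = index:
# $3 = 3

# Hypothetical helper: print whatever field follows the "index:" label.
index_value() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "index:") print $(i + 1) }'
}

echo '  * index: 3'  | index_value    # 3
echo '    index: 56' | index_value    # 56
```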

hish2021 01-28-2021 08:19 AM

Quote:

Originally Posted by boughtonp (Post 6212563)
My Debian uses gawk, which I don't recall ever configuring, and entering "awk" at https://manpages.debian.org/ brings up the gawk manpage.

Interesting! When I enter "awk" in https://manpages.debian.org/, I'm taken to https://manpages.debian.org/buster/o.../awk.1.en.html.

Anyway, you probably installed an application that pulled in gawk as a dependency. Looking through `apt rdepends gawk | grep Depends` might offer a clue.

boughtonp 01-28-2021 10:06 AM

Quote:

Originally Posted by hish2021 (Post 6212999)

Heh, weird - it's now doing that for me too, but it definitely went to gawk's page before.

Previously I didn't go direct, so maybe the route I took somehow set a cookie (but if so I can't replicate it now, nor see why Debian would do something like that).


Quote:

Anyway, you probably installed an application that pulled in gawk as a dependency. Looking through `apt rdepends gawk | grep Depends` might offer a clue.
There's nothing in the list it outputs that I've installed myself, but I guess there could be secondary or tertiary dependencies, and I don't feel like checking them all.

I also wouldn't expect installing a dependency to change a default like this, but not too bothered by it.


hish2021 01-28-2021 06:32 PM

Quote:

Originally Posted by boughtonp (Post 6213033)
...
There's nothing in the list it outputs that I've installed myself, but I guess there could be secondary or tertiary dependencies, and I don't feel like checking them all...

I've modified my logrotates to keep **all** my dpkg & apt logs. So, for me, code like this helps dig out when I installed something:
Code:

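#!/bin/bash
# Search the dpkg logs (including rotated, compressed ones) for when a package was installed.

echo "enter the package name;"; echo "use .* as prefix/suffix if the exact package name is not known"
read -r the_string
[ "$the_string" ] || { echo "You forgot the search string!" ; exit 1 ; }
# Split the file-name prefix from the entry, sort newest first by date and time, and align columns.
zgrep -E "status (not-)?installed $the_string:" /var/log/dpkg.log* | sed 's/:/: /' | sort -k2,3 -r | column -t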



All times are GMT -5.