Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
You may want to reconsider the thread title: SED syntax question
Great solutions to the problem however!
Yes, the thread veered off course but that's okay. Each member contributed new ideas.
This thread is not marked SOLVED because we might still get back to sed syntax.
I don't understand how to use (. ){4} to indicate "repeat four times." Not knowing the proper name of this language construct, I couldn't do an effective search. Please advise.
It's more accurate to say that using multiple tools has an overhead of transferring data between them, which can reduce performance. Two fast tools with an overhead may still end up faster than a single slow tool.
However, I'm intrigued that the cut+paste solution appears so much faster. If Shruggy were still around, he would already have pointed out that mawk is faster than gawk.
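The cut+paste solution itself isn't quoted in this excerpt; assuming the same fixed-width input as the benchmark further down, it presumably looked something like this (GNU cut, since --output-delimiter is used):

```shell
# Pick out characters 1, 3, 5 and 7 with a space between them, then
# glue the original line back on.  cut works on character positions,
# so no regex engine is involved at all.
echo abcdefgh > input.txt
cut --output-delimiter=' ' -c1,3,5,7 input.txt | paste -d' ' - input.txt
rm -f input.txt
```

Two small, specialized tools in a pipeline like this can beat a single heavier tool despite the cost of moving data between them.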
Quote:
Do not try to measure performance on a few lines (you will not be able to produce interpretable results), but on millions of lines of data.
Or put another way: non-real testing does not predict real-world performance.
As a learning exercise, there's no "real world" target here, but it's useful to remember a few things...
* With small amounts of data/iterations, random fluctuations can dominate the timings, and the overheads that affect how code scales are not revealed.
* There's a difference between one file with a million lines, a million files with one line, and a thousand files with a thousand lines.
* Many systems are complex and simply looping lots of times often does not reflect what a live system will be doing.
* Sometimes the right choice can be to reduce performance of a single task in order to improve overall system performance.
When performance matters, use profiling tools on a controlled replica of the live environment, with accurate data, behaviours, and measurements - then one gets a better idea of the ideal areas to focus optimization efforts. Of course, writing that is easier than doing it - especially if unknown/unexpected usage peaks mean that what one thinks is accurate user behaviour doesn't actually apply at the critical times, or affects resources one isn't monitoring.
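A rough sketch of the scaling point above (illustrative only, not real profiling): run the same command over growing inputs and watch whether the runtime grows linearly or worse.

```shell
# Time the same sed command at increasing input sizes.  If the runtime
# grows faster than the line count, there is a scaling overhead that a
# tiny test would never have revealed.
for n in 1000 10000 100000; do
    yes abcdefgh | head -n "$n" > input.txt
    echo "== $n lines =="
    time sed -r 's/(.)./\1 /g' input.txt > /dev/null
done
rm -f input.txt
```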
Anyhow, this is probably veering too far off-track on a thread that started about readability, so I'll stop there. :)
Quote:
I don't understand how to use (. ){4} to indicate "repeat four times." Not knowing the proper name of this language construct, I couldn't do an effective search. Please advise.
The {4} is called a quantifier.
(I use the term numeric quantifier, to differentiate from the shorthand quantifiers (? * +), but I'm not sure how widespread that usage is.)
The issue here is that quantifying a capturing group doesn't give you more groups, it only changes what is stored in the single group.
I've not encountered a regex implementation that has any mechanism for capturing multiple groups without explicitly defining the separate groups.
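Since a quantified group only keeps its last repetition, the usual workaround is to write the groups out explicitly. For the four-character case in the original question, that might look like:

```shell
# (.){4} would leave only the 4th character in \1; four explicit
# groups keep all four characters separately.
echo abcdefgh | sed -r 's/^(.).(.).(.).(.)./\1 \2 \3 \4/'
# a c e g
```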
Interesting question. I had to test this out.
Yes, the quantifier {x,y} tries to match the preceding character or group as often as possible (greedy) between x and y times.
And, as expected, the backreference to a group yields the last repetition that was matched.
The following demonstrates it:
Code:
echo abcdefgh | sed -r 's/(..)*/\1/'
gh
First repetition then reference to the last match.
Last edited by MadeInGermany; 03-03-2023 at 09:03 AM.
You can definitely try it. Using regexp is much slower than ${var:x:y}, so I'm not really sure about that.
Okay, I'll try not to gloat too much, but I believe I am entitled to an "I told you so".
Code:
$ cat bench.bash
#!/bin/bash
yes abcdefgh | head -n 250000 > input.txt
echo time "sed -r 's/(.)./\1 /g' | paste ..."
time sed -r 's/(.)./\1 /g' input.txt | paste -d' ' - input.txt >/dev/null
echo time while read -r line ...
time while read -r line;
do
echo "${line:0:1} ${line:2:1} ${line:4:1} ${line:6:1} ${line}"
done <input.txt >/dev/null
$ ./bench.bash
time sed -r 's/(.)./\1 /g' | paste ...
real 0m0.504s
user 0m0.478s
sys 0m0.020s
time while read -r line ...
real 0m7.582s
user 0m6.344s
sys 0m1.190s
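Since mawk vs. gawk came up earlier, an awk version would be another single-tool candidate worth timing against the same input (a hypothetical variant, not from the thread); substr() is plain character indexing, so no regex engine is involved:

```shell
# awk equivalent of the benchmark's transformation: print characters
# 1, 3, 5, 7 and then the whole original line, space-separated.
yes abcdefgh | head -n 250000 > input.txt
time awk '{ print substr($0,1,1), substr($0,3,1), substr($0,5,1), substr($0,7,1), $0 }' input.txt > /dev/null
rm -f input.txt
```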
I wrote a few lines about it, but dropped them. So here they are again:
1. regex is in general slow; solving an issue without regex will probably be faster (obviously, only if possible).
2. forking new processes is extremely expensive (compared to a few lines of code using built-in functions).
3. execution time depends on the amount of data, the type of data (length, whatever), and the complexity of the regexp, so there is no general rule here.
4. everything also depends on the available resources and system load.
5. not to mention caching - whether the tools (files) we use are already cached or not (or whether they are located on a remote host...).
6. finally, one single measurement is not enough to say anything.
But anyway, you are probably right: bash itself is terribly slow for that kind of operation.