Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
You may want to reconsider the thread title: SED syntax question
Great solutions to the problem however!
Yes, the thread veered off course but that's okay. Each member contributed new ideas.
This thread is not marked SOLVED because we might still get back to sed syntax.
I don't understand how to use (. ){4} to indicate "repeat four times." Not knowing the proper name of this language construct, I couldn't do an effective search. Please advise.
It's more accurate to say that using multiple tools has an overhead of transferring data between them, which can reduce performance. Two fast tools with an overhead may still end up faster than a single slow tool.
However, I'm intrigued that the cut+paste solution appears so much faster. If Shruggy were still around, he would already have pointed out that mawk is faster than gawk.
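The cut+paste solution itself isn't quoted in this excerpt; assuming the same fixed-width input as the benchmark further down, it presumably looked something like this (GNU cut, since --output-delimiter is used):

```shell
# Pick out characters 1, 3, 5 and 7 with a space between them, then
# glue the original line back on.  cut works on character positions,
# so no regex engine is involved at all.
echo abcdefgh > input.txt
cut --output-delimiter=' ' -c1,3,5,7 input.txt | paste -d' ' - input.txt
rm -f input.txt
```

Two small, specialized tools in a pipeline like this can beat a single heavier tool despite the cost of moving data between them.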
Quote:
Do not try to measure performance on a few lines (you will not be able to produce interpretable results), but on millions of lines of data.
Or put another way: non-real testing does not predict real-world performance.
As a learning exercise, there's no "real world" target here, but it's useful to remember a few things...
* With small amounts of data/iterations, random fluctuations can dominate the timings, and the overheads that affect how code scales are not revealed.
* There's a difference between one file with a million lines, a million files with one line, and a thousand files with a thousand lines.
* Many systems are complex and simply looping lots of times often does not reflect what a live system will be doing.
* Sometimes the right choice can be to reduce performance of a single task in order to improve overall system performance.
When performance matters, use profiling tools on a controlled replica of the live environment, with accurate data, behaviours, and measurements - then one gets a better idea of the ideal areas to focus optimization efforts. Of course, writing that is easier than doing it - especially if unknown/unexpected usage peaks mean that what one thinks is accurate user behaviour doesn't actually apply at the critical times, or affects resources one isn't monitoring.
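A rough sketch of the scaling point above (illustrative only, not real profiling): run the same command over growing inputs and watch whether the runtime grows linearly or worse.

```shell
# Time the same sed command at increasing input sizes.  If the runtime
# grows faster than the line count, there is a scaling overhead that a
# tiny test would never have revealed.
for n in 1000 10000 100000; do
    yes abcdefgh | head -n "$n" > input.txt
    echo "== $n lines =="
    time sed -r 's/(.)./\1 /g' input.txt > /dev/null
done
rm -f input.txt
```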
Anyhow, this is probably veering too far off-track on a thread that started about readability, so I'll stop there. :)
Quote:
I don't understand how to use (. ){4} to indicate "repeat four times." Not knowing the proper name of this language construct, I couldn't do an effective search. Please advise.
The {4} is called a quantifier.
(I use the term numeric quantifier, to differentiate from the shorthand quantifiers (? * +), but I'm not sure how widespread that usage is.)
The issue here is that quantifying a capturing group doesn't give you more groups, it only changes what is stored in the single group.
I've not encountered a regex implementation that has any mechanism for capturing multiple groups without explicitly defining the separate groups.
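Since a quantified group only keeps its last repetition, the usual workaround is to write the groups out explicitly. For the four-character case in the original question, that might look like:

```shell
# (.){4} would leave only the 4th character in \1; four explicit
# groups keep all four characters separately.
echo abcdefgh | sed -r 's/^(.).(.).(.).(.)./\1 \2 \3 \4/'
# a c e g
```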
Interesting question. I had to test this out.
Yes, the quantifier {x,y} tries to match the preceding character or group as often as possible (greedy) between x and y times.
And, as expected, the backreference to a group yields the last repetition that was matched.
The following demonstrates it:
Code:
echo abcdefgh | sed -r 's/(..)*/\1/'
gh
First repetition then reference to the last match.
Last edited by MadeInGermany; 03-03-2023 at 09:03 AM.
You can definitely try it. Using regexp is much slower than ${var:x:y}, so I'm not really sure about that.
Okay, I'll try not to gloat too much, but I believe I am entitled to an "I told you so".
Code:
$ cat bench.bash
#!/bin/bash
yes abcdefgh | head -n 250000 > input.txt
echo time "sed -r 's/(.)./\1 /g' | paste ..."
time sed -r 's/(.)./\1 /g' input.txt | paste -d' ' - input.txt >/dev/null
echo time while read -r line ...
time while read -r line;
do
echo "${line:0:1} ${line:2:1} ${line:4:1} ${line:6:1} ${line}"
done <input.txt >/dev/null
$ ./bench.bash
time sed -r 's/(.)./\1 /g' | paste ...
real 0m0.504s
user 0m0.478s
sys 0m0.020s
time while read -r line ...
real 0m7.582s
user 0m6.344s
sys 0m1.190s
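Since mawk vs. gawk came up earlier, an awk version would be another single-tool candidate worth timing against the same input (a hypothetical variant, not from the thread); substr() is plain character indexing, so no regex engine is involved:

```shell
# awk equivalent of the benchmark's transformation: print characters
# 1, 3, 5, 7 and then the whole original line, space-separated.
yes abcdefgh | head -n 250000 > input.txt
time awk '{ print substr($0,1,1), substr($0,3,1), substr($0,5,1), substr($0,7,1), $0 }' input.txt > /dev/null
rm -f input.txt
```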
I wrote a few lines about it, but dropped them. So here they are again:
1. regex is in general slow; solving an issue without regex will probably be faster (obviously, only if possible).
2. forking new processes is extremely expensive (compared to a few lines of code using built-in functions).
3. execution time depends on the amount of data, the type of data (length, whatever), and the complexity of the regexp, so there is no general rule here.
4. everything also depends on the available resources and system load.
5. not to mention caching - whether the tools (files) we use are already cached or not (or whether they are located on a remote host...).
6. finally, one single measurement is not enough to say anything.
But anyway, you are probably right: bash itself is terribly slow for that kind of operation.