[SOLVED] SED syntax question

danielbmartin · 02-28-2023, 01:49 PM

This is only a learning exercise.

Given a character string of length 8, produce an output string
which is characters 1, 3, 5, and 7 followed by the input string.

With this InFile...

Code:

abcdefgh
12345678
gxoxoxdx

... the desired OutFile is...

Code:

a c e g abcdefgh
1 3 5 7 12345678
g o o d gxoxoxdx

This gawk works, and is readable.

Code:

gawk -F '' '{print $1,$3,$5,$7,$0}' <$InFile

This sed works, and is readable...

Code:

sed -r 's/(.)(.)(.)(.)(.)(.)(.)(.)/\1 \3 \5 \7 \1\2\3\4\5\6\7\8/' <$InFile

... but it's long and clumsy.

This works, uses fewer keystrokes but is less readable...

Code:

sed -r 's/((.)(.)(.)(.)(.)(.)(.)(.))/\2 \4 \6 \8 \1/' <$InFile

Still fewer keystrokes, but less readable...

Code:

sed -r 's/((.).(.).(.).(.).)/\2 \3 \4 \5 \1/' <$InFile

This doesn't work.

Code:

sed -r 's/(((.).){4})/\2 \3 \4 \5 \1/' <$InFile
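A likely reason this last attempt fails, sketched here with GNU sed on one sample line: a group placed under a repeat count is still only one numbered group, and after the match it holds just the text from the final repetition. The pattern (((.).){4}) therefore defines three groups, not five, so \4 and \5 have nothing to refer to. The variable name `out` is illustrative only.

```shell
# Only three groups exist in this pattern:
#   \1 = the whole 8-character match
#   \2 = the last two-character pair
#   \3 = the first character of that last pair
out=$(printf 'abcdefgh\n' | sed -r 's/(((.).){4})/\3 \2 \1/')
echo "$out"    # g gh abcdefgh
```

So the repeat count shortens the pattern but throws away the first three captures, which is why the fully written-out forms are still needed here.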

Your ideas? Remember, this is just "for funsies" and not an example of good coding practice.

Daniel B. Martin

.

astrogeek · 02-28-2023, 04:56 PM

I came up with this...

Code:

sed -r 'h;s/(.)[^ ]/\1 /g;G;s/[\n\r]+//' <$InFile

But I kind of like this one...

Quote:

Originally Posted by danielbmartin

Still fewer keystrokes, but less readable...

Code:

sed -r 's/((.).(.).(.).(.).)/\2 \3 \4 \5 \1/' <$InFile

... given the original problem specifies eight characters this is a kind of simple elegance. What is not readable about that?

boughtonp · 02-28-2023, 05:19 PM

Here you go:

Code:

$ sed -r 'h; s/(.)./\1 /g; G; s/\n//' InFile
a c e g abcdefgh
1 3 5 7 12345678
g o o d gxoxoxdx

The first command is "h", which "holds" the current line - i.e. stores a copy of it in sed's hold space, much like a variable.
The second replaces each even-positioned character with a space.
The third, "G", appends the held text to the current line - it also adds a newline between the two, so the final command is needed to remove it.
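The steps described above can be watched one at a time. This is a minimal sketch assuming GNU sed; the variable names (`line`, `step2`, `step3`, `step4`) are made up for the illustration.

```shell
line='abcdefgh'
# Substitution alone: every second character becomes a space.
step2=$(printf '%s\n' "$line" | sed -r 's/(.)./\1 /g')
# h, substitution, G: G appends a newline plus the held copy, so two lines come out.
step3=$(printf '%s\n' "$line" | sed -r 'h; s/(.)./\1 /g; G')
# All four commands: deleting the newline joins the two parts.
step4=$(printf '%s\n' "$line" | sed -r 'h; s/(.)./\1 /g; G; s/\n//')
# Brackets make the trailing space and the embedded newline visible.
printf '[%s]\n' "$step2" "$step3" "$step4"
```

The substitution leaves a trailing space after the last kept character, which is what ends up separating "a c e g" from the original string once the newline is gone.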

Heh... guess I spent too long looking to see if Sed had a way to not add the newline - I was just about to post and see Astrogeek has posted almost the same thing.

ntubski · 02-28-2023, 10:11 PM

Quote:

Originally Posted by astrogeek

... given the original problem specifies eight characters this is a kind of simple elegance. What is not readable about that?

I mostly agree, but the outer parens are not needed:

Code:

sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/' <$InFile

Alternatively, if you allow multiple reads of $InFile:

Code:

sed -r 's/(.)./\1 /g' <$InFile | paste -d' ' - $InFile

astrogeek · 02-28-2023, 10:51 PM

The first character class, [^ ], in my expression is obviously unnecessary (what was I thinking?). Should be replaced with a '.'.

The second, [\n\r], reduces to the single character \n if we assume Unix newlines.

Boughtonp's expression does both of those and is much better as a result.

Quote:

Originally Posted by ntubski

I mostly agree, but the outer parens are not needed:

Code:

sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/' <$InFile

Of course!

For the problem as stated, this is difficult to beat for simplicity and readability in my opinion.

danielbmartin · 02-28-2023, 10:59 PM

Quote:

Originally Posted by ntubski

Code:

sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/' <$InFile

Shall we eliminate one more keystroke?!?

Code:

sed -r 's/(.).(.).(.).(.)/\1 \2 \3 \4 &/' <$InFile

Daniel B. Martin

.

boughtonp · 03-01-2023, 09:28 AM

Quote:

Originally Posted by astrogeek

The second, [\n\r], reduces to the single character \n if we assume Unix newlines.

I always assume newlines only, and treat carriage returns as a bug to be removed. :)

Quote:

Originally Posted by danielbmartin

Shall we eliminate one more keystroke?!?

Code:

sed -r 's/(.).(.).(.).(.)/\1 \2 \3 \4 &/' <$InFile

I'd say that ignoring the final character makes it less clear what the intent is, and also less maintainable - not worth it for a single dot.

However, what can be removed is the redirect via stdin, since Sed can read files directly. (Also not sure why it's a variable; and in a real script a filename variable must be double-quoted.)
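A small sketch of both points, assuming GNU sed. The file name with a space in it is hypothetical and chosen only to show why the quotes matter; without them the shell would split the name into two arguments.

```shell
# Hypothetical input file whose name contains a space.
InFile="$(mktemp -d)/in file.txt"
printf 'abcdefgh\n12345678\n' > "$InFile"
# sed opens the file itself (no < redirect); the quotes keep the name in one piece.
out=$(sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/' "$InFile")
echo "$out"
```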

If the shortest number of keystrokes/characters is the goal, the presence of only a single group in the solution astrogeek and I came up with means removing the -r actually results in a one-character shorter command:

Code:

sed -r 'h;s/(.)./\1 /g;G;s/\n//' InFile
sed 'h;s/\(.\)./\1 /g;G;s/\n//' InFile

Unless I've overlooked something in the Sed manual, I suspect if going shorter is possible, it would require a different method/tool.

Here's one such option that doesn't fully match the example OutFile (but does adhere to "produce an output string which is characters 1, 3, 5, and 7 followed by the input string."):

Code:

$ paste <(cut -c1,3,5,7 InFile) InFile
aceg    abcdefgh
1357    12345678
good    gxoxoxdx

MadeInGermany · 03-01-2023, 10:35 AM

Quote:

Originally Posted by boughtonp

Here you go:
...
Heh... guess I spent too long looking to see if Sed had a way to not add the newline ...

The newline is put in between because it can be easily removed OR MODIFIED.
H and N also insert a newline.
Examples where the newline is useful:

Code:

sed -r 'h;s/(.)./\1 /g;G;s/(.*)\n(.*)/\2:\1/'
sed -r 'h;s/(.)./\1 /g;H;x;s/\n/:/'

[\n] only works in a few sed versions; in POSIX sed it matches either \ or n.
GNU sed sticks to POSIX if the environment variable POSIXLY_CORRECT is set.
A plain \n must work after G, H or N.
A plain \n without a prior G, H or N works in some sed versions. (A Unix sed needs an ending \ and a new line.)
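Both examples above can be tried on a single sample line. A minimal sketch, assuming GNU sed; `line`, `out1` and `out2` are illustrative names.

```shell
line='abcdefgh'
# G variant: capture the text on each side of the newline and swap them around a colon.
out1=$(printf '%s\n' "$line" | sed -r 'h;s/(.)./\1 /g;G;s/(.*)\n(.*)/\2:\1/')
# H;x variant: H appends the spaced-out text to the held original, x swaps it back in.
out2=$(printf '%s\n' "$line" | sed -r 'h;s/(.)./\1 /g;H;x;s/\n/:/')
# Brackets make the trailing space visible.
printf '[%s]\n' "$out1" "$out2"
```

Both commands should print the original string first, then a colon, then the spaced-out characters, which shows the newline acting as a handle for reordering rather than as something to be deleted.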

danielbmartin · 03-01-2023, 05:07 PM

From the outset this problem was described as a learning experience. There was a wish for a more concise sed (i.e. fewer keystrokes). No mention was made of execution speed.

This thread certainly has been a learning experience!

An InFile was created with 250,000 lines with one 8-character
string per line. The timing results:

Code:

Method #1 of LQ member danielbmartin.
gawk -F '' '{print $1,$3,$5,$7,$0}'

real    0m0.300s
user    0m0.284s
sys    0m0.012s

Method #2 of LQ member danielbmartin.
sed -r 's/(.)(.)(.)(.)(.)(.)(.)(.)/\1 \3 \5 \7 \1\2\3\4\5\6\7\8/'

real    0m0.362s
user    0m0.340s
sys    0m0.008s

Method #3 of LQ member danielbmartin.
sed -r 's/((.)(.)(.)(.)(.)(.)(.)(.))/\2 \4 \6 \8 \1/'

real    0m0.371s
user    0m0.340s
sys    0m0.012s

Method #4 of LQ member danielbmartin.
sed -r 's/((.).(.).(.).(.).)/\2 \3 \4 \5 \1/'

real    0m0.293s
user    0m0.260s
sys    0m0.012s

Method #1 of LQ Moderator astrogeek.
sed -r 'h;s/(.)[^ ]/\1 /g;G;s/[\n\r]+//'

real    0m1.371s
user    0m1.340s
sys    0m0.016s

Method #1 of LQ Senior Member boughtonp.
sed -r 'h; s/(.)./\1 /g; G; s/\n//'

real    0m0.595s
user    0m0.568s
sys    0m0.008s

Method #2 of LQ Senior Member boughtonp.
sed 'h;s/\(.\)./\1 /g;G;s/\n//'

real    0m0.592s
user    0m0.556s
sys    0m0.016s

Method #3 of LQ Senior Member boughtonp.
paste <(cut -c1,3,5,7 $InFile)

real    0m0.039s
user    0m0.024s
sys    0m0.012s

Method #1 of LQ Senior Member ntubski.
sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/'

real    0m0.273s
user    0m0.256s
sys    0m0.000s

Method #1.1 of LQ Senior Member ntubski.
sed -r 's/(.).(.).(.).(.)/\1 \2 \3 \4 &/'

real    0m0.268s
user    0m0.244s
sys    0m0.008s

Method #2 of LQ Senior Member ntubski.
sed -r 's/(.)./\1 /g' <$InFile | paste -d' '

real    0m0.503s
user    0m0.524s
sys    0m0.000s

Method #1 of LQ Senior Member MadeInGermany.
sed -r 'h;s/(.)./\1 /g;G;s/(.*)\n(.*)/\2:\1/'

real    0m1.631s
user    0m1.604s
sys    0m0.008s

Method #2 of LQ Senior Member MadeInGermany.
sed -r 'h;s/(.)./\1 /g;H;x;s/\n/:/'

real    0m0.602s
user    0m0.576s
sys    0m0.004s

Method #3 of boughtonp was a double winner -- the fastest
and the most concise. That solution took the liberty of
redefining the format of the OutFile but no function was lost.
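For anyone wanting to repeat the measurements, here is one way a comparable test file could be built and a single method timed. The thread does not say how the original InFile was generated, so the zero-padded numbers are purely an assumption; `time` is used here as the bash keyword.

```shell
# Assumption: any 250,000 lines of exactly 8 characters will do;
# zero-padded counters are just an easy way to get them.
InFile=$(mktemp)
awk 'BEGIN { for (i = 0; i < 250000; i++) printf "%08d\n", i }' > "$InFile"
lines=$(wc -l < "$InFile")
echo "lines: $lines"
# Discard the output so terminal speed does not affect the timing.
time sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/' "$InFile" > /dev/null
```

Absolute numbers will differ from machine to machine, so only the relative ordering of the methods is worth comparing.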

Famous saying: "Beauty is in the eye of the beholder."
The same might be said of readability.

Daniel B. Martin

.

MadeInGermany · 03-02-2023, 04:21 AM

A variation of method #3 of boughtonp:

Code:

cut -c1,3,5,7 $InFile | paste -d" " - $InFile

danielbmartin · 03-02-2023, 08:55 PM

Quote:

Originally Posted by MadeInGermany

A variation of method #3 of boughtonp:

Code:

cut -c1,3,5,7 $InFile | paste -d" " - $InFile

Clean. Readable. Fast.

It would be nice if cut allowed overlay usages such as ..

Code:

cut -c1,3,5,7,1-8
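What cut does with such a list today can be checked quickly: it writes the selected characters in input order, each one only once, no matter how the list is ordered or whether ranges overlap. The variable name `out` is illustrative.

```shell
# The overlapping list collapses to characters 1-8, each printed once.
out=$(printf 'abcdefgh\n' | cut -c1,3,5,7,1-8)
echo "$out"    # abcdefgh
```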

Daniel B. Martin

.

astrogeek · 03-02-2023, 09:35 PM

You may want to reconsider the thread title: SED syntax question

Great solutions to the problem however!

pan64 · 03-03-2023, 01:04 AM

Regarding performance:
Using two tools instead of one costs more, so if possible try to use only one (in this case the two tools are cut and paste). Using the shell itself is probably even faster.
Do not try to measure performance on a few lines (you will not get an interpretable result); use millions of lines of data.
Another thing (not that important at all): you can omit the < in most cases, since awk, sed, grep, ... can open files themselves, so

Code:

awk 'script' file
# and
awk 'script' <file

are almost identical. (In the first case awk opens the file; in the second case the shell opens it and passes the file descriptor to awk.)
As usual, you can solve it in other languages too, like perl or python, but don't forget the shell can do that too.

Code:

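# Needs bash (or another shell that supports ${var:offset:length}).
while IFS= read -r line    # IFS= keeps leading and trailing blanks intact
do
    printf '%s %s %s %s %s\n' "${line:0:1}" "${line:2:1}" "${line:4:1}" "${line:6:1}" "$line"
done <inputfile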

ntubski · 03-03-2023, 06:04 AM

Quote:

Originally Posted by pan64

Regarding performance:
Using two tools instead of one costs more, so if possible try to use only one (in this case the two tools are cut and paste). Using the shell itself is probably even faster.
Do not try to measure performance on a few lines (you will not get an interpretable result); use millions of lines of data.

I would predict that if you measure on a file with millions of lines, the shell solution will be much much slower (something like 10 times slower). Shell would only be faster compared to a solution that runs sed/cut/paste/whatever once per line.

pan64 · 03-03-2023, 07:44 AM

Quote:

Originally Posted by ntubski

I would predict that if you measure on a file with millions of lines, the shell solution will be much much slower (something like 10 times slower). Shell would only be faster compared to a solution that runs sed/cut/paste/whatever once per line.

You can definitely try it. Using a regexp is much slower than ${var:x:y}, so I'm not really sure about that.
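One way to settle the question is to run both on the same generated file. This is only a sketch under stated assumptions: it needs bash (for `time` and `${var:offset:length}`), the file contents are made up, and `N`, `A`, `B` and `same` are illustrative names. No timings are claimed here; they have to be measured locally.

```shell
N=100000    # illustrative size; raise it towards millions for a longer run
InFile=$(mktemp); A=$(mktemp); B=$(mktemp)
awk -v n="$N" 'BEGIN { for (i = 0; i < n; i++) printf "%08d\n", i }' > "$InFile"

echo "sed:"
time sed -r 's/(.).(.).(.).(.)./\1 \2 \3 \4 &/' "$InFile" > "$A"

echo "bash loop:"
time while IFS= read -r line; do
    printf '%s %s %s %s %s\n' "${line:0:1}" "${line:2:1}" "${line:4:1}" "${line:6:1}" "$line"
done < "$InFile" > "$B"

# Confirm both methods produced identical output before comparing times.
cmp -s "$A" "$B" && same=yes || same=no
echo "identical output: $same"
```

Checking that the outputs match first avoids comparing the speed of two programs that do different things.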