[SOLVED] Squeeze out repeated characters

danielbmartin · 08-27-2015, 10:01 AM

This post pertains to a learning exercise. Just for "funsies."

Have: a file with one word per line.
Example:

Code:

success
failure

Want: the same file with consecutive repeats of the same character "squeezed out."
Example:

Code:

suces
failure

This may be done with tr ...

Code:

tr -s 'a-z' <$InFile >$OutFile

... or with sed ...

Code:

sed 's/\(.\)\1/\1/g' <$InFile >$OutFile

I tried to perform the same "squeeze" with awk and gsub but could not get the syntax right. Please advise.

Daniel B. Martin

grail · 08-27-2015, 11:09 AM

gsub does not allow back referencing, so you can either try gensub (which does) or set FS to null and loop over the characters of each word, removing repetition.
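A minimal sketch of the loop option, for illustration only. The function name squeeze_loop is hypothetical, and the sketch walks the line with substr rather than setting FS to null, because splitting on an empty FS is a common extension that POSIX does not specify. It keeps a character only when it differs from the one just before it.

```shell
# Hypothetical helper: reads words on stdin, writes squeezed words to stdout.
squeeze_loop() {
    awk '{
        out = ""; prev = ""
        for (i = 1; i <= length($0); i++) {
            ch = substr($0, i, 1)
            if (ch != prev) out = out ch   # keep only the first of each run
            prev = ch
        }
        print out
    }'
}

printf 'success\nfailure\n' | squeeze_loop
# prints:
# suces
# failure
```

With the variables from the original post this would be called as squeeze_loop <$InFile >$OutFile.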

danielbmartin · 08-27-2015, 01:57 PM

Quote:

Originally Posted by grail

... try gensub ...

This sed works ...

Code:

sed 's/\(.\)\1/\1/g' $InFile >$OutFile

... so I "borrowed" the RegEx for use with gensub ...

Code:

gawk '{$0=gensub(/\(.\)\1/,"\\1","g"); print $0}' $InFile >$OutFile

... but the output is identical to the input. It behaves as if the RegEx never matches.

I thought this variation ...

Code:

gawk '{$0=gensub(/\(.\)\1/,"","g"); print $0}' $InFile >$OutFile

... would remove both letter pairs, changing success to sue, but it doesn't.

Daniel B. Martin

ntubski · 08-27-2015, 03:44 PM

gawk doesn't use backslashes before grouping parens. Note also that gensub supports referencing captures in the replacement, but it still doesn't support backreferences in the pattern, so you can't really solve this nicely. For example, the following squeezes doubled c and s, but not other letters:

Code:

gawk '{ print(gensub(/(c)c|(s)s/, "\\1\\2", "g")) }'
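The same idea can be stretched to every lowercase letter by building one pattern per letter at run time instead of writing the alternation by hand. This is only a sketch under assumptions: the function name squeeze_gsub is hypothetical, it covers a-z only, and it relies on awk accepting a string as a dynamic regex in gsub. It is a workaround, not a backreference.

```shell
# Hypothetical helper: squeezes runs of lowercase a-z only.
squeeze_gsub() {
    awk '{
        letters = "abcdefghijklmnopqrstuvwxyz"
        for (i = 1; i <= length(letters); i++) {
            c = substr(letters, i, 1)
            gsub(c c "+", c)   # dynamic regex, e.g. "ss+" is replaced by "s"
        }
        print
    }'
}

printf 'success\nfailure\n' | squeeze_gsub
# prints:
# suces
# failure
```

Unlike a true backreference, this makes 26 passes over each line and leaves any character outside the letters string untouched.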

grail · 08-28-2015, 04:30 AM

My bad there. I was thinking of what does support back referencing and not where it was being applied. ntubski is on the money.

You will need to stick with my second option.

You could, of course, try Perl or Ruby as alternatives.

danielbmartin · 08-30-2015, 11:09 AM

The Original Post asked for a way to perform the "squeeze" with awk and gsub. The best minds on this forum say it's not possible. That makes the question resolved. Not truly solved, but resolved. Thanks to all.

Daniel B. Martin