Text processing using a pattern. sed?

danielbmartin · 07-02-2015, 10:47 PM

This is recreational programming. Just for "funsies." I already have a working solution using awk but it is large and clumsy. Maybe there is a clean solution using sed with a clever RegEx.

The input file is a word list, one word per line, all lower case, already in sorted order. Every word has a trailing blank.

I invite the user to enter a pattern of the form .hi..
This means find all five-letter words which have "hi" in positions 2-3. If the pattern is in variable w1 this is easily done with

Code:

grep "^$w1 " "$InFile"

The result is a list of words such as ...

Code:

chick
chide
chief
child

All have "hi" in positions 2-3.

Now the interesting part: it is desired to remove all those "hi" strings to produce

Code:

cck
cde
cef
cld

Remember, we are using a pattern so cannot hard-code the positions 2-3. Is there a slick way to do this?

The next question is the inverse. After fiddling around with the interim result it is desired to restore the "hi" in positions 2-3, with the "hi" in upper case.

Code:

cHIck
cHIde
cHIef
cHIld

Is there a slick way to do this?

Daniel B. Martin

astrogeek · 07-02-2015, 11:30 PM

Is the input pattern restricted to be of the form .hi...?

Must the letters in the input pattern be contiguous, or must .h.i.. and similar cases be handled as well?

Would the restoration be a separate command working on a file to which the result of the first operation was written? If so, would the user be required to give the input pattern, possibly different, as a separate operation?

Alternatively, should the original operation generate the uppercase replacement pattern and produce the restored result as part of the same process?

grail · 07-03-2015, 12:23 AM

In addition to above, instead of restoring the desired letters, could we use the original file and simply upper case those required?

danielbmartin · 07-03-2015, 06:27 AM

Quote:

Originally Posted by astrogeek

Must the letters in the input pattern be contiguous, or must .h.i.. and similar cases be handled as well?

Contiguous.

Quote:

Would the restoration be a separate command working on a file to which the result of the first operation was written?

The file produced by the first command is reduced in size. Many words are "weeded out." The second command works on this subset.

Quote:

If so, would the user be required to give the input pattern, possibly different, as a separate operation?

The same pattern applies to both commands.

Daniel B. Martin

danielbmartin · 07-03-2015, 06:29 AM

Quote:

Originally Posted by grail

In addition to above, instead of restoring the desired letters, could we use the original file and simply upper case those required?

No.

Daniel B. Martin

pan64 · 07-03-2015, 07:13 AM

I would construct two additional regexps, one for the "before" part and one for the "after" part. Then you can play with those groups.

Code:

a="."
b="hi"
c=".. "

sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" filename

Then do something similar for the other direction.
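
For the reverse direction, one possible sketch (not pan64's own code) reuses the same three pieces: match the "before" and "after" groups and put the middle letters back in upper case. It assumes GNU sed, whose \U and \E escapes are an extension, and the helper name restore_middle is made up for illustration.

```shell
#!/bin/bash
# Sketch only: assumes GNU sed (\U...\E is a GNU extension).
a="."       # "before" part of the pattern .hi..
b="hi"      # letters that were removed and must come back
c=".. "     # "after" part, including the trailing blank from the word list

# restore_middle is a hypothetical helper: it reads reduced words on stdin
# and re-inserts $b, upper-cased, between the two remaining groups.
restore_middle() {
  sed -n "s/^\($a\)\($c\)$/\1\U$b\E\2/p"
}

restored=$(printf 'cck \ncde \ncef \ncld \n' | restore_middle)
printf '%s\n' "$restored"
```

Because the same a, b and c drive both directions, the pattern only has to be split once.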

danielbmartin · 07-03-2015, 07:13 AM

Perhaps it will help to describe the overall application. This coding challenge is inspired by a series of published word puzzles called Split Decisions. Refer to this example:
http://www.macnamarasband.com/split/sd010.html

My present (working) solution might be an example of "doing an easy thing the hard way."

Start with these two strings: _hi__ and _ra__

Choose all word pairs from the input file which have the same letters in the blank positions. The list might be ...

Code:

cHIck cRAck
cHImp cRAmp
cHIne cRAne
cHInk cRAnk
cHIps cRAps
cHIve cRAve
tHIck tRAck
wHIps wRAps

Daniel B. Martin

ntubski · 07-03-2015, 08:04 AM

I think the following would solve it (but is the awk part too "large and clumsy"?)

Code:

sed -nr 's/^(.)(hi|ra)(..)$/\L\1\U\2\E\3/p' "$InFile" | awk '{
  letters = $0; gsub(/[A-Z]/, "", letters);
  if (letters in matches)
    print (matches[letters] = matches[letters] " " $0);
  else
    matches[letters] = $0;
}'

The sed relies on the GNU extension to the s// command:

Quote:

The s Command
\L

Turn the replacement to lowercase until a \U or \E is found,

\U

Turn the replacement to uppercase until a \L or \E is found,

Otherwise, the sed part uses basically the same approach as pan64's.

Using http://www.cs.duke.edu/~ola/ap/linuxwords for $InFile I get

Code:

cHIck cRAck
cHInk cRAnk
tHIck tRAck
wHIps wRAps
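
To see the \L, \U and \E escapes in isolation, here is a small sketch that feeds two made-up words through the same substitution. It assumes GNU sed; the variable name marked is only for illustration.

```shell
# Assumes GNU sed: \L, \U and \E are extensions to the s command.
# \L\1 lower-cases the first letter, \U\2 upper-cases the matched pair,
# and \E switches case conversion off again before \3.
marked=$(printf 'chick\nCrack\n' | sed -nr 's/^(.)(hi|ra)(..)$/\L\1\U\2\E\3/p')
printf '%s\n' "$marked"
```

Both words come out with a lower-case first letter and an upper-case middle pair, which is what lets the awk step recover the shared letters by deleting [A-Z].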

danielbmartin · 07-03-2015, 03:16 PM

Quote:

Originally Posted by pan64

Code:

a="."
b="hi"
c=".. "

sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" filename

The sed looks good but I am unable to make it work. Embedding it in a bash script it looks like this ...

Code:

echo; echo "Method of LQ Guru pan64."
a="."
b="hi"
c=".. "
sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" $InFile >$Work7
echo "The number of lines in InFile is" $(wc -l <$InFile)
echo "The number of lines in Work7 is " $(wc -l <$Work7)

... and produces this result ...

Code:

Method of LQ Guru pan64.
The number of lines in InFile is 119557
The number of lines in Work7 is  119557

The entire InFile was copied to Work7.

Please advise.

Daniel B. Martin

ntubski · 07-03-2015, 03:28 PM

Code:

sed -n "s/^\($a\)\($b\)\($c\)$/\1\3/p" filename
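
The difference between the two commands: sed prints every input line by default, whether or not the substitution matched, which is why the whole file was copied. With -n the automatic printing is switched off, and the p flag prints only the lines where the substitution succeeded. A small sketch on a made-up four-word list (sample_words is a hypothetical helper):

```shell
a="."; b="hi"; c=".. "

# Hypothetical sample input: four words, each with a trailing blank.
sample_words() { printf 'chick \nchide \ncrack \nwhips \n'; }

# Default behaviour: every line is printed, changed or not.
all_lines=$(sample_words | sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" | wc -l)

# -n plus the p flag: only lines where the substitution happened.
matched=$(sample_words | sed -n "s/^\($a\)\($b\)\($c\)$/\1\3/p")

echo "without -n: $all_lines lines"
printf '%s\n' "$matched"
```

The first count equals the input size, while the second command keeps only the three words that fit .hi.. and drops "crack".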

danielbmartin · 07-03-2015, 03:47 PM

Quote:

Originally Posted by ntubski

Code:

sed -n "s/^\($a\)\($b\)\($c\)$/\1\3/p" filename

Much better! Thank you!

Daniel B. Martin

danielbmartin · 07-03-2015, 04:03 PM

This sed ...

Code:

echo; echo "Method #1 of LQ Senior Member ntubski."
sed -nr 's/^(.)(hi|ra)(.. )$/\L\1\U\2\E\3/p' "$InFile"  \
| awk '{letters=$0; gsub(/[A-Z]/, "", letters);
    if (letters in matches)
    print (matches[letters] = matches[letters] " " $0);
 else
    matches[letters] = $0;
}' >$Work6

... produces a storm of character strings but nothing like the good result you obtained.
Perhaps my PC lacks the required GNU extension.

Code:

daniel@daniel-desktop:~$ sed --version
GNU sed version 4.2.1

Daniel B. Martin

firstfire · 07-03-2015, 04:43 PM

Hi.

Here are two bash scripts for forward and backward transformation respectively and sample runs:

Code:

$ cat script1.sh
#!/bin/bash
pat="$1"
shift
p=$(echo "$pat" | sed -r 's/\.+|[^.]+/(&)/g')
sed -rn "s/^$p\$/\1\3/p" "$@"

$ cat ./script2.sh
#!/bin/bash
pat="$1"
shift
sed -rn "s/.*/@&@$pat/; :a; s/@(.)(.*)@\./\1@\2@/; s/@(.+)@([^.])/\U\2\E@\1@/;  ta; s/@@$//p" "$@"

# Example of forward transformation
$ ./script1.sh .hi.. /usr/share/dict/american-english
C's
Cba
Cle
Cmu
Cna
Rne
Sva
Teu
Wgs
Wte
cck
cde
cef
cld
cle
cli
cll
cme
cmp
cna
cnk
cno
cns
...

# Forward and backward transformation
$ ./script1.sh .hi.. /usr/share/dict/american-english  | ./script2.sh .hi..
CHI's
CHIba
CHIle
CHImu
CHIna
RHIne
SHIva
THIeu
WHIgs
WHIte
cHIck
cHIde
cHIef
cHIld
cHIle
cHIli
cHIll
cHIme
cHImp
cHIna
cHInk
cHIno
cHIns
...

The second script is probably the most opaque/unreadable/inefficient one here.
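
As a reading aid, here is a rough walk-through of the two scripts on a single word. It is a sketch assuming GNU sed, as in the scripts themselves; the intermediate strings in the comments are a hand trace of how the substitutions rewrite the pattern space, not output from the runs above.

```shell
pat=".hi.."

# script1.sh: wrap each run of dots and each run of letters in a group,
# so .hi.. becomes (.)(hi)(..); \1\3 then keeps the outer groups.
p=$(echo "$pat" | sed -r 's/\.+|[^.]+/(&)/g')
echo "$p"
reduced=$(echo "chick" | sed -rn "s/^$p\$/\1\3/p")
echo "$reduced"

# script2.sh: the word and the pattern are joined as @cck@.hi.. and the
# loop consumes one pattern character per substitution:
#   @cck@.hi..  ->  c@ck@hi..   (dot: move one letter out of the @...@ buffer)
#   c@ck@hi..   ->  cH@ck@i..   (letter: emit it in upper case)
#   cH@ck@i..   ->  cHI@ck@..
#   cHI@ck@..   ->  cHIc@k@.
#   cHIc@k@.    ->  cHIck@@     (buffer and pattern both used up)
# The final s/@@$//p strips the empty buffer and prints the word.
restored=$(echo "$reduced" | sed -rn "s/.*/@&@$pat/; :a; s/@(.)(.*)@\./\1@\2@/; s/@(.+)@([^.])/\U\2\E@\1@/;  ta; s/@@$//p")
echo "$restored"
```

Note that \1\3 in script1.sh presumes the pattern has the shape dots, letters, dots; a pattern that starts with letters would number the groups differently.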

Edit: of course we can do something like this (for reverse transform)

Code:

$ cat script3.sh
#!/bin/bash
pat="$1"
read a b c <<<"$(echo "$pat" | sed -r 's/\b/ /g')"
shift
sed -r "s/($a)($c)/\1\U$b\E\2/" "$@"

but it is not as much fun.

grail · 07-03-2015, 05:43 PM

Well, we could go all bash (for something different):

Code:

#!/usr/bin/env bash

declare -A found

a=.
b='hi|ra'
c=..

regex="^($a)($b)($c)$"
space=' '

while read line
do
  if [[ "$line" =~ $regex ]]
  then
    ind=${BASH_REMATCH[1],,}${BASH_REMATCH[3]}
    word=${BASH_REMATCH[1],,}${BASH_REMATCH[2]^^}${BASH_REMATCH[3]}

    [[ -n "${found[$ind]}" ]] && found[$ind]="${found[$ind]} $word" || found[$ind]="$word"
  fi
done<linuxwords

for entry in "${found[@]}"
do
  [[ "$entry" =~ $space ]] && echo "$entry"
done

Personally I would probably use something like:

Code:

ruby -ne 'a ||= {};$_.downcase!;if /^(.)(hi|ra)(..)$/; a[$1 + $3] ||= [];a[$1 + $3] << $1 + $2.upcase + $3;end;END{a.each_value{|v| puts v.join(" ") if v.size > 1}}' linuxwords

ntubski · 07-04-2015, 12:26 AM

Quote:

Originally Posted by danielbmartin

... produces a storm of character strings but nothing like the good result you obtained.

Interesting, when I run what you posted I get only

Code:

Method #1 of LQ Senior Member ntubski.

When I remove that extra space you added to the pattern it works.

Quote:

Perhaps my PC lacks the required GNU extension.

Code:

daniel@daniel-desktop:~$ sed --version
GNU sed version 4.2.1

I have the exact same sed version.

EDIT:
----------------------
I think the most likely explanation is that your $InFile is very different from mine. If you post the output of

Code:

sed -nr 's/^(.)(hi|ra)(.. )$/\L\1\U\2\E\3/p' "$InFile" | head

that should be enough to figure out what needs tweaking.
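
The effect of that extra space can be reproduced on a tiny made-up list. This is only a sketch, assuming GNU sed and assuming the relevant difference between the two word lists is the trailing blank after each word; the optional-blank variant at the end is a hypothetical tweak, not something taken from the posts above.

```shell
# Pattern as posted by danielbmartin: the third group requires a trailing blank.
strict='s/^(.)(hi|ra)(.. )$/\L\1\U\2\E\3/p'
# Hypothetical tweak: make the blank optional and keep it outside the group.
tolerant='s/^(.)(hi|ra)(..) ?$/\L\1\U\2\E\3/p'

# Words without a trailing blank: the strict pattern matches nothing.
strict_hits=$(printf 'chick\ncrack\n' | sed -nr "$strict" | wc -l)
echo "strict pattern, no trailing blanks: $strict_hits matches"

# The tolerant pattern accepts both kinds of line and drops the blank.
both=$(printf 'chick\ncrack \n' | sed -nr "$tolerant")
printf '%s\n' "$both"
```

Keeping the blank outside the third group also means the awk step sees clean words, whichever list is used.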