LinuxQuestions.org
07-02-2015, 10:47 PM   #1
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Text processing using a pattern. sed?


This is recreational programming. Just for "funsies." I already have a working solution using awk but it is large and clumsy. Maybe there is a clean solution using sed with a clever RegEx.

The input file is a word list, one word per line, all lower case, already in sorted order. Every word has a trailing blank.

I invite the user to enter a pattern of the form .hi..
This means find all five-letter words which have "hi" in positions 2-3. If the pattern is in variable w1 this is easily done with
Code:
grep "^$w1 " "$InFile"
The result is a list of words such as ...
Code:
chick
chide
chief
child
All have "hi" in positions 2-3.
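
For anyone reproducing this step, here is a minimal sketch of the selection with the expansions quoted, so that the pattern and the trailing blank reach grep as a single argument. The function name select_words and the sample word list are assumptions made for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: print the lines of a word list that match a
# dot-pattern such as ".hi..". Every word is assumed to end in one blank.
select_words() {
    local pattern=$1 file=$2
    # Quote the whole regex so the trailing blank stays part of it.
    grep "^${pattern} " "$file"
}

# Tiny sample list; each word is followed by a single blank.
sample=$(mktemp)
printf '%s \n' chick chide chief child chicken whistle > "$sample"

# Only the four five-letter words with "hi" in positions 2-3 are printed.
select_words '.hi..' "$sample"
rm -f "$sample"
```

The trailing blank in the regex is what limits the match to five-letter words: "chicken" starts with "chick" but is not followed by a blank at that point.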

Now the interesting part: it is desired to remove all those "hi" strings to produce
Code:
cck
cde
cef
cld
Remember, we are using a pattern so cannot hard-code the positions 2-3. Is there a slick way to do this?

The next question is the inverse. After fiddling around with the interim result it is desired to restore the "hi" in positions 2-3, with the "hi" in upper case.
Code:
cHIck
cHIde
cHIef
cHIld
Is there a slick way to do this?

Daniel B. Martin
 
07-02-2015, 11:30 PM   #2
astrogeek
Moderator
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,263
Blog Entries: 24

Is the input pattern restricted to be of the form .hi...?

Must the letters in the input pattern be contiguous, or must .h.i.. and similar cases be handled as well?

Would the restoration be a separate command working on a file to which the result of the first operation was written? If so, would the user be required to give the input pattern, possibly different, as a separate operation?

Alternatively, should the original operation generate the uppercase replacement pattern and produce the restored result as part of the same process?
 
07-03-2015, 12:23 AM   #3
grail
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

In addition to above, instead of restoring the desired letters, could we use the original file and simply upper case those required?
 
07-03-2015, 06:27 AM   #4
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881
Original Poster

Quote:
Originally Posted by astrogeek
Must the letters in the input pattern be contiguous, or must .h.i.. and similar cases be handled as well?
Contiguous.
Quote:
Would the restoration be a separate command working on a file to which the result of the first operation was written?
The file produced by the first command is reduced in size. Many words are "weeded out." The second command works on this subset.
Quote:
If so, would the user be required to give the input pattern, possibly different, as a separate operation?
The same pattern applies to both commands.

Daniel B. Martin
 
07-03-2015, 06:29 AM   #5
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881
Original Poster

Quote:
Originally Posted by grail
In addition to above, instead of restoring the desired letters, could we use the original file and simply upper case those required?
No.

Daniel B. Martin
 
07-03-2015, 07:13 AM   #6
pan64
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

I would construct 2 additional regexp, one for the "before" part, and one for the "after" part. And now you can play with those groups.
Code:
a="."
b="hi"
c=".. "

sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" filename
and do something similar for the other
 
07-03-2015, 07:13 AM   #7
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881
Original Poster

Perhaps it will help to describe the overall application. This coding challenge is inspired by a series of published word puzzles called Split Decisions. Refer to this example:
http://www.macnamarasband.com/split/sd010.html

My present (working) solution might be an example of "doing an easy thing the hard way."

Start with these two strings: _hi__ and _ra__

Choose all word pairs from the input file which have the same letters in the blank positions. The list might be ...
Code:
cHIck cRAck
cHImp cRAmp
cHIne cRAne
cHInk cRAnk
cHIps cRAps
cHIve cRAve
tHIck tRAck
wHIps wRAps
Daniel B. Martin

Last edited by danielbmartin; 07-03-2015 at 10:35 AM. Reason: Minor clarification
 
07-03-2015, 08:04 AM   #8
ntubski
Senior Member
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

I think the following would solve it (but is the awk part too "large and clumsy"?)

Code:
sed -nr 's/^(.)(hi|ra)(..)$/\L\1\U\2\E\3/p' "$InFile" | awk '{
 letters=$0; gsub(/[A-Z]/, "", letters);
 if (letters in matches)
    print (matches[letters] = matches[letters] " " $0);
 else
    matches[letters] = $0;
}'
The sed relies on the GNU extension to the s// command:
Quote:
The s Command
\L
Turn the replacement to lowercase until a \U or \E is found,
\U
Turn the replacement to uppercase until a \L or \E is found,
Otherwise, the sed part uses basically the same approach as pan64's.
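
As a small illustration of those case-conversion escapes, the sketch below runs the same substitution on a few inline words. It assumes GNU sed, since \L, \U and \E are GNU extensions; the function name mark_middle and the sample words are made up for the example.

```shell
#!/bin/bash
# Assumes GNU sed: \L, \U and \E are GNU extensions to the s command.
# Hypothetical helper: lower-case the first letter, upper-case the
# middle pair, and leave the last two letters untouched.
mark_middle() {
    sed -nr 's/^(.)(hi|ra)(..)$/\L\1\U\2\E\3/p'
}

# "stone" does not match the pattern, so -n suppresses it.
printf '%s\n' chick crack Whips stone | mark_middle
```

\E ends the conversion started by \U, which is why the final group comes through unchanged.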

Using http://www.cs.duke.edu/~ola/ap/linuxwords for $InFile I get
Code:
cHIck cRAck
cHInk cRAnk
tHIck tRAck
wHIps wRAps
 
07-03-2015, 03:16 PM   #9
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881
Original Poster

Quote:
Originally Posted by pan64
Code:
a="."
b="hi"
c=".. "

sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" filename
The sed looks good but I am unable to make it work. Embedding it in a bash script it looks like this ...
Code:
echo; echo "Method of LQ Guru pan64."
a="."
b="hi"
c=".. "
sed "s/^\($a\)\($b\)\($c\)$/\1\3/g" $InFile >$Work7
echo "The number of lines in InFile is" $(wc -l <$InFile)
echo "The number of lines in Work7 is " $(wc -l <$Work7)
... and produces this result ...
Code:
Method of LQ Guru pan64.
The number of lines in InFile is 119557
The number of lines in Work7 is  119557
The entire InFile was copied to the OutFile.

Please advise.

Daniel B. Martin
 
07-03-2015, 03:28 PM   #10
ntubski
Senior Member
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Code:
sed -n "s/^\($a\)\($b\)\($c\)$/\1\3/p" filename
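
The difference is easy to see on a three-word sample. Without -n, sed prints every line whether or not the substitution applied; with -n and the p flag, it prints only the lines that were changed. The sketch below reuses the a, b and c pieces from earlier in the thread; the function names and the sample words are assumptions for illustration.

```shell
#!/bin/bash
a="."; b="hi"; c=".. "     # c includes the trailing blank

# Without -n, every input line is printed, changed or not.
all_lines() { sed "s/^\($a\)\($b\)\($c\)$/\1\3/"; }

# With -n and the p flag, only lines where the substitution succeeded print.
only_changed() { sed -n "s/^\($a\)\($b\)\($c\)$/\1\3/p"; }

printf '%s \n' chick stone child | all_lines | wc -l      # 3 lines
printf '%s \n' chick stone child | only_changed | wc -l   # 2 lines
```

That is why the first attempt produced an output file with exactly as many lines as the input.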
 
07-03-2015, 03:47 PM   #11
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881
Original Poster

Quote:
Originally Posted by ntubski
Code:
sed -n "s/^\($a\)\($b\)\($c\)$/\1\3/p" filename
Much better! Thank you!

Daniel B. Martin
 
07-03-2015, 04:03 PM   #12
danielbmartin
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881
Original Poster

This sed ...
Code:
echo; echo "Method #1 of LQ Senior Member ntubski."
sed -nr 's/^(.)(hi|ra)(.. )$/\L\1\U\2\E\3/p' "$InFile"  \
| awk '{letters=$0; gsub(/[A-Z]/, "", letters);
    if (letters in matches)
    print (matches[letters] = matches[letters] " " $0);
 else
    matches[letters] = $0;
}' >$Work6
... produces a storm of character strings but nothing like the good result you obtained.
Perhaps my PC lacks the required GNU extension.
Code:
daniel@daniel-desktop:~$ sed --version
GNU sed version 4.2.1
Daniel B. Martin
 
07-03-2015, 04:43 PM   #13
firstfire
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Hi.

Here are two bash scripts for forward and backward transformation respectively and sample runs:
Code:
$ cat script1.sh
#!/bin/bash
pat="$1"
shift
p=$(echo "$pat" | sed -r 's/\.+|[^.]+/(&)/g')
sed -rn "s/^$p\$/\1\3/p" "$@"

$ cat ./script2.sh
#!/bin/bash
pat="$1"
shift
sed -rn "s/.*/@&@$pat/; :a; s/@(.)(.*)@\./\1@\2@/; s/@(.+)@([^.])/\U\2\E@\1@/; ta; s/@@$//p" "$@"

# Example of forward transformation
$ ./script1.sh .hi.. /usr/share/dict/american-english
C's
Cba
Cle
Cmu
Cna
Rne
Sva
Teu
Wgs
Wte
cck
cde
cef
cld
cle
cli
cll
cme
cmp
cna
cnk
cno
cns
...

# Forward and backward transformation
$ ./script1.sh .hi.. /usr/share/dict/american-english  | ./script2.sh .hi..
CHI's
CHIba
CHIle
CHImu
CHIna
RHIne
SHIva
THIeu
WHIgs
WHIte
cHIck
cHIde
cHIef
cHIld
cHIle
cHIli
cHIll
cHIme
cHImp
cHIna
cHInk
cHIno
cHIns
...
Second script is probably the most opaque/unreadable/inefficient one here
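
For readers who want to follow the two scripts step by step, here is a sketch that wraps their sed commands in functions and runs them on inline words instead of a dictionary file. It assumes GNU sed (-r, \U, \E); the function names to_groups and restore_marked are hypothetical, and the trace in the comments is for the word "cck" with the pattern ".hi..".

```shell
#!/bin/bash
# Assumes GNU sed (-r, \U, \E). Function names are hypothetical.

# script1's conversion: wrap each run of dots and each run of
# non-dot characters in its own group.
to_groups() {
    printf '%s\n' "$1" | sed -r 's/\.+|[^.]+/(&)/g'
}

# script2's loop. The line is rewritten as @word@pattern, then one
# character is consumed per pass. For "cck" and ".hi..":
#   start            @cck@.hi..
#   after pass 1     cH@ck@i..
#   after pass 2     cHI@ck@..
#   after pass 3     cHIc@k@.
#   after pass 4     cHIck@@   (the final s command strips the markers)
restore_marked() {
    local pat=$1
    sed -rn "s/.*/@&@$pat/; :a; s/@(.)(.*)@\./\1@\2@/; s/@(.+)@([^.])/\U\2\E@\1@/; ta; s/@@\$//p"
}

to_groups '.hi..'                                # prints (.)(hi)(..)
printf '%s\n' cck cde | restore_marked '.hi..'   # prints cHIck and cHIde
```

Note that the replacement \1\3 in script1.sh relies on the pattern having the shape dots, letters, dots, so that exactly three groups are produced.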

Edit: of course we can do something like this (for reverse transform)
Code:
$ cat script3.sh
#!/bin/bash
pat="$1"
read -r a b c <<<"$(echo "$pat" | sed -r 's/\b/ /g')"
shift
sed -r "s/($a)($c)/\1\U$b\E\2/" "$@"
but it is not as funny..

Last edited by firstfire; 07-03-2015 at 04:54 PM.
 
07-03-2015, 05:43 PM   #14
grail
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Well, we could go all bash (for something different):
Code:
#!/usr/bin/env bash

declare -A found

a=.
b='hi|ra'
c=..

regex="^($a)($b)($c)$"
space=' '

while read line
do
  if [[ "$line" =~ $regex ]]
  then
    ind=${BASH_REMATCH[1],,}${BASH_REMATCH[3]}
    word=${BASH_REMATCH[1],,}${BASH_REMATCH[2]^^}${BASH_REMATCH[3]}

    [[ -n "${found[$ind]}" ]] && found[$ind]="${found[$ind]} $word" || found[$ind]="$word"
  fi
done<linuxwords

for entry in "${found[@]}"
do
  [[ "$entry" =~ $space ]] && echo "$entry"
done
Personally I would probably use something like:
Code:
ruby -ne 'a ||= {};$_.downcase!;if /^(.)(hi|ra)(..)$/; a[$1 + $3] ||= [];a[$1 + $3] << $1 + $2.upcase + $3;end;END{a.each_value{|v| puts v.join(" ") if v.size > 1}}' linuxwords
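
The core of the all-bash version is the combination of BASH_REMATCH with the case-modifying expansions. A minimal sketch of just that step follows; it assumes bash 4 or later for ${var^^} and ${var,,}, and the function name mark_word is made up for the example.

```shell
#!/bin/bash
# Requires bash 4 or later for ${var^^} and ${var,,}.
# mark_word is a hypothetical helper for a single word.
mark_word() {
    local regex='^(.)(hi|ra)(..)$'
    if [[ $1 =~ $regex ]]; then
        # Group 1 lower-cased, group 2 upper-cased, group 3 as is.
        printf '%s\n' "${BASH_REMATCH[1],,}${BASH_REMATCH[2]^^}${BASH_REMATCH[3]}"
    fi
}

mark_word chick   # prints cHIck
mark_word stone   # prints nothing
```

Keeping the regex in a variable and leaving it unquoted on the right of =~ is what makes bash treat it as a regular expression rather than a literal string.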
 
07-04-2015, 12:26 AM   #15
ntubski
Senior Member
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Quote:
Originally Posted by danielbmartin
... produces a storm of character strings but nothing like the good result you obtained.
Interesting, when I run what you posted I get only
Code:
Method #1 of LQ Senior Member ntubski.
When I remove that extra space you added to the pattern it works.


Quote:
Perhaps my PC lacks the required GNU extension.
Code:
daniel@daniel-desktop:~$ sed --version
GNU sed version 4.2.1
I have the exact same sed version.

EDIT:
----------------------
I think the most likely thing is you have very different $InFile from me, if you post the output of

Code:
sed -nr 's/^(.)(hi|ra)(.. )$/\L\1\U\2\E\3/p' "$InFile" | head
that should be enough to figure out what needs tweaking.
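
Since the first post says every word in the input has a trailing blank, one quick check is to make the line endings visible. This is only a sketch, not a diagnosis of the actual file: the helper name has_trailing_blank and the sample data are assumptions, and the last command assumes GNU sed.

```shell
#!/bin/bash
# has_trailing_blank is a hypothetical helper: it succeeds if any line
# of the file ends in a space.
has_trailing_blank() {
    grep -q ' $' "$1"
}

f=$(mktemp)
printf '%s \n' chick crack > "$f"

# sed's l command prints each line unambiguously and marks its end
# with $, so a trailing blank shows up as "chick $".
sed -n l "$f"

if has_trailing_blank "$f"; then
    # Strip the blank first; then the pattern without the extra space applies.
    sed 's/ $//' "$f" | sed -nr 's/^(.)(hi|ra)(..)$/\1\U\2\E\3/p'
fi
rm -f "$f"
```

Stripping the blank up front also keeps it out of the awk stage, where it would otherwise become part of each line.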

Last edited by ntubski; 07-04-2015 at 01:17 AM. Reason: add note about $InFile
 
  

