[SOLVED] Text processing -- UPPER CASE doubled letters in second word of each line

danielbmartin · 09-25-2014, 10:37 AM

I'm doing this as a learning exercise. In other words, "just for the fun of it."

The InFile consists of lines containing blank-delimited lower case words.
The OutFile is an edited form of the InFile.
If the second word in a line contains a doubled letter, the doubled letters are upper-cased.
Lines from the InFile which are unchanged are not copied to the OutFile.

Here is my awk solution, illustrated with a sample InFile and the corresponding OutFile.

With this InFile ...

Code:

judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker

... this awk ...

Code:

awk -F ""  \
 '{pfb=index($0," ");                 # pfb=position of first blank
   psb=index(substr($0,pfb+1)," ");   # psb=position of second blank
    {pf=0;                            # pf=print flag
     {for (p=pfb+1;p<=pfb+psb-1;p++)  # Scan second word 
      {if ($p==$(p+1))                # Check for doubled letter
       {$p=$(p+1)=toupper($p);        # Upper-case both doubled letters
        pf=1}}                        # Turn on print flag
     if (pf) print $0}}}' OFS="" $InFile >$OutFile

... produced this OutFile ...

Code:

lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
actor actreSS tailor

This works, but seems long-winded and clumsy. I imagine there is a clever gsub or sed solution which is shorter. LQ Gurus, can you do it?

Daniel B. Martin

danielbmartin · 09-25-2014, 01:48 PM

Same InFile and OutFile as shown in post #1, but tighter code.

Code:

awk -F ""  \
 '{cob=pf=0                       # Initialize Count Of Blanks, Print Flag
   for (j=1;j<=NF;j++)            # j indexes across the line
     {if ($j==" ") cob++          # Increment Count Of Blanks
       if (cob==1 && $j==$(j+1))  # Check for doubled letter within word 2
         {$j=$(j+1)=toupper($j)   # Upper-case both doubled letters
           pf=1}}                 # Turn on print flag
   if (pf) print}' OFS="" $InFile >$OutFile

Daniel B. Martin

rtmistler · 09-25-2014, 01:56 PM

Hey.

Good stuff. I have little to contribute, since I'm more of a C coder than someone who uses awk and sed, except on limited occasions.

One thought: once you have fully developed some of these ideas, it might be worth writing them up as blog entries. All members can do that: go to your own MyLQ link up top, and in the left menu section you should see your blog (in your case empty), where you can create and post entries.

If you look under the Main Menu to the upper right on a normal forum view page, there's also a link for the LQ Blogs which will basically show new blog postings. The other thing is I post links to my blog entries to assist with complex answers for some questions.

I find it helpful in addition to threads.

And very cool that you're personally exploring things. A good thing to continue exercising the mind and intellect.

firstfire · 09-25-2014, 03:33 PM

Hi!

In sed, the main problem here is restricting the substitution to the second word.
Two approaches come to mind. The first is to remove everything except the second word from the pattern space, do the replacement (which in this case is very simple: s/((.)\2)/\U\1\E/g), and finally replace the second word in the original string with the post-processed second word. This can be done like this, for example:

Code:

$ cat infile 
judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
$ sed -rn 'h; s/\w+\s+(\w+).*/\1/; tb; :b; s/((.)\2)/\U\1\E/g; ta; b; :a; H; x; s/^(\w+\s+)(\w+)(.*)\n(.*)/\1\4\3/p' infile
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
actor actreSS tailor

Here is the annotated script:

Code:

h;                                  # store input line in hold space
s/\w+\s+(\w+).*/\1/;                # remove everything except for 2nd word
tb; :b;                             # reset t-flag, which would be true otherwise, as prev. s/// was successful
s/((.)\2)/\U\1\E/g;                 # upper-case all doubled letters
ta;                                 # jump to 'a' if there were double letters
b;                                  # jump to end of script
:a;
H;                                  # append pattern space to hold space (after a newline)
x;                                  # swap hold space and pattern space
s/^(\w+\s+)(\w+)(.*)\n(.*)/\1\4\3/p # replace second word by the word after newline

The second approach is to surround the second word with distinctive markers (e.g. the space that is already there, and \n) and perform the upper-casing between these markers in a loop, one pair of characters at a time, shrinking the marked region of the string at each iteration:

Code:

$ sed -rn 's/\w+/&\n/2; :a; s/(\s\w*)((\w)\3)(\w*)\n/\1\n\U\2\E\4/; ta; /\n\w/s/\n//p' infile
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
actor actreSS tailor

Here the first substitution adds a newline after the second word. In the second substitution we upper-case one pair of characters and move the newline to just before the upper-cased pair. Then, if the last substitution was successful, we jump to the :a label and try to substitute again. If the substitution failed, we check whether the newline has moved: if there is a word character right after the newline, then there were doubled letters, so we remove the newline and print the result.
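
Spread over several lines with comments, the same script might read as follows. This is only a re-layout of the one-liner above, assuming GNU sed (for -r, \w, \s, \U and \E); the wrapper name upcase_second_word is made up for illustration.

```shell
# Hypothetical wrapper: reads lines on stdin, prints only the changed ones.
upcase_second_word() {
  sed -rn '
    # Mark the end of the second word with a newline.
    s/\w+/&\n/2
    :a
    # Upper-case the rightmost remaining pair before the marker,
    # then move the marker to just before that pair.
    s/(\s\w*)((\w)\3)(\w*)\n/\1\n\U\2\E\4/
    ta
    # The marker moved only if a pair was found: drop it and print.
    /\n\w/s/\n//p
  '
}
```

It can be used as a filter, e.g. upcase_second_word < infile.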

ntubski · 09-25-2014, 03:48 PM

Quote:

I imagine there is a clever gsub or sed solution which is shorter.

awk doesn't support back references in its regexes, meaning you can't* recognize repeated letters with a regex, so gsub can't be used here. Letting awk split out the second word for you seems like it might save some code, but since we then have to replace letters in an immutable string, it doesn't really come out shorter.

Code:

awk '{new2=""; have_double=0;
for (i=1; i<=length($2); i++) {
  if(i < length($2) && substr($2, i, 1) == substr($2, i+1, 1)) {
    have_double = 1;
    new2 = new2 toupper(substr($2, i++, 2));
  } else {
    new2 = new2 substr($2, i, 1);
  }
}
$2 = new2;
} have_double' infile

* technically you could write /aa|bb|cc|.../ but that would certainly fall under "long winded and clumsy".
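
For what it's worth, the alternation from the footnote doesn't have to be typed out; it can be built in a BEGIN block. A rough sketch, assuming plain a-z input as in the problem statement; the function name upcase_with_alternation and the variable names are made up for illustration.

```shell
# Hypothetical wrapper: reads lines on stdin, prints only the changed ones.
upcase_with_alternation() {
  awk 'BEGIN {
         # Build the regex aa|bb|...|zz instead of writing it by hand.
         letters = "abcdefghijklmnopqrstuvwxyz"
         for (i = 1; i <= 26; i++) {
           c = substr(letters, i, 1)
           re = re (i > 1 ? "|" : "") c c
         }
       }
       {
         changed = 0; rest = $2; out = ""
         # Consume the second word pair by pair, left to right.
         while (match(rest, re)) {
           out = out substr(rest, 1, RSTART - 1) toupper(substr(rest, RSTART, 2))
           rest = substr(rest, RSTART + 2)
           changed = 1
         }
         $2 = out rest
       }
       changed'
}
```

It is still not shorter than the loop above, which rather supports the point about gsub.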

danielbmartin · 09-25-2014, 04:40 PM

Quote:

Originally Posted by firstfire

In sed main problem here is to restrict substitution to second word.

Yes, that's what I choked on.

Thank you for two impressive one-liners accompanied by a detailed explanation.
A superb response. Thank you!

Daniel B. Martin

danielbmartin · 09-26-2014, 11:13 AM

After much thrashing around I have a shorter solution.
Granted, it has two lines rather than one but has (in my eyes) an aesthetic appeal.

Code:

 sed 's/\(.\)\1/\U&/g' $InFile  \
|awk '{$0=tolower($1)" "$2" "tolower($3)
       if ($0~/[[:upper:]]/) print}' >$OutFile

Daniel B. Martin

grail · 09-26-2014, 12:05 PM

Others have shown the awk solutions, so I thought some ruby (2.1.2p95) might make a nice change:

Code:

ruby -ane 'sec = $F[1].clone;puts $F * " " if $F[1].scan(/((.)\2)/){|x| $F[1].sub!(x[0],x[0].upcase)} != sec' file

For the un-initiated:

sec = $F[1].clone :- assign the value of the second field to variable 'sec'

puts $F * " " if :- when 'if' is true print the array $F with elements separated by spaces

$F[1].scan(/((.)\2)/) :- as with the other solutions, find a repeated character and store it

|x| :- values found by scan are stored in variable 'x', this will be an array

$F[1].sub!(x[0],x[0].upcase) :- like awk, sub in ruby will find the first occurrence and replace it with value after the comma. ! is important as it changes the actual value in $F[1]

!= sec :- comparison used for 'if'

szboardstretcher · 09-26-2014, 12:23 PM

Quote:

Originally Posted by danielbmartin

Yes, that's what I choked on.

Thank you for two impressive one-liners accompanied by a detailed explanation.
A superb response. Thank you!

Daniel B. Martin

Note that sed, unlike awk, has no $2 variable for selecting the second column (in sed, $ is the last-line address), so the second word has to be isolated with a regex, as above.

I like your improved solution. Nothing wrong with it.

ttk · 09-26-2014, 04:24 PM

This perl could be a lot shorter, but I habitually refrain from using some shortcuts for the sake of readability, maintainability, and future expansion:

Code:

while (defined(my $x=<STDIN>)) {
    my $changed = 0;
    while ($x =~ /^(\w+\s+\w*?)([a-z])(\2)/) {
        $x = $1.uc("$2$3").$';
        $changed = 1;
    }
    print $x if ($changed);
}

Describing the parts of the regex:

Code:

^(\w+\s+\w*?)

- matches the first word, one or more spaces, and zero or more letters of the second word in the line, up to and not including the first letter which would be matched by the next part of the regex. The *? means "zero or more, non-greedily", and without the ? it would match all of the letters of the second word. Because of the parentheses, all of the characters thus matched are stored in group $1.

Code:

([a-z])

- matches one lowercase letter and stores it in group $2.

Code:

(\2)

- a backreference which matches the letter in group $2 and stores it in group $3.

As long as $x matches this regex, it gets overwritten by concatenating $1 (the first group), uc("$2$3") (the uppercased string constructed by interpolating $2 and $3), and $', a perl builtin which contains everything after the matched part of the string being matched (in this case $x).

The rest is pretty straightforward.

PTrenholme · 09-26-2014, 09:51 PM

Here's another GAWK program, with comments, and a few generalizations:

Code:

#!/bin/gawk -f
BEGIN {
# Set the system to ignore case when matching strings
  IGNORECASE=1
}
{
# Print a blank line and the input. (Commented out after testing.)
# print ""
# print
# Separate the second field into characters
  nc=split($2,character,"")
# Blank the second field (Recreated below)
  $2=""
# Loop through the characters in the second field looking for duplicates
  for (i=1;i<=nc;++i) {
# Save the current character (Probably not necessary, but it saves typing.)
    c=character[i]
# If the i-th character is not alphabetical, just add it back into the second field 
    if (c !~ /[[:alpha:]]/) {
      $2=$2 c
    }
    else {
# Uppercase any string of characters that are the same
      while (i<nc && character[i+1] ~ c) {
        $2 = $2 toupper(c)
        ++i
      }
# Handle the last character matching a prior, or not.
      $2 = $2 ((character[i-1] ~ c) ? toupper(c) : c)
    }
  }
# Write out the altered (or unaltered) input
  print
}

With this "enhanced" test data:

Code:

judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
Boooo!
Boooo whoooo!!??
Boo yoooouuu, you mundane person.

I get this:

Code:

judge lawyer attorney
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
accountant auditor actor
auditor actor actress
actor actreSS tailor
actress tailor baker
Boooo! 
Boooo whOOOO!!??
Boo yOOOOUUU, you mundane person.

grail · 09-27-2014, 03:38 AM

@PTrenholme :- Whilst a nice solution, the additional test items you have added expose a case that your script does not account for.

Apart from not restricting the output to the lines that were changed (a requirement of the OP, and an easy modification), the handling of the additional letters in the last entry, specifically yOOOOUUU, is not correct: there are 3 u's, so only 2 should have been upper-cased.

This does of course raise the question back to the OP on how this scenario should be treated. Assuming we are reading from left to right, I see 2 possible outcomes:

1. UUu
2. UUU

There is a third which would be uUU, but I am not sure under what circumstances this solution would be desired.
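
To make the first two outcomes concrete, here is a small sketch using GNU sed (assumed for -r and \U). Unlike the solutions above it works on the whole line rather than only the second word, and the function names are made up.

```shell
# Outcome 1 (UUu): pair letters up from the left; an odd trailing letter stays lower-case.
pairs_from_left() { sed -r 's/(.)\1/\U&/g'; }

# Outcome 2 (UUU): upper-case every run of two or more identical characters.
whole_runs() { sed -r 's/(.)\1+/\U&/g'; }

echo yoooouuu | pairs_from_left   # prints yOOOOUUu
echo yoooouuu | whole_runs        # prints yOOOOUUU
```

The only difference is the + after the back reference, which lets one match swallow the whole run.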

Over to you Daniel?

grail · 09-27-2014, 04:03 AM

Oh, I also forgot to add my awk solution:

Code:

#!/usr/bin/awk -f

{
  p = 0 
  for(i = 1; i < length($2); i++)
  {
    c = substr($2,i,1)
    if( c ~ /[a-z]/ && sub(c c,toupper(c c),$2))
      p = 1 
  }
}

p

danielbmartin · 09-27-2014, 10:44 AM

Quote:

Originally Posted by grail

This does of course raise the question back to the OP on how this scenario should be treated. ... Over to you Daniel?

Recall that this problem was contrived as a learning exercise and, in that respect, has been truly useful. The problem statement said:

Code:

 The InFile consists of lines containing blank-delimited lower case words.

I was using English-language words, none of which have (TTBOMK) tripled letters. However "word" in a programming context could mean "any character string."

So, to be exacting, the problem statement called for UPPER-CASING doubled letters and left tripled letters in a gray area. PTrenholme, and other interested LQ contributors, may wish to extend the problem statement to be "UPPER-CASE all repeated character strings." A thought-provoking variation.

I solved (but did not post) a different variation on this theme. LQ people who enjoy this sort of brain-teaser might like to post a solution. Problem statement: In each word of an InFile, UPPER-CASE any three consecutive letters which are in alphabetic sequence. Examples:
undefined => unDEFined
first => fiRST
stuck => STUck
ghibli => GHIbli
burst => buRST
... etc. etc.

Daniel B. Martin

firstfire · 09-27-2014, 11:09 AM

Hi, Daniel.

Code:

$ echo first | sed -rn 's/$/\nabcdefghijklmnopqrstuvwxyz/; s/(...)(.*)\n.*\1.*/\U\1\E\2/p'
fiRST
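
The one-liner relies on appending the alphabet to the pattern space: a three-letter group can then only match if the same three letters reappear after the newline, i.e. as a consecutive slice of the alphabet. A looped variant that handles every such triple on a line might look like the sketch below. It assumes GNU sed and lower-case input, and the function name upcase_alpha_runs is made up.

```shell
# Hypothetical wrapper: reads lines on stdin, prints only the changed ones.
upcase_alpha_runs() {
  sed -rn '
    # Append the alphabet after a newline, as a lookup string.
    s/$/\nabcdefghijklmnopqrstuvwxyz/
    :a
    # Upper-case the leftmost lower-case triple that also occurs
    # after the newline, i.e. inside the alphabet; repeat until none is left.
    s/([[:lower:]]{3})(.*\n.*\1)/\U\1\E\2/
    ta
    # Drop the lookup string and print only lines that were changed.
    s/\n.*//
    /[[:upper:]]/p
  '
}
```

For example, printf 'first burst\n' | upcase_alpha_runs should give fiRST buRST. Each pass upper-cases three lower-case letters, so the loop is bound to terminate.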