LinuxQuestions.org
Old 09-25-2014, 10:37 AM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Text processing -- UPPER CASE doubled letters in second word of each line


I'm doing this as a learning exercise. In other words, "just for the fun of it."

The InFile consists of lines containing blank-delimited lower case words.
The OutFile is an edited form of the InFile.
If the second word in a line contains a doubled letter, the doubled letters are upper-cased.
Lines from the InFile which are unchanged are not copied to the OutFile.

My awk solution follows, illustrated with a sample InFile and the corresponding OutFile.

With this InFile ...
Code:
judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
... this awk ...
Code:
awk -F ""  \
 '{pfb=index($0," ");                 # pfb=position of first blank
   psb=index(substr($0,pfb+1)," ");   # psb=position of second blank
    {pf=0;                            # pf=print flag
     {for (p=pfb+1;p<=pfb+psb-1;p++)  # Scan second word 
      {if ($p==$(p+1))                # Check for doubled letter
       {$p=$(p+1)=toupper($p);        # Upper-case both doubled letters
        pf=1}}                        # Turn on print flag
     if (pf) print $0}}}' OFS="" $InFile >$OutFile
... produced this OutFile ...
Code:
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
actor actreSS tailor
This works, but seems long-winded and clumsy. I imagine there is a clever gsub or sed solution which is shorter. LQ Gurus, can you do it?

Daniel B. Martin
 
Old 09-25-2014, 01:48 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Slightly improved awk solution

Same InFile and OutFile as shown in post #1, but tighter code.
Code:
awk -F ""  \
 '{cob=pf=0                       # Initialize Count Of Blanks, Print Flag
   for (j=1;j<=NF;j++)            # j indexes across the line
     {if ($j==" ") cob++          # Increment Count Of Blanks
       if (cob==1 && $j==$(j+1))  # Check for doubled letter within word 2
         {$j=$(j+1)=toupper($j)   # Upper-case both doubled letters
           pf=1}}                 # Turn on print flag
   if (pf) print}' OFS="" $InFile >$OutFile
Daniel B. Martin
 
Old 09-25-2014, 01:56 PM   #3
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Hey.

Good stuff. Little to contribute considering that I'm more of a C coder versus one who uses awk and sed except on limited occasions.

One thought: once you have fully developed some of these ideas, it might be worth writing them up as blog entries. All members can do that: go to your own MyLQ link up top, and in the left menu section you should see your blog (in your case empty), where you can create and post entries.

If you look under the Main Menu to the upper right on a normal forum view page, there's also a link for the LQ Blogs which will basically show new blog postings. The other thing is I post links to my blog entries to assist with complex answers for some questions.

I find it helpful in addition to threads.

And very cool that you're personally exploring things. A good thing to continue exercising the mind and intellect.
 
Old 09-25-2014, 03:33 PM   #4
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi!

In sed, the main problem here is restricting the substitution to the second word.
Two approaches come to mind. The first is to remove everything except the second word from the pattern space, do the replacement (which in this case is very simple: s/((.)\2)/\U\1\E/g), and finally replace the second word in the original string with the post-processed second word. This can be done like this, for example:
Code:
$ cat infile 
judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
$ sed -rn 'h; s/\w+\s+(\w+).*/\1/; tb; :b; s/((.)\2)/\U\1\E/g; ta; b; :a; H; x; s/^(\w+\s+)(\w+)(.*)\n(.*)/\1\4\3/p' infile
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
actor actreSS tailor
Here is the annotated script:
Code:
h;                                  # store input line in hold space
s/\w+\s+(\w+).*/\1/;                # remove everything except for 2nd word
tb; :b;                             # reset t-flag, which would be true otherwise, as prev. s/// was successful
s/((.)\2)/\U\1\E/g;                 # upper-case all doubled letters
ta;                                 # jump to 'a' if there were double letters
b;                                  # jump to end of script
:a;
H;                                  # append pattern space to hold space (after a newline)
x;                                  # swap hold space and pattern space
s/^(\w+\s+)(\w+)(.*)\n(.*)/\1\4\3/p # replace second word by the word after newline
The second approach is to surround the second word with distinctive markers (e.g. the space that is already there, and \n) and perform the upper-casing between these markers in a loop, one pair of characters at a time, shrinking the marked region of the string at each iteration:
Code:
$ sed -rn 's/\w+/&\n/2; :a; s/(\s\w*)((\w)\3)(\w*)\n/\1\n\U\2\E\4/; ta; /\n\w/s/\n//p' infile
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
actor actreSS tailor
Here the first substitution adds a newline after the second word. In the second substitution we upper-case one pair of characters and move the newline to just before the upper-cased pair. Then, if the last substitution was successful, we jump to the :a label and try to substitute again. Once the substitution fails, we check whether the newline has moved: if there is a word character right after the newline, then there were doubled letters, so we remove the newline and print the result.
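
The loop is easier to follow with the intermediate states printed. What follows is only a sketch, assuming GNU sed (for -r, \w, \s and \U). The helper name trace_marker_loop is made up for illustration, and '#' stands in for the newline marker purely so that it is visible in the output.

```shell
# Hypothetical helper: trace the marker loop for one input line.
# Assumes GNU sed. '#' replaces the \n marker so each step is readable.
trace_marker_loop() {
  printf '%s\n' "$1" |
    sed -rn \
      -e 's/\w+/&#/2p' \
      -e ':a' \
      -e 's/(\s\w*)((\w)\3)(\w*)#/\1#\U\2\E\4/p' \
      -e 'ta'
}

trace_marker_loop 'attorney bookkeeper accountant'
```

Each printed line is the pattern space after one substitution: first the marker sits after the second word, then it steps left past "ee", "kk" and "oo" as each pair is upper-cased, ending with "attorney b#OOKKEEper accountant". The rightmost pair is handled first because the leading \w* is greedy.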

Last edited by firstfire; 09-25-2014 at 03:34 PM. Reason: Fix a typo.
 
1 members found this post helpful.
Old 09-25-2014, 03:48 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081
Quote:
I imagine there is a clever gsub or sed solution which is shorter.
awk doesn't support back references in its regexes, meaning you can't* recognize repeated letters with regexes, so gsub can't be used here. Letting awk split out the second word for you seems like it might save some code, but since we then have to replace letters in an immutable string it doesn't really come out shorter.

Code:
awk '{new2=""; have_double=0;
for (i=1; i<=length($2); i++) {
  if(i < length($2) && substr($2, i, 1) == substr($2, i+1, 1)) {
    have_double = 1;
    new2 = new2 toupper(substr($2, i++, 2));
  } else {
    new2 = new2 substr($2, i, 1);
  }
}
$2 = new2;
} have_double' infile
* technically you could write /aa|bb|cc|.../ but that would certainly fall under "long winded and clumsy".
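
The alternation in the footnote does not have to be typed out by hand: a loop over the alphabet can run one gsub per letter on the second field. This is a rough sketch rather than a recommendation. It uses only POSIX awk features, upcase_doubles is a made-up name, and it assumes lower-case a-z input as in the problem statement.

```shell
# Hypothetical helper: upper-case doubled letters in field 2, one gsub per letter.
# Prints only the lines whose second word changed.
upcase_doubles() {
  awk '{
    old = $2
    n = split("a b c d e f g h i j k l m n o p q r s t u v w x y z", L, " ")
    for (k = 1; k <= n; k++)
      gsub(L[k] L[k], toupper(L[k] L[k]), $2)   # e.g. "tt" -> "TT"
    if ($2 != old) print
  }' "$@"
}

printf '%s\n' 'lawyer attorney bookkeeper' 'judge lawyer attorney' | upcase_doubles
```

It makes 26 passes over the word, so it is not efficient, but it keeps the work inside gsub and leaves the other fields untouched.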
 
1 members found this post helpful.
Old 09-25-2014, 04:40 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Quote:
Originally Posted by firstfire
In sed, the main problem here is restricting the substitution to the second word.
Yes, that's what I choked on.

Thank you for two impressive one-liners accompanied by a detailed explanation.
A superb response. Thank you!

Daniel B. Martin
 
Old 09-26-2014, 11:13 AM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
A shorter, cleaner solution

After much thrashing around I have a shorter solution.
Granted, it has two lines rather than one but has (in my eyes) an aesthetic appeal.
Code:
 sed 's/\(.\)\1/\U&/g' $InFile  \
|awk '{$0=tolower($1)" "$2" "tolower($3)
       if ($0~/[[:upper:]]/) print}' >$OutFile
Daniel B. Martin
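
One caveat with the two-liner: it rebuilds each line from $1, $2 and $3, so any words after the third are dropped. A possible variation, offered only as a sketch, lower-cases every field except the second instead. It assumes GNU sed for \U, lower-case input as in the problem statement, and the function name is invented for illustration.

```shell
# Hypothetical variation: works for any number of words per line.
# Assumes GNU sed (\U) and all-lower-case input.
upcase_second_word_doubles() {
  sed 's/\(.\)\1/\U&/g' "$@" |
    awk '{
      for (i = 1; i <= NF; i++)
        if (i != 2) $i = tolower($i)   # undo the upper-casing outside word 2
      if ($0 ~ /[A-Z]/) print          # keep only lines where word 2 changed
    }'
}

printf '%s\n' 'lawyer attorney bookkeeper teller' | upcase_second_word_doubles
```

The [A-Z] test stands in for [[:upper:]] only because some older awk builds lack POSIX character classes.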
 
Old 09-26-2014, 12:05 PM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Others have shown awk solutions; I thought some ruby (2.1.2p95) might make a nice change.
Code:
ruby -ane 'sec = $F[1].clone;puts $F * " " if $F[1].scan(/((.)\2)/){|x| $F[1].sub!(x[0],x[0].upcase)} != sec' file
For the un-initiated:

sec = $F[1].clone :- assign the value of the second field to variable 'sec'

puts $F * " " if :- when 'if' is true print the array $F with elements separated by spaces

$F[1].scan(/((.)\2)/) :- as with the other solutions, find a repeated character and store it

|x| :- values found by scan are stored in variable 'x', this will be an array

$F[1].sub!(x[0],x[0].upcase) :- like awk, sub in ruby will find the first occurrence and replace it with value after the comma. ! is important as it changes the actual value in $F[1]

!= sec :- comparison used for 'if'
 
Old 09-26-2014, 12:23 PM   #9
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,278

Rep: Reputation: 1694
Quote:
Originally Posted by danielbmartin
Yes, that's what I choked on.

Thank you for two impressive one-liners accompanied by a detailed explanation.
A superb response. Thank you!

Daniel B. Martin
Note that sed, unlike awk, has no $2 field variable for selecting the second column, so in sed the second word has to be isolated with a regex.

I like your improved solution. Nothing wrong with it.

Last edited by szboardstretcher; 09-26-2014 at 12:24 PM.
 
Old 09-26-2014, 04:24 PM   #10
ttk
Senior Member
 
Registered: May 2012
Location: Sebastopol, CA
Distribution: Slackware64
Posts: 1,038
Blog Entries: 27

Rep: Reputation: 1484
This perl could be a lot shorter, but I habitually refrain from using some shortcuts for the sake of readability, maintainability, and future expansion:

Code:
while (defined(my $x=<STDIN>)) {
    my $changed = 0;
    while ($x =~ /^(\w+\s+\w*?)([a-z])(\2)/) {
        $x = $1.uc("$2$3").$';
        $changed = 1;
    }
    print $x if ($changed);
}
Describing the parts of the regex:

Code:
^(\w+\s+\w*?)
- matches the first word, one or more spaces, and zero or more letters of the second word in the line, up to and not including the first letter which would be matched by the next part of the regex. The *? means "zero or more, non-greedily", and without the ? it would match all of the letters of the second word. Because of the parentheses, all of the characters thus matched are stored in group $1.

Code:
([a-z])
- matches one lowercase letter and stores it in group $2.

Code:
(\2)
- a backreference which matches the letter in group $2 and stores it in group $3.

As long as $x matches this regex, it gets overwritten by concatenating $1 (the first group) with uc("$2$3") (the uppercased string constructed by interpolation of $2 and $3), and $' which is a perl builtin which contains anything after the matched part of the string (in this case $x).

The rest is pretty straightforward.
 
Old 09-26-2014, 09:51 PM   #11
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
Here's another GAWK program, with comments, and a few generalizations:
Code:
#!/bin/gawk -f
BEGIN {
# Set the system to ignore case when matching strings
  IGNORECASE=1
}
{
# Print a blank line and the input. (Commented out after testing.)
# print ""
# print
# Separate the second field into characters
  nc=split($2,character,"")
# Blank the second field (Recreated below)
  $2=""
# Loop through the characters in the second field looking for duplicates
  for (i=1;i<=nc;++i) {
# Save the current character (Probably not necessary, but it saves typing.)
    c=character[i]
# If the i-th character is not alphabetical, just add it back into the second field 
    if (c !~ /[[:alpha:]]/) {
      $2=$2 c
    }
    else {
# Uppercase any string of characters that are the same
      while (i<nc && character[i+1] ~ c) {
        $2 = $2 toupper(c)
        ++i
      }
# Handle the last character matching a prior, or not.
      $2 = $2 ((character[i-1] ~ c) ? toupper(c) : c)
    }
  }
# Write out the altered (or unaltered) input
  print
}
With this "enhanced" test data:
Code:
judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
Boooo!
Boooo whoooo!!??
Boo yoooouuu, you mundane person.
I get this:
Code:
judge lawyer attorney
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
accountant auditor actor
auditor actor actress
actor actreSS tailor
actress tailor baker
Boooo! 
Boooo whOOOO!!??
Boo yOOOOUUU, you mundane person.
 
Old 09-27-2014, 03:38 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
@PTrenholme :- Whilst a nice solution, the additional items you have added expose a case that your script does not account for.

Apart from printing only the lines that were changed (part of the OP's requirement, and an easy modification), the result for the additional letters in the last entry, specifically yOOOOUUU,
is not correct: there are 3 Us, so only 2 should have been upper-cased.

This does of course raise the question back to the OP on how this scenario should be treated. Assuming we are reading from left to right, I see 2 possible outcomes:

1. UUu
2. UUU

There is a third which would be uUU, but I am not sure under what circumstances this solution would be desired.

Over to you Daniel?
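
To make outcomes 1 and 2 concrete, here is a small sketch that applies either policy to a whole word. It is illustrative only: upcase_repeats and its "pairs"/"runs" argument are invented names, it uses POSIX awk, and for simplicity it treats each input line as a single word instead of picking out the second field.

```shell
# Hypothetical helper: $1 is the policy, words are read from stdin (one per line).
#   pairs -> upper-case complete pairs only, left to right (uuu -> UUu)
#   runs  -> upper-case every run of two or more          (uuu -> UUU)
upcase_repeats() {
  awk -v policy="$1" '{
    w = $0; out = ""; n = length(w); i = 1
    while (i <= n) {
      c = substr(w, i, 1)
      len = 1                                   # length of the run starting at i
      while (i + len <= n && substr(w, i + len, 1) == c) len++
      if (policy == "runs") up = (len >= 2) ? len : 0
      else                  up = len - len % 2  # largest even count within the run
      for (k = 1; k <= len; k++) out = out (k <= up ? toupper(c) : c)
      i += len
    }
    print out
  }'
}

echo yoooouuu | upcase_repeats pairs   # yOOOOUUu
echo yoooouuu | upcase_repeats runs    # yOOOOUUU
```

Both policies agree on ordinary doubled letters and differ only on runs of odd length.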
 
Old 09-27-2014, 04:03 AM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Oh, I also forgot to add my awk solution:
Code:
#!/usr/bin/awk -f

{
  p = 0 
  for(i = 1; i < length($2); i++)
  {
    c = substr($2,i,1)
    if( c ~ /[a-z]/ && sub(c c,toupper(c c),$2))
      p = 1 
  }
}

p
 
Old 09-27-2014, 10:44 AM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Quote:
Originally Posted by grail
This does of course raise the question back to the OP on how this scenario should be treated. ... Over to you Daniel?
Recall that this problem was contrived as a learning exercise and, in that respect, has been truly useful. The problem statement said:
Code:
 The InFile consists of lines containing blank-delimited lower case words.
I was using English-language words, none of which have (TTBOMK) tripled letters. However "word" in a programming context could mean "any character string."

So, to be exacting, the problem statement called for UPPER-CASING doubled letters and left tripled letters in a gray area. PTrenholme, and other interested LQ contributors, may wish to extend the problem statement to be "UPPER-CASE all repeated character strings." A thought-provoking variation.

I solved (but did not post) a different variation on this theme. LQ people who enjoy this sort of brain-teaser might like to post a solution. Problem statement: in every word of an InFile which contains three letters in alphabetic sequence, UPPER-CASE those three letters. Examples:
undefined => unDEFined
first => fiRST
stuck => STUck
ghibli => GHIbli
burst => buRST
... etc. etc.

Daniel B. Martin
 
Old 09-27-2014, 11:09 AM   #15
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi, Daniel.

Code:
$ echo first | sed -rn 's/$/\nabcdefghijklmnopqrstuvwxyz/; s/(...)(.*)\n.*\1.*/\U\1\E\2/p'
fiRST
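
For comparison, the same alphabet-lookup idea can be written in plain awk. This is a sketch under assumptions, not a tested replacement for the sed one-liner: upcase_abc_runs is a made-up name, it expects one lower-case word per line, and it only upper-cases groups of exactly three letters (so "abcd" would become "ABCd").

```shell
# Hypothetical helper: upper-case every three-letter alphabetic sequence in a word.
# A triple is "in sequence" if it occurs as a substring of the alphabet.
upcase_abc_runs() {
  awk '{
    abc = "abcdefghijklmnopqrstuvwxyz"
    w = $0; out = ""; n = length(w); i = 1
    while (i <= n) {
      t = substr(w, i, 3)
      if (length(t) == 3 && index(abc, t) > 0) {
        out = out toupper(t); i += 3            # consume the whole triple
      } else {
        out = out substr(w, i, 1); i++          # copy one letter unchanged
      }
    }
    print out
  }'
}

printf '%s\n' undefined first stuck ghibli burst | upcase_abc_runs
```

On the five example words from the problem statement this yields unDEFined, fiRST, STUck, GHIbli and buRST.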
 
1 members found this post helpful.
  



Tags
awk, gsub, sed


