[SOLVED] Text processing -- UPPER CASE doubled letters in second word of each line
I'm doing this as a learning exercise. In other words, "just for the fun of it."
The InFile consists of lines containing blank-delimited lower case words.
The OutFile is an edited form of the InFile.
If the second word in a line contains a doubled letter, the doubled letters are upper-cased.
Lines from the InFile which are unchanged are not copied to the OutFile.
This is my awk solution which serves to illustrate with a sample InFile and corresponding OutFile.
With this InFile ...
Code:
judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
... this awk ...
Code:
awk -F "" \
'{pfb=index($0," ");                   # pfb=position of first blank
  psb=index(substr($0,pfb+1)," ");     # psb=offset of second blank, counted from pfb
  pf=0;                                # pf=print flag
  for (p=pfb+1;p<=pfb+psb-1;p++)       # Scan second word
    {if ($p==$(p+1))                   # Check for doubled letter
       {$p=$(p+1)=toupper($p);         # Upper-case both doubled letters
        pf=1}}                         # Turn on print flag
  if (pf) print $0}' OFS="" "$InFile" >"$OutFile"
Same InFile and OutFile as shown in post #1, but tighter code.
Code:
awk -F "" \
'{cob=pf=0                        # Initialize Count Of Blanks, Print Flag
  for (j=1;j<=NF;j++)             # j indexes across the line
    {if ($j==" ") cob++           # Increment Count Of Blanks
     if (cob==1 && $j==$(j+1))    # Check for doubled letter within word 2
       {$j=$(j+1)=toupper($j)     # Upper-case both doubled letters
        pf=1}}                    # Turn on print flag
  if (pf) print}' OFS="" "$InFile" >"$OutFile"
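For readers who do not use awk, the count-of-blanks scan above can be sketched in Python. This is only an illustration of the same idea, not a translation anyone in the thread posted; the function name is made up, and it assumes each line has already had its trailing newline removed. One small difference: a blank is never compared with its neighbour here, so two blanks in a row do not count as a "doubled letter".

```python
def upcase_doubles_in_word2(line):
    """Return the edited line, or None if the second word has no doubled letter."""
    chars = list(line)      # mutable copy, one entry per character
    blanks = 0              # count of blanks seen so far
    changed = False         # plays the role of the awk print flag
    for j in range(len(chars) - 1):
        if chars[j] == " ":
            blanks += 1
        # blanks == 1 means position j lies inside the second word
        elif blanks == 1 and chars[j] == chars[j + 1]:
            chars[j] = chars[j + 1] = chars[j].upper()
            changed = True
    return "".join(chars) if changed else None
```

Calling it on each line of the sample InFile and printing only the non-None results gives the filtered output the problem statement asks for.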
Good stuff. I have little to contribute, since I'm more of a C coder and use awk and sed only on limited occasions.
A possible thought: when you fully develop some of these ideas, it might be a good thing to create blog entries. All members can do that. Go to your own MyLQ link up top, and in the left menu section you should see your blog (in your case empty); you can then create blog entries and post them.
If you look under the Main Menu to the upper right on a normal forum view page, there's also a link for the LQ Blogs which will basically show new blog postings. The other thing is I post links to my blog entries to assist with complex answers for some questions.
I find it helpful in addition to threads.
And very cool that you're personally exploring things. A good thing to continue exercising the mind and intellect.
In sed, the main problem here is to restrict the substitution to the second word.
Two approaches come to mind. The first is to remove everything except the second word from the pattern space, do the replacement (which in this case is very simple: s/((.)\2)/\U\1\E/g), and finally replace the second word in the original string with the post-processed second word. This can be done like this, for example:
h; # store input line in hold space
s/\w+\s+(\w+).*/\1/; # remove everything except for 2nd word
tb; :b; # reset t-flag, which would be true otherwise, as prev. s/// was successful
s/((.)\2)/\U\1\E/g; # upper-case all doubled letters
ta; # jump to 'a' if there were double letters
b; # jump to end of script
:a;
H; # append pattern space to hold space (after a newline)
x; # swap hold space and pattern space
s/^(\w+\s+)(\w+)(.*)\n(.*)/\1\4\3/p # replace second word by the word after newline
The second approach is to surround the second word with distinctive markers (e.g. the space, which is already there, and \n) and perform the upper-casing between these markers in a loop, one pair of characters at a time, shrinking the marked region of the string at each iteration:
Here the first substitution adds a newline after the second word. In the second substitution we upper-case one pair of characters and move the newline before the upper-cased pair. Then, if the last substitution was successful, we jump to the :a label and try to substitute again. If the substitution failed, we check whether the newline has moved. If there is a non-space character after the newline, then there were doubled letters, so we remove the newline and print the result.
Last edited by firstfire; 09-25-2014 at 03:34 PM.
Reason: Fix a typo.
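The marker-and-loop idea described above can be sketched in Python. This is not the sed script from the post; it is a hedged illustration of the same steps, with a made-up function name, and it assumes the input line contains no newline of its own so that "\n" is safe to use as the marker.

```python
import re

MARK = "\n"  # marker character; assumed not to occur inside a line

# first word, blanks, start of word 2, one doubled pair, rest of word 2, marker
PAIR = re.compile(r"^(\S+ +\S*)((\S)\3)(\S*)" + MARK)

def upcase_word2_marker(line):
    """Return the edited line, or None when word 2 has no doubled letter."""
    # Step 1: place the marker immediately after the second word.
    s = re.sub(r"^(\S+ +\S+)", lambda m: m.group(1) + MARK, line, count=1)
    if MARK not in s:
        return None  # fewer than two words
    # Step 2: upper-case the last pair before the marker, then move the
    # marker in front of that pair so the marked region shrinks.
    while True:
        s, n = PAIR.subn(
            lambda m: m.group(1) + MARK + m.group(2).upper() + m.group(4),
            s, count=1)
        if n == 0:
            break
    # Step 3: the marker has moved only if at least one pair was found.
    if re.search(MARK + r"\S", s):
        return s.replace(MARK, "", 1)
    return None
```

Because the greedy prefix finds the rightmost pair first, this sketch works through the word from right to left, so a tripled letter such as "ooo" comes out as "oOO" rather than "OOo".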
I imagine there is a clever gsub or sed solution which is shorter.
awk doesn't support back references in its regexes, meaning you can't* recognize repeated letters with regexes, so gsub can't be used here. Letting awk split out the second word for you seems like it might save some code, but since we then have to replace letters in an immutable string it doesn't really come out shorter.
sec = $F[1].clone :- assign the value of the second field to variable 'sec'
puts $F * " " if :- when 'if' is true print the array $F with elements separated by spaces
$F[1].scan(/((.)\2)/) :- as with other solutions, find repeated character and store it
|x| :- values found by scan are stored in variable 'x', this will be an array
$F[1].sub!(x[0],x[0].upcase) :- like awk, sub in ruby will find the first occurrence and replace it with value after the comma. ! is important as it changes the actual value in $F[1]
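The scan-then-substitute steps listed above can be sketched in Python as follows. This is an illustration only, not the Ruby one-liner being described; the function name is hypothetical, and the sketch assumes single-blank-delimited words as in the problem statement.

```python
import re

def upcase_by_scan(line):
    """Return the edited line, or None if the second word is unchanged."""
    fields = line.split(" ")
    if len(fields) < 2:
        return None
    original = fields[1]   # kept for the "did anything change?" test
    # Scan the untouched second word for doubled characters ...
    for m in re.finditer(r"(.)\1", original):
        # ... and replace the first remaining occurrence of each pair.
        fields[1] = fields[1].replace(m.group(0), m.group(0).upper(), 1)
    return " ".join(fields) if fields[1] != original else None
```

Replacing only the first remaining occurrence still handles a word that contains the same pair twice, because a pair that has already been upper-cased no longer matches the lower-case search string.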
This perl could be a lot shorter, but I habitually refrain from using some shortcuts for the sake of readability, maintainability, and future expansion:
Code:
while (defined(my $x=<STDIN>)) {
    my $changed = 0;
    while ($x =~ /^(\w+\s+\w*?)([a-z])(\2)/) {
        $x = $1.uc("$2$3").$';
        $changed = 1;
    }
    print $x if ($changed);
}
Describing the parts of the regex:
Code:
^(\w+\s+\w*?)
- matches the first word, one or more whitespace characters, and zero or more letters of the second word in the line, up to and not including the first letter which would be matched by the next part of the regex. The *? means "zero or more, non-greedily"; without the ? the \w* would grab as many letters of the second word as it could and only back off as far as needed, so the rightmost doubled pair would be matched first instead of the leftmost. Because of the parentheses, all of the characters thus matched are stored in group $1.
Code:
([a-z])
- matches one lowercase letter and stores it in group $2.
Code:
(\2)
- a backreference which matches the letter in group $2 and stores it in group $3.
As long as $x matches this regex, it gets overwritten by concatenating $1 (the first group) with uc("$2$3") (the uppercased string constructed by interpolation of $2 and $3), and $' which is a perl builtin which contains anything after the matched part of the string (in this case $x).
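The loop described above carries over almost unchanged to Python's re module, which may help readers who do not know Perl. This is a sketch under the same assumptions as the Perl (lower-case input words); the function name is made up, and slicing from m.end() stands in for Perl's $'.

```python
import re

# Same three parts as the Perl regex: a lazy prefix, one lower-case letter,
# and a backreference to that letter.
DOUBLED = re.compile(r"^(\w+\s+\w*?)([a-z])(\2)")

def upcase_doubles(line):
    """Return the edited line, or None if nothing in word 2 was doubled."""
    changed = False
    while True:
        m = DOUBLED.search(line)
        if m is None:
            break
        # prefix + upper-cased pair + everything after the match
        line = m.group(1) + (m.group(2) + m.group(3)).upper() + line[m.end():]
        changed = True
    return line if changed else None
```

The loop terminates because [a-z] cannot match a pair that has already been upper-cased, while \w in the prefix can still step over it to reach the next pair.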
Here's another GAWK program, with comments, and a few generalizations:
Code:
#!/bin/gawk -f
BEGIN {
    # Set the system to ignore case when matching strings
    IGNORECASE=1
}
{
    # Print a blank line and the input. (Commented out after testing.)
    # print ""
    # print
    # Separate the second field into characters
    nc=split($2,character,"")
    # Blank the second field (Recreated below)
    $2=""
    # Loop through the characters in the second field looking for duplicates
    for (i=1;i<=nc;++i) {
        # Save the current character (Probably not necessary, but it saves typing.)
        c=character[i]
        # If the i-th character is not alphabetical, just add it back into the second field
        if (c !~ /[[:alpha:]]/) {
            $2=$2 c
        }
        else {
            # Uppercase any string of characters that are the same
            while (i<nc && character[i+1] ~ c) {
                $2 = $2 toupper(c)
                ++i
            }
            # Handle the last character matching a prior, or not.
            $2 = $2 ((character[i-1] ~ c) ? toupper(c) : c)
        }
    }
    # Write out the altered (or unaltered) input
    print
}
With this "enhanced" test data:
Code:
judge lawyer attorney
lawyer attorney bookkeeper
attorney bookkeeper accountant
bookkeeper accountant auditor
accountant auditor actor
auditor actor actress
actor actress tailor
actress tailor baker
Boooo!
Boooo whoooo!!??
Boo yoooouuu, you mundane person.
I get this:
Code:
judge lawyer attorney
lawyer aTTorney bookkeeper
attorney bOOKKEEper accountant
bookkeeper aCCountant auditor
accountant auditor actor
auditor actor actress
actor actreSS tailor
actress tailor baker
Boooo!
Boooo whOOOO!!??
Boo yOOOOUUU, you mundane person.
@PTrenholme :- Whilst a nice solution, you have introduced an error: the additional items you have added contain a case that your script does not account for.
Apart from printing only the lines that were changed (part of the OP's spec, and an easy modification), the additional letters in your last entry, specifically yOOOOUUU, are not correct: there are 3 Us, so only 2 should have been made upper-case.
This does of course raise the question back to the OP on how this scenario should be treated. Assuming we are reading from left to right, I see 2 possible outcomes:
1. UUu
2. UUU
There is a third which would be uUU, but I am not sure under what circumstances this solution would be desired.
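The first two outcomes differ only in the regex used. A minimal Python sketch, with hypothetical function names and assuming lower-case input, shows both policies side by side:

```python
import re

def pairs_only(word):
    # Outcome 1: consume letters two at a time, left to right ("uuu" -> "UUu").
    return re.sub(r"([a-z])\1", lambda m: m.group(0).upper(), word)

def whole_run(word):
    # Outcome 2: upper-case the entire run of a repeated letter ("uuu" -> "UUU").
    return re.sub(r"([a-z])\1+", lambda m: m.group(0).upper(), word)
```

With the test word from the enhanced data, pairs_only("yoooouuu,") gives "yOOOOUUu," and whole_run("yoooouuu,") gives "yOOOOUUU,".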
This does of course raise the question back to the OP on how this scenario should be treated. ... Over to you Daniel?
Recall that this problem was contrived as a learning exercise and, in that respect, has been truly useful. The problem statement said:
Code:
The InFile consists of lines containing blank-delimited lower case words.
I was using English-language words, none of which have (TTBOMK) tripled letters. However "word" in a programming context could mean "any character string."
So, to be exacting, the problem statement called for UPPER-CASING doubled letters and left tripled letters in a gray area. PTrenholme, and other interested LQ contributors, may wish to extend the problem statement to be "UPPER-CASE all repeated character strings." A thought-provoking variation.
I solved (but did not post) a different variation on this theme. LQ people who enjoy this sort of brain-teaser might like to post a solution. Problem statement: UPPER-CASE all words in an InFile which contain three letters in alphabetic sequence. Examples:
undefined => unDEFined
first => fiRST
stuck => STUck
ghibli => GHIbli
burst => buRST
... etc. etc.
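One possible take on this variation, sketched in Python. It is not the unposted solution mentioned above. Two assumptions are made here: going by the examples, only the letters in the sequence are upper-cased (not the whole word), and a run longer than three letters is upper-cased in full. The function name and the min_len parameter are made up, and lower-case ASCII input is assumed.

```python
def upcase_alpha_runs(word, min_len=3):
    """Upper-case every run of min_len or more letters in alphabetic sequence."""
    chars = list(word)
    start = 0   # index where the current run begins
    for i in range(1, len(chars) + 1):
        # A run ends at the end of the word, or where the next letter is
        # not the immediate successor of the previous one.
        if i == len(chars) or ord(chars[i]) != ord(chars[i - 1]) + 1:
            if i - start >= min_len:
                chars[start:i] = [c.upper() for c in chars[start:i]]
            start = i
    return "".join(chars)
```

Applied to each word of a line, this reproduces the five examples listed above.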