LinuxQuestions.org


cgcamal 05-01-2010 01:11 AM

Change to capital first letter of every word over specific column
 
Hi guys,

I'm trying to capitalize the first letter of every word in a specific column.

The source file is as follows:
Code:

PRODUCT No.|SCIENCE BOOKS|DESCRIPTION
Product 1|PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687)|Blah blah blah
Product 2|Dialogue concerning the two chief world systems (1632)|blah blah blah
Product 3|De Revolutionibus Orbium Coelestium (1543)|blah blah blah
Product 4|the voyage of the beagle (1845)|blah blah blah

The desired output is:
Code:

PRODUCT No.|SCIENCE BOOKS|DESCRIPTION
Product 1|Philosophiae Naturalis Principia Mathematica (1687)|Blah blah blah
Product 2|Dialogue Concerning The Two Chief World Systems (1632)|blah blah blah
Product 3|De Revolutionibus Orbium Coelestium (1543)|blah blah blah
Product 4|The Voyage Of The Beagle (1845)|blah blah blah

I know that something similar can be done with SED using:
Code:

sed -e 's/.*/\L&/' -e 's/\<./\u&/g' file

where:
sed 's/.*/\L&/' file    --> changes the whole file content to lower case
sed 's/\<./\u&/g' file  --> changes only the first letter of each word to upper case

But SED works on all columns, and I don't know how to restrict it to a specific column. For this sample file the task would be on column 2.
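For reference, here is a minimal sketch of what those two sed expressions do to a single data line. It assumes GNU sed, since `\L`, `\u` and `\<` are GNU extensions, and the variable names are only for illustration. Note that column 3 is title-cased as well, which is exactly the problem described above.

```shell
# Assumes GNU sed: \L, \u and \< are GNU extensions.
line='Product 4|the voyage of the beagle (1845)|blah blah blah'

# Lowercase the whole line, then uppercase the first character of every word.
result=$(printf '%s\n' "$line" | sed -e 's/.*/\L&/' -e 's/\<./\u&/g')

printf '%s\n' "$result"
# Product 4|The Voyage Of The Beagle (1845)|Blah Blah Blah
```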

I've also tried AWK, with the following script:
Code:

awk 'BEGIN{FS=OFS="|"} NR>1{$2=tolower($2);$2=gensub(/\<[A-Za-z]/,"X","g",$2)} {print $0}' file
PRODUCT No.|SCIENCE BOOKS|DESCRIPTION
Product 1|Xhilosophiae Xaturalis Xrincipia Xathematica (1687)|Blah blah blah
Product 2|Xialogue Xoncerning Xhe Xwo Xhief Xorld Xystems (1632)|blah blah blah
Product 3|Xe Xevolutionibus Xrbium Xoelestium (1543)|blah blah blah
Product 4|Xhe Xoyage Xf Xhe Xeagle (1845)|blah blah blah

But this script is only a test, because within column 2 it replaces the first letter of every word with a constant "X". I don't know how to make gensub replace the matched pattern with the same text in upper case.

Maybe somebody could give a suggestion.

Thanks in advance.

catkin 05-01-2010 04:18 AM

Here's code that does the core of what you want but it
  • Assumes the last word does not need title-casing (is always the year).
  • Does not yield true title case ("the", "of" etc. should not be capitalised).
  • Assumes there are only three "|"-separated fields.
Code:

#!/bin/bash

# Simulate reading file
line[0]="Product 1|PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687)|Blah blah blah"
line[1]="Product 2|Dialogue concerning the two chief world systems (1632)|blah blah blah"
line[2]="Product 3|De Revolutionibus Orbium Coelestium (1543)|blah blah blah"
line[3]="Product 4|the voyage of the beagle (1845)|blah blah blah"

IFS='|'
for (( i=0; i<${#line[*]}; i++ ))
do
    array=( ${line[i]} )
    #echo "${array[1]}" | sed 's/.[^[:space:]]* /XX /g'
    #echo "${array[1]}" | sed 's/\(.\)[^[:space:]]* /\1XX /g'
    #echo "${array[1]}" | sed 's/\(.\)[^[:space:]]* /\u\1XX /g'
    #echo "${array[1]}" | sed 's/\(.\)\([^[:space:]]* \)/\u\1\2/g'
    #echo "${array[1]}" | sed 's/\(.\)\([^[:space:]]* \)/\u\1\L\2/g'
    echo "${array[0]}|$( echo "${array[1]}" | sed 's/\(.\)\([^[:space:]]* \)/\u\1\L\2/g' )|${array[2]}"
done
unset IFS

The experiments used to determine the correct sed command are shown, commented out, in case they are instructive.

grail 05-01-2010 06:16 AM

Hmm ... Well I am sure someone else would be able to work out where my slip up is, but the following will set ALL first letters to capital:

Code:

sed -r -e '2,$s/\|([^|]*)/\|\L\1/' -e 's@(\b[a-z])@\u\1@g' in.txt
Edit: I was keen to find an awk solution. Again we may need a guru to tidy it up, but it does produce the desired result:
Code:

awk 'BEGIN{OFS=FS="|"}
    NR>1{$2=tolower($2);split($2,arr," "); $2=""; i=asorti(arr,arr2);
          for(x=1;x<=i;x++){$2=$2toupper(substr(arr[arr2[x]],1,1))substr(arr[arr2[x]],2)" "}
        }1
    ' in.txt


grail 05-01-2010 07:15 AM

Well this one may be a little more readable:
Code:

awk 'BEGIN{OFS=FS="|"}
    NR>1{$2 = tolower($2);
          split($2,arr," ");
          for(x in arr)
              sub(arr[x],toupper(substr(arr[x],1,1))substr(arr[x],2),$2)
        }1' in.txt


cgcamal 05-01-2010 03:17 PM

Hi catkin,

Many thanks for your help. I tested your solution and it works; the issue is that the real files have 40 columns. But I will certainly try to learn from the regexp you used in the sed commands!

Hi grail,

Your awk solutions work! Thanks again for your help.

The SED script works over the complete line, not only over column 2.

Well, your solutions raise several questions for me, as always, sorry :-)

Could you help me with these doubts?
I've recently learned that "\1", "\2".."\9" is the way to remember patterns, but:

1) What do these parts of the SED script mean?
1.1) "sed -r -e '2,$s/..."?

1.2) "s@(.." and ..\1@g.. "?

1.3) It looks like it is not possible to tell SED to work on a specific column defined by a given field separator, only by using a regexp, right?

1.4) Do you know of a tester for Unix-style regexps (as used by SED or AWK) that runs on Windows?

2) The first awk script works nicely, but I've been trying without success to remove the trailing space character that is added to column 2 after the processing finishes.

3) Similarly, in the 2nd awk script, I've been trying without success to remove the extra "(" and ")" that are added around the years.

Many thanks in advance, and thanks to both of you.

Regards,

MTK358 05-01-2010 07:46 PM

Code:

#!/bin/sh

while read line
do
    echo $(echo "$line" | cut -d'|' -f1)'|'$(echo "$line" | cut -d'|' -f2 | convert first letter to uppercase)'|'$(echo "$line" | cut -d'|' -f3)
done
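One possible way to make this loop runnable is sketched below. It fills the "convert first letter to uppercase" step with the two GNU sed expressions from the first post, which is an assumption and not necessarily what was intended here. The function name `title_case_col2` is made up for the example, the header line is passed through unchanged, and every line is assumed to have at least three columns (`-f3-` keeps all remaining columns).

```shell
#!/bin/sh
# Sketch only: the uppercase step uses GNU sed (\L, \u and \< are GNU extensions).
# title_case_col2 is a hypothetical name; it reads the file on standard input.
title_case_col2() {
    first=1
    while IFS= read -r line
    do
        # Pass the header line through untouched.
        if [ "$first" -eq 1 ]; then
            first=0
            printf '%s\n' "$line"
            continue
        fi
        col1=$(printf '%s\n' "$line" | cut -d'|' -f1)
        col2=$(printf '%s\n' "$line" | cut -d'|' -f2 | sed -e 's/.*/\L&/' -e 's/\<./\u&/g')
        rest=$(printf '%s\n' "$line" | cut -d'|' -f3-)
        printf '%s|%s|%s\n' "$col1" "$col2" "$rest"
    done
}

# Usage with a real file: title_case_col2 < file.txt
printf '%s\n' 'PRODUCT No.|SCIENCE BOOKS|DESCRIPTION' \
    'Product 4|the voyage of the beagle (1845)|blah blah blah' | title_case_col2
```

Spawning `cut` and `sed` several times per line is slow on large files, so for a 40-column file the awk approaches in this thread are probably the better fit.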

grail 05-01-2010 11:42 PM

Quote:

Could you help me with these doubts?
I've recently learned that "\1", "\2".."\9" is the way to remember patterns, but:

1) What do these parts of the SED script mean?
1.1) "sed -r -e '2,$s/..."?

1.2) "s@(.." and ..\1@g.. "?

1.3) It looks like it is not possible to tell SED to work on a specific column defined by a given field separator, only by using a regexp, right?

1.4) Do you know of a tester for Unix-style regexps (as used by SED or AWK) that runs on Windows?

2) The first awk script works nicely, but I've been trying without success to remove the trailing space character that is added to column 2 after the processing finishes.

3) Similarly, in the 2nd awk script, I've been trying without success to remove the extra "(" and ")" that are added around the years.

1.1) I presume your issue is with "2,$". This is an address range: the sed command is only performed on lines 2 through the end of the file.

1.2) Are you confused by the "@" symbol? If so, you can use pretty much any delimiter you like. The norm is 's///', where I have used 's@@@'. If I have lots of other slashes or backslashes, i.e. "/" or "\", then I sometimes use this symbol.

1.3) No, it is possible; I just haven't worked out the kinks to get the capitalisation to work only within the delimited boundary.

2) I might need more information here, as based on the example I see no extra spaces?

3) Sorry about that one, my bad ... this fixes it:
Code:

awk 'BEGIN{OFS=FS="|"}
    NR>1{$2 = tolower($2);split($2,arr," ");
          for(x in arr)
              if(arr[x] ~ /^[a-z]/)
                  sub(arr[x],toupper(substr(arr[x],1,1))substr(arr[x],2),$2)
        }1' in.txt
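A possible variant, offered only as a sketch: `sub()` treats its first argument as a regular expression, which is why "(1632)" picked up extra parentheses before the fix, and a short word could in principle match inside a longer one. Rebuilding the column word by word avoids both issues. The function name `title_case_column` is made up for this example, it should run in any POSIX awk, and note that `split` on " " collapses repeated spaces.

```shell
# Hypothetical helper: title-case one "|"-separated column, skipping the header.
# Usage: title_case_column COLUMN < file
title_case_column() {
    awk -v col="$1" 'BEGIN { FS = OFS = "|" }
    NR > 1 {
        n = split(tolower($col), word, " ")
        out = ""
        for (i = 1; i <= n; i++) {
            w = word[i]
            # Only words starting with a letter are changed, so "(1632)" stays as is.
            if (w ~ /^[a-z]/)
                w = toupper(substr(w, 1, 1)) substr(w, 2)
            out = out (i > 1 ? " " : "") w
        }
        $col = out
    }
    { print }'
}

printf '%s\n' 'PRODUCT No.|SCIENCE BOOKS|DESCRIPTION' \
    'Product 2|Dialogue concerning the two chief world systems (1632)|blah blah blah' |
    title_case_column 2
```

Because the words are joined with a separator only between them, no trailing space is added to the column.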


cgcamal 05-02-2010 12:49 AM

Hi MTK358,

Thanks for your help. I tried to execute your script but just don't know how: I put the input file name at the end, as in SCRIPT inputfile, but it doesn't work. How do I run it?

grail,

Again and again thanks.
Quote:

1.1) I presume your issue is with "2,$". This is an address range: the sed command is only performed on lines 2 through the end of the file.
I was close on this one; now it's clear. Thanks!
Quote:

1.2) Are you confused by the "@" symbol? If so, you can use pretty much any delimiter you like. The norm is 's///', where I have used 's@@@'. If I have lots of other slashes or backslashes, i.e. "/" or "\", then I sometimes use this symbol.
Great explanation, great tip, great to know this. I had no idea about this SED feature. Thanks.
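A small sketch of both points, using made-up sample lines: the "2,$" address limits the substitution to line 2 onwards, and "@" as the delimiter means the "/" in the pattern needs no escaping. This should work in any sed.

```shell
# Sample input; the first line stands in for the header.
input=$(printf '%s\n' 'HEADER a/b' 'path a/b' 'path a/b')

# "2,$" restricts the substitution to lines 2 through the last line.
# "@" replaces "/" as the delimiter, so the slash in the pattern is literal.
result=$(printf '%s\n' "$input" | sed '2,$s@a/b@c/d@')

printf '%s\n' "$result"
# HEADER a/b
# path c/d
# path c/d
```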
Quote:

1.3) No, it is possible; I just haven't worked out the kinks to get the capitalisation to work only within the delimited boundary.
Ok.
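For what it's worth, here is one way this could be done in SED alone, as a sketch that assumes GNU sed (`\L`, `\u` and `\n` in the replacement are GNU extensions). It is not grail's solution. It lowercases column 2, drops a temporary newline marker right after the first "|", then moves that marker forward one word at a time while capitalising each word, and stops when only a "|" is left ahead.

```shell
input=$(printf '%s\n' 'PRODUCT No.|SCIENCE BOOKS|DESCRIPTION' \
    'Product 1|PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687)|Blah blah blah' \
    'Product 4|the voyage of the beagle (1845)|blah blah blah')

# Assumes GNU sed. Only column 2 of lines 2..$ is touched.
result=$(printf '%s\n' "$input" | sed -E '
2,$ {
    # Lowercase column 2 only.
    s/^([^|]*\|)([^|]*)/\1\L\2/
    # Put a newline marker at the start of column 2.
    s/^[^|]*\|/&\n/
    :word
    # Capitalise the next word after the marker, then move the marker past it.
    s/\n([^|a-z]*)([a-z])([a-z]*)/\1\u\2\3\n/
    t word
    # No word left before the next "|": remove the marker.
    s/\n//
}')

printf '%s\n' "$result"
```

Extending this to an arbitrary column means changing the first two expressions to skip more fields, at which point awk is probably the more readable tool.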
Quote:

2) I might need more information here as based on the example I have no extra spaces?
This is a small detail, only for learning purposes. When I execute the first awk script, the output contains an extra space at the end of column 2 in every line. See the comparison for the Product 2 line below:
Code:

.
Product 2|Dialogue Concerning The Two Chief World Systems (1632) |Blah blah blah| -->(With extra space at the end)
Product 2|Dialogue Concerning The Two Chief World Systems (1632)|Blah  blah blah|--> (correct output)

And the last question:
In your awk scripts, what does the "1" at the end mean?
Code:

awk '... ...}1' in.txt
Thanks for all your help.

grail 05-02-2010 01:44 AM

Quote:

first awk script, the output contains an extra space as last character in every line
Again my bad, as I removed that one after I came up with the other solution. Putting the "if" inside the "for" statement, as I showed in my last post, should remedy this.

Quote:

In your awk scripts, what does the "1" at the end mean?
The default action for awk is to print, try these two examples to see what the difference is:
Code:

awk '0' inputfile

awk '1' inputfile

Note: Any number greater than zero will work in the last example
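To see this without needing a file, here is a small sketch with made-up input. A rule that has a pattern but no action prints the record whenever the pattern is true, so `awk '1'` behaves like `awk '{ print }'`.

```shell
input=$(printf '%s\n' 'line one' 'line two')

# A pattern with no action: awk prints the record whenever the pattern is true.
all=$(printf '%s\n' "$input" | awk '1')            # always true: every line is printed
none=$(printf '%s\n' "$input" | awk '0')           # always false: nothing is printed
same=$(printf '%s\n' "$input" | awk '{ print }')   # explicit action, no pattern

printf 'awk 1 output:\n%s\n' "$all"
printf 'awk 0 output is empty: %s\n' "$( [ -z "$none" ] && echo yes || echo no )"
```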

As for MTK358's example:
Quote:

How is the way to run it?
Use this as the last line instead of the current "done"
Code:

done<inputfile

cgcamal 05-02-2010 02:21 AM

Quote:

The default action for awk is to print, try these two examples to see what the difference is:
Code:
awk '0' inputfile

awk '1' inputfile
Note: Any number greater than zero will work in the last example
0 or nothing acts like "do not print", and any number greater than 0 acts like "print". With every answer I learn a lot! Many thanks, grail.

Your help is really appreciated.

Best regards.

MTK358 05-02-2010 07:22 AM

My script was NOT a fully working program.

The part in bold (the "convert first letter to uppercase" step) had to be replaced with something I didn't know how to do. The script would be used like this:

Code:

SCRIPT < file.txt


All times are GMT -5.