sed: space between capitalized letters

K-Veikko · 04-04-2012, 01:54 PM

I am using sed to feed festival TTS.

Is there a way to use sed to insert a space [ ] between the letters of a CAPITALIZED word of any length, when all of its letters are capitals?

USA -> U S A
USA. -> U S A.
USA. -> U S A (also OK to lose dots etc., because the output is only listened to)
XXXIIV -> X X X I I V
Word -> Word

- How do I avoid adding spaces if only the first letter is capitalized?

druuna · 04-05-2012, 08:20 AM

Hi,

If I assume your example is relevant (there's not much info in your post) then this works:

Code:

sed -r '/\<[A-Z]+\>/{s/./& /g;s/\.//}' infile

The address /\<[A-Z]+\>/ selects lines that contain a word made up only of capital letters. The \< and \> make sure that whole words are matched.
The first command, s/./& /g, changes each character to that character followed by a space.
The second command, s/\.//, removes the trailing dot (more precisely, the first dot on the line).

Here's an example run:

Code:

$ cat infile
USA
USA.
XXXIIV
Word

$ sed -r '/\<[A-Z]+\>/{s/./& /g;s/\.//}' infile
U S A 
U S A  
X X X I I V 
Word

BTW: The above will not work when there are multiple words on one line, because the substitution adds a space after every character of any line that contains an all-caps word.
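
One possible way around that limitation, offered as a minimal sketch rather than a drop-in fix: loop on a substitution that splits the last letter off every all-caps word until only single letters remain. It assumes GNU sed (for -r, \< and \>). The function name space_caps is made up for illustration, and [[:upper:]] is used instead of [A-Z] so the match does not depend on how the locale orders the A-Z range.

```shell
# Hypothetical helper (name is an assumption): put a space between the
# letters of every word that consists only of capital letters.
space_caps() {
  # :a  marks the top of the loop
  # s   splits the last capital off each all-caps word ("USA" -> "US A")
  # ta  jumps back to :a as long as the substitution changed something
  sed -r -e ':a' \
         -e 's/\<([[:upper:]]+)([[:upper:]])\>/\1 \2/g' \
         -e 'ta'
}

printf 'The USA and EU are Words.\n' | space_caps
# prints: The U S A and E U are Words.
```

Punctuation is left in place and mixed-case words are not touched, because the pattern needs at least two consecutive capitals that make up a whole word.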

Hope this helps.

colucix · 04-05-2012, 08:32 AM

Here is a solution in awk:

Code:

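{
  for (i = 1; i <= NF; i++)
    # true when the field contains no lowercase letters
    if ( $i == toupper($i)) {
      gsub(/[[:punct:]]/,"",$i)
      gsub(/./,"& ",$i)
      gsub(/ +$/,"",$i)
    }
}
# the pattern 1 is always true, so every (possibly modified) line is printed
1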

The first gsub removes punctuation, the second adds a space after each character, and the third removes the extra blank at the end of the word. Anyway, since you've posted in the Other *NIX forum, it might not work for you. Which system are you running? And which version of sed or awk/nawk/gawk do you have?
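
For use inside a pipeline, the same program can also be written on one line, with the statements separated by semicolons. This is only a sketch: the wrapper name space_caps_awk is hypothetical, and it assumes an awk that understands POSIX character classes such as [[:punct:]] (gawk does; very old mawk releases may not).

```shell
# Hypothetical wrapper (name is an assumption) around the awk program above,
# written as a one-liner so it can sit in the middle of a pipeline.
space_caps_awk() {
  awk '{ for (i = 1; i <= NF; i++) if ($i == toupper($i)) { gsub(/[[:punct:]]/, "", $i); gsub(/./, "& ", $i); gsub(/ +$/, "", $i) } } 1'
}

printf 'The USA. and EU\n' | space_caps_awk
# prints: The U S A and E U
```

Note that a field without any letters, such as 2012, also equals its own toupper() value, so it would be spaced out as 2 0 1 2.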

K-Veikko · 04-07-2012, 07:52 AM

Quote:

Originally Posted by druuna

Hi,

If I assume your example is relevant (there's not much info in your post) then this works:

Code:

sed -r '/\<[A-Z]+\>/{s/./& /g;s/\.//}' infile

I think I can use this, because festival handles the words one by one anyway. I just had not thought of the possibility of writing one word per line.

My current "script" is in the next comment.

K-Veikko · 04-07-2012, 07:53 AM

Quote:

Originally Posted by colucix

Here is a solution in awk:

Code:

{
  for (i = 1; i <= NF; i++)
    if ( $i == toupper($i)) {
      gsub(/[[:punct:]]/,"",$i)
      gsub(/./,"& ",$i)
      gsub(/ +$/,"",$i)
    }     
}
1

The first gsub removes punctuation, the second adds a space after each character, and the third removes the extra blank at the end of the word. Anyway, since you've posted in the Other *NIX forum, it might not work for you. Which system are you running? And which version of sed or awk/nawk/gawk do you have?

It might work, but I don't know how to write it as a single line.

I am using Ubuntu 11.04 and
sed --version
GNU sed versio 4.2.1

My current "script" is:

Code:

sed 's/\-\{1,\}\|\–\{1,\}\|\?\{1,\}\|\!\{1,\}\|\;\{1,\}\|\:\{1,\}\|\,\{1,\}\|\.\{1,\}\|\^\{1,\}\|\"\{1,\}\|\/\{1,\}\|\«\{1,\}\|\»\{1,\}/\n\n/g' input.txt \
  | sed 's/\§/pykälä/g' \
  | sed 's/klo /kello /g' \
  | iconv -f UTF-8 -t ISO8859-1 -c \
  | text2wave -otype wav -eval '(language_finnish)' -o - \
  | lame - output.mp3
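
If per-word letter spacing were added to this pipeline, it could run as one more sed stage next to the word replacements. The following is only a sketch under assumptions: GNU sed, a hypothetical function name prepare_text, and just the two replacements from the script above. The punctuation-splitting stage and the iconv, text2wave and lame stages are left out here and would stay as they are.

```shell
# Hypothetical text-preparation stage (name is an assumption).
prepare_text() {
  # Stage 1: space out the letters of all-caps words ("USA" -> "U S A").
  # Stage 2: expand the section sign and the "klo" abbreviation, as in the script.
  sed -r -e ':a' -e 's/\<([[:upper:]]+)([[:upper:]])\>/\1 \2/g' -e 'ta' \
    | sed -e 's/§/pykälä/g' -e 's/klo /kello /g'
}

printf 'USA klo 12\n' | prepare_text
# prints: U S A kello 12
```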

fakie_flip · 05-15-2012, 11:42 AM

Quote:

Originally Posted by K-Veikko

I am using sed to feed festival TTS.

http://linuxinnovations.blogspot.com...to-speach.html