[SOLVED] Perl's length() counts Umlauts multiple times

MaxistXXL · 12-14-2010, 02:51 AM

Hi,
I'm writing a script to gather statistical information about some texts. This includes calculating the mean word length.
Unfortunately, umlauts count as two characters each. In the example below the output is 9; it should be 6. Does anyone know how to solve this?

Sincerely, Max

Code:

#!/usr/bin/perl
use POSIX;
use locale;
my $test = length("ABCÄÖÜ");
print $test;

makyo · 12-14-2010, 11:41 AM

Hi.
I don't know or use non-default locales much. However, the following script in file p2:

Code:

#!/usr/bin/perl
use utf8;
my $test = length("ABCÄÖÜ");
print " Length of string is $test\n";

produces:

Code:

% ./p2
 Length of string is 6

I found that from reading

Code:

man perlunicode

Best wishes ... cheers, makyo

Context of execution:

Code:

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 (lenny) 
perl 5.10.0

MaxistXXL · 12-15-2010, 06:10 AM

Thank you very much for your help. Unfortunately, your tip does not work when I read the text from a file. A minimal example is given below. I'm still figuring out how to create a text file with a specified encoding, so I'm not sure exactly which one it uses. It should be ISO-8859-15 or UTF-8.

Do you know how to count characters correctly when reading the words from text files?

Code:

#!/usr/bin/perl
use utf8;
open(IN, "test.txt") || die "Error\n";
my @text = <IN>;
close IN;
chomp(@text);
print length($text[0]), "\n";

test.txt is a file which contains exactly one line which is "ÄÖÜ".
The output is 6, it should be 3.
The command man perlunicode does not work on Ubuntu 10.10. Is it important to install?

Code:

perl -v
This is perl, v5.10.1 (*) built for i686-linux-gnu-thread-multi

Should I update the perl-version?

sincerely, Max

makyo · 12-15-2010, 09:55 AM

Hi.

Here is a driver script that will cause the perl script listed to be run. The data file is listed, and the final part is where the results are. You should be able to copy and run the short perl script on your system.

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate perl utf8 functions.

# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts. 
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C perl
set -o nounset
pe

FILE=${1-data1}

# Section 2, display data and script file.
# Display samples or entire file.
pe " || start [ first:middle:last ]"
specimen 10 $FILE p4 \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"

# Section 3, solution.
pl " Results:"
./p4 $FILE

exit 0

producing:

Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.32-26-generic, i686
Distribution        : Ubuntu 10.04.1 LTS (lucid) 
GNU bash 4.1.5
perl 5.10.1

 || start [ first:middle:last ]
Whole: 10:0:10 of 3 lines in file "data1"
Normal first.
ABCÄÖÜ
Normal last.

Whole: 10:0:10 of 15 lines in file "p4"
#!/usr/bin/perl

# @(#) p4	Demonstrate utf8 functions.
# See "perldoc utf8"

my ( $aline, $uline, $t1, $t2 );

while ( $aline = <> ) {
  chomp($aline);
  $uline = $aline;
  $t1    = length($aline);
  utf8::decode($uline);
  $t2 = length($uline);
  print " ASCII line is \"$aline\" ($t1), utf8 line is \"$uline\" ($t2)\n";
}
 || end

-----
 Results:
 ASCII line is "Normal first." (13), utf8 line is "Normal first." (13)
 ASCII line is "ABCÄÖÜ" (9), utf8 line is "ABC���" (6)
 ASCII line is "Normal last." (12), utf8 line is "Normal last." (12)

This code was run in Ubuntu 10.04, perl 5.10.1 as noted.
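
As an alternative to calling utf8::decode on each line, perl can decode the data while reading if the file is opened with an encoding layer. The following is only a minimal sketch, assuming the file really is UTF-8; the file name test_utf8.txt is made up for the example, and the script writes the file itself so that it can be run as-is.

Code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file name; the bytes written are "ÄÖÜ" encoded as UTF-8.
my $file = "test_utf8.txt";
open(my $out, '>:raw', $file) or die "Cannot write $file: $!\n";
print {$out} "\xC3\x84\xC3\x96\xC3\x9C\n";
close $out;

# Read the line as raw bytes: length() counts bytes.
open(my $raw, '<:raw', $file) or die "Cannot read $file: $!\n";
my $bytes = <$raw>;
close $raw;
chomp $bytes;

# Read the line through a decoding layer: length() counts characters.
open(my $in, '<:encoding(UTF-8)', $file) or die "Cannot read $file: $!\n";
my $chars = <$in>;
close $in;
chomp $chars;

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
unlink $file;
```

On a UTF-8 file this should report 6 bytes and 3 characters. If the file turns out to be ISO-8859-15 instead, the layer name would have to change accordingly.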

Quote:

Originally Posted by MaxistXXL

The command man perlunicode does not work on Ubuntu 10.10. Is it important to install?

I would think that if you are working with perl, it would be useful to install the documentation. The utf8 information came from perldoc utf8.

Best wishes ... cheers, makyo

MaxistXXL · 12-16-2010, 03:41 AM

Thank you very much. The

Code:

utf8::decode($text)

helped me a lot.
Also, with "use utf8" I can use UTF-8 characters in identifiers. That's awesome.
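
For the original goal, the mean word length, a rough sketch of how utf8::decode might fit in is shown below. The helper name mean_word_length and the sample text are made up for illustration, and the input is assumed to be UTF-8 bytes, as it would be when read from a file without a decoding layer.

Code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# mean_word_length is a hypothetical helper, not something from the thread.
# It takes one line of raw UTF-8 bytes and returns the mean word length
# counted in characters (0 for a line without words).
sub mean_word_length {
    my ($line) = @_;
    utf8::decode($line);             # turn the bytes of our copy into characters
    my @words = split ' ', $line;    # split on runs of whitespace
    return 0 unless @words;
    my $total = 0;
    $total += length($_) for @words;
    return $total / @words;
}

# "Äpfel für Übung" as UTF-8 bytes, written with escapes so the example
# does not depend on the encoding of this source file.
my $bytes = "\xC3\x84pfel f\xC3\xBCr \xC3\x9Cbung";
printf "%.2f\n", mean_word_length($bytes);
```

The three words have 5, 3 and 5 characters, so this prints 4.33. Without the utf8::decode call each umlaut would count twice and the result would be too high.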