[SOLVED] Perl's length() counts Umlauts multiple times
Hi,
I'm writing a script to gather statistics about some texts, including the mean word length.
Unfortunately, each umlaut counts as two characters. In the example below the output is 9, but it should be 6. Does anyone know how to solve this?
Sincerely, Max
Code:
#!/usr/bin/perl
use POSIX;
use locale;
my $test = length("ABCÄÖÜ");
print $test;
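A likely explanation, assuming the script file is saved as UTF-8: without the `utf8` pragma, Perl treats the string literal as raw bytes, and each umlaut takes two bytes in UTF-8, so length() reports 3 + 3×2 = 9. The sketch below enables the pragma and reports both counts; the helper name `char_and_byte_length` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this script file is saved as UTF-8

# Hypothetical helper: returns (character count, UTF-8 byte count).
sub char_and_byte_length {
    my ($string) = @_;
    my $bytes = $string;     # work on a copy; utf8::encode changes its argument
    utf8::encode($bytes);    # characters -> UTF-8 octets
    return ( length($string), length($bytes) );
}

my ( $chars, $bytes ) = char_and_byte_length("ABCÄÖÜ");
print "characters: $chars, UTF-8 bytes: $bytes\n";
```

If the file really is UTF-8, the character count should come out as 6 and the byte count as 9. Note that `use utf8;` only affects literals in the source code, not data read from files or standard input.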
Thank you very much for your help. Unfortunately, your tip does not work when I read the text from a file. A minimal example is given below. I'm still figuring out how to create a text file with a specified encoding, so I'm not sure exactly which encoding it uses. It should be ISO-8859-15 or UTF-8.
Do you know how to count characters correctly when reading the words from text files?
Code:
#!/usr/bin/perl
use utf8;
open(IN, "test.txt") || die "Error\n";
my @text = <IN>;
close IN;
chomp(@text);
print length($text[0]), "\n";
test.txt is a file that contains exactly one line: "ÄÖÜ".
The output is 6, but it should be 3.
The command man perlunicode does not work on Ubuntu 10.10. Is it important to install?
Code:
perl -v
This is perl, v5.10.1 (*) built for i686-linux-gnu-thread-multi
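An output of 6 suggests the file is UTF-8 and is being read as raw bytes; `use utf8;` does not change that, because it only applies to the script's own source text. One possible approach, sketched here under the assumption that the file encoding is known, is to open the file with an `:encoding(...)` layer so lines are decoded as they are read. The helper `first_line_length` and the file name `test_utf8.txt` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the first line of a file, decode it from the
# given encoding, and return its length in characters.
sub first_line_length {
    my ( $path, $encoding ) = @_;
    open( my $in, "<:encoding($encoding)", $path )
        or die "Cannot open $path: $!\n";
    my $line = <$in>;
    close $in;
    chomp $line;
    return length $line;
}

# Create a small test file so the example is self-contained.
# The six bytes below are "ÄÖÜ" encoded as UTF-8, followed by a newline.
my $path = "test_utf8.txt";
open( my $out, '>:raw', $path ) or die "Cannot write $path: $!\n";
print {$out} "\xC3\x84\xC3\x96\xC3\x9C\n";
close $out;

print first_line_length( $path, 'UTF-8' ),       "\n";    # 3
print first_line_length( $path, 'ISO-8859-15' ), "\n";    # 6, wrong guess
unlink $path;
```

Decoding the same bytes as ISO-8859-15 gives 6 again, since every byte then counts as one character. A length of 3 with the ISO-8859-15 layer on your own file would instead indicate that the file is not UTF-8.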
Here is a driver script that runs the short Perl script p4. Its output lists the data file and the Perl script, with the results at the end. You should be able to copy and run the Perl script on your system.
Code:
#!/usr/bin/env bash
# @(#) s1 Demonstrate perl utf8 functions.
# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts.
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C perl
set -o nounset
pe
FILE=${1-data1}
# Section 2, display data and script file.
# Display samples or entire file.
pe " || start [ first:middle:last ]"
specimen 10 $FILE p4 \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"
# Section 3, solution.
pl " Results:"
./p4 $FILE
exit 0
producing:
Code:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.32-26-generic, i686
Distribution : Ubuntu 10.04.1 LTS (lucid)
GNU bash 4.1.5
perl 5.10.1
|| start [ first:middle:last ]
Whole: 10:0:10 of 3 lines in file "data1"
Normal first.
ABCÄÖÜ
Normal last.
Whole: 10:0:10 of 15 lines in file "p4"
#!/usr/bin/perl
# @(#) p4 Demonstrate utf8 functions.
# See "perldoc utf8"
my ( $aline, $uline, $t1, $t2 );
while ( $aline = <> ) {
chomp($aline);
$uline = $aline;
$t1 = length($aline);
utf8::decode($uline);
$t2 = length($uline);
print " ASCII line is \"$aline\" ($t1), utf8 line is \"$uline\" ($t2)\n";
}
|| end
-----
Results:
ASCII line is "Normal first." (13), utf8 line is "Normal first." (13)
ASCII line is "ABCÄÖÜ" (9), utf8 line is "ABC���" (6)
ASCII line is "Normal last." (12), utf8 line is "Normal last." (12)
This code was run on Ubuntu 10.04 with perl 5.10.1, as noted.
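The `���` in the second result line is probably a display issue rather than a counting error: the decoded string is printed without an output layer, so Perl writes each umlaut as a single Latin-1 byte, which a UTF-8 terminal cannot render. The sketch below assumes a UTF-8 terminal, decodes byte strings with the core Encode module, sets an output layer on STDOUT, and also computes the mean word length the original question was after. The helper `mean_word_length` is hypothetical, and it counts punctuation attached to a word as part of that word.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the terminal expects UTF-8 output.
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical helper: mean length, in characters, of the
# whitespace-separated words in an already decoded string.
sub mean_word_length {
    my ($text) = @_;
    my @words = split ' ', $text;
    return 0 unless @words;
    my $total = 0;
    $total += length($_) for @words;
    return $total / @words;
}

# The same three lines as in data1, given here as raw UTF-8 bytes.
my @raw_lines = (
    "Normal first.",
    "ABC\xC3\x84\xC3\x96\xC3\x9C",
    "Normal last.",
);

for my $raw (@raw_lines) {
    my $text = decode( 'UTF-8', $raw );    # bytes -> characters
    printf "\"%s\": %d characters, mean word length %.2f\n",
        $text, length($text), mean_word_length($text);
}
```

With the output layer in place, the second line should display as "ABCÄÖÜ" with 6 characters.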
Quote:
Originally Posted by MaxistXXL
The command man perlunicode does not work on Ubuntu 10.10. Is it important to install?
I would think that if you are working with Perl, it would be useful to install the documentation. The utf8 information came from perldoc utf8.