LinuxQuestions.org
Old 12-14-2010, 02:51 AM   #1
MaxistXXL
Member
 
Registered: Dec 2006
Distribution: Debian/etch
Posts: 36

Rep: Reputation: 15
Perl's length() counts Umlauts multiple times


Hi,
I'm writing a script to gather statistical information about some texts. This includes calculating the mean word length.
Unfortunately, umlauts count as two characters. In the example below the output is 9, but it should be 6. Does anyone know how to solve this?

Sincerely, Max

Code:
#!/usr/bin/perl
use POSIX;
use locale;
my $test = length("ABC");
print $test;
 
Old 12-14-2010, 11:41 AM   #2
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.
I don't know or use non-default locales much. However, the following script on file p2:
Code:
#!/usr/bin/perl
use utf8;
my $test = length("ABC");
print " Length of string is $test\n";
produces:
Code:
% ./p2
 Length of string is 6
I found that from reading
Code:
man perlunicode
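As a minimal sketch of what `use utf8` changes: it tells perl that the source file itself is UTF-8, so string literals are decoded into characters at compile time. The sample string below ("ABC" plus three umlauts) is an assumption for illustration, and the file must be saved as UTF-8 for the counts to hold.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                 # string literals in this file are decoded as UTF-8
use Encode qw(encode);

# Hypothetical sample string: "ABC" followed by three umlauts.
my $word = "ABCäöü";

# length() counts characters, because the literal was decoded at compile time.
my $char_count = length($word);

# Encoding the text back to UTF-8 gives the byte count seen without "use utf8".
my $byte_count = length(encode('UTF-8', $word));

print "characters: $char_count, bytes: $byte_count\n";
```

Under those assumptions this prints `characters: 6, bytes: 9`. Note that the pragma only affects text written in the script itself, not data read from files or standard input.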
Best wishes ... cheers, makyo

Context of execution:
Code:
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 (lenny) 
perl 5.10.0
 
Old 12-15-2010, 06:10 AM   #3
MaxistXXL
Member
 
Registered: Dec 2006
Distribution: Debian/etch
Posts: 36

Original Poster
Rep: Reputation: 15
Thank you very much for your help. Unfortunately, your tip does not work when I read the text from a file. A minimal example is given below. I'm still figuring out how to create a text file with a specified encoding, so I'm not sure exactly which encoding the file uses. It should be ISO-8859-15 or UTF-8.

Do you know how to count characters correctly when reading the words from text files?

Code:
#!/usr/bin/perl
use utf8;
open(IN, "test.txt") || die "Error\n";
my @text = <IN>;
close IN;
chomp(@text);
print length($text[0]), "\n";
test.txt is a file which contains exactly one line which is "".
The output is 6, it should be 3.
The command man perlunicode does not work on Ubuntu 10.10. Is it important to install?

Code:
perl -v
This is perl, v5.10.1 (*) built for i686-linux-gnu-thread-multi
Should I update the perl-version?

sincerely, Max
 
Old 12-15-2010, 09:55 AM   #4
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.

Here is a driver script that will cause the perl script listed to be run. The data file is listed, and the final part is where the results are. You should be able to copy and run the short perl script on your system.
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate perl utf8 functions.

# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts. 
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C perl
set -o nounset
pe

FILE=${1-data1}

# Section 2, display data and script file.
# Display samples or entire file.
pe " || start [ first:middle:last ]"
specimen 10 $FILE p4 \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"

# Section 3, solution.
pl " Results:"
./p4 $FILE

exit 0
producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.32-26-generic, i686
Distribution        : Ubuntu 10.04.1 LTS (lucid) 
GNU bash 4.1.5
perl 5.10.1

 || start [ first:middle:last ]
Whole: 10:0:10 of 3 lines in file "data1"
Normal first.
ABC
Normal last.

Whole: 10:0:10 of 15 lines in file "p4"
#!/usr/bin/perl

# @(#) p4	Demonstrate utf8 functions.
# See "perldoc utf8"

my ( $aline, $uline, $t1, $t2 );

while ( $aline = <> ) {
  chomp($aline);
  $uline = $aline;
  $t1    = length($aline);
  utf8::decode($uline);
  $t2 = length($uline);
  print " ASCII line is \"$aline\" ($t1), utf8 line is \"$uline\" ($t2)\n";
}
 || end

-----
 Results:
 ASCII line is "Normal first." (13), utf8 line is "Normal first." (13)
 ASCII line is "ABC" (9), utf8 line is "ABC���" (6)
 ASCII line is "Normal last." (12), utf8 line is "Normal last." (12)
This code was run in Ubuntu 10.04, perl 5.10.1 as noted.
Quote:
Originally Posted by MaxistXXL
The command man perlunicode does not work on Ubuntu 10.10. Is it important to install?
I would think that if you are working with perl, it would be useful to install documentation. The utf8 information came from perldoc utf8
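An alternative to calling utf8::decode on every line is to let perl decode the data while reading, by opening the file with an `:encoding(UTF-8)` layer. The following is only a sketch: it assumes the input really is UTF-8, and it creates its own temporary file (containing three umlauts written as escapes) so that it can be run anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway UTF-8 file so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "\x{E4}\x{F6}\x{FC}\n";    # three umlauts (hypothetical sample)
close $tmp;

# Read the line as raw bytes: length() counts bytes.
open(my $raw, '<:raw', $path) or die "Cannot open $path: $!\n";
my $bytes = <$raw>;
close $raw;
chomp $bytes;

# Read the same line through a decoding layer: length() counts characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!\n";
my $text = <$in>;
close $in;
chomp $text;

print length($bytes), " bytes, ", length($text), " characters\n";
```

This prints `6 bytes, 3 characters`. If the file turns out to be ISO-8859-15 instead, the layer name would have to be changed accordingly.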

Best wishes ... cheers, makyo
 
Old 12-16-2010, 03:41 AM   #5
MaxistXXL
Member
 
Registered: Dec 2006
Distribution: Debian/etch
Posts: 36

Original Poster
Rep: Reputation: 15
Thank you very much. The
Code:
utf8::decode($text)
helped me a lot.
Also, with "use utf8", I can use UTF-8 characters in identifiers. That's awesome.
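For the original goal of the thread, the mean word length, a sketch could look like the following once the text has been decoded into characters. The helper name mean_word_length and the sample line are hypothetical, and words are assumed to be separated by whitespace only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Return the mean length, in characters, of the whitespace-separated words
# in an already-decoded string; returns 0 if the string contains no words.
sub mean_word_length {
    my ($text) = @_;
    my @words = split ' ', $text;
    return 0 unless @words;
    return sum(map { length($_) } @words) / scalar(@words);
}

# Hypothetical sample, written with an escape so the file encoding does not matter.
my $line = "\x{C4}pfel und Birnen";    # A-umlaut, then "pfel und Birnen"
printf "%.2f\n", mean_word_length($line);
```

The sample has words of 5, 3 and 6 characters, so this prints `4.67`. Passing undecoded bytes instead would count each umlaut twice and skew the mean upwards.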
 
  

