LinuxQuestions.org
Old 04-25-2012, 06:28 PM   #1
standard_output
LQ Newbie
 
Registered: Apr 2012
Posts: 16

Rep: Reputation: Disabled
Character counting with sed/wc not working as expected.


Hello.

I am trying to count characters in a string containing a mix of letters and digits (part of a program that parses usernames/UIDs/groupnames/GIDs). My requirements are that it must be able to handle arbitrary input (UID vs. username) and must use Bash.

No problem, I thought: sed with a regex to strip out non-alpha chars, then pipe to wc and count either characters or bytes (I went with characters, but it doesn't change my output).

Perhaps my machine is hosed (I just restored the VM to its initial build; kernel is 2.6.18-128.el5, running RHEL 5.3).

Here is the weirdness:

Code:
echo "abc12" | sed 's/[^a-z][^A-Z]//g' | wc -m
4
echo "abc123" | sed 's/[^a-z][^A-Z]//g' | wc -m
5
echo "abc1234" | sed 's/[^a-z][^A-Z]//g' | wc -m
4
echo "abc12345" | sed 's/[^a-z][^A-Z]//g' | wc -m
5
For whatever reason, the result is always at least 1 higher than it should be. No problem there; it is easy to subtract one. The weirdness is this: if the number of digits is odd, wc finds an additional character; if the number of digits is even, it doesn't. I don't *see* anything wrong with the regex, but I am frankly baffled as to what is going on at this point. Any ideas? Does anyone else see the same behavior with sed/wc?

Thanks,

SO
 
Old 04-25-2012, 07:08 PM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
wc -m counts characters and wc -c counts bytes; for plain ASCII input like this they report the same number.
The echo command adds a trailing newline, which wc counts. Use echo's -n option to suppress it.

The regex should be [^a-zA-Z] or [^[:alpha:]].
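
A minimal sketch of both suggestions combined. The helper name count_alpha is made up for illustration, and the result assumes GNU sed, which does not add a newline when the input has none:

```shell
# count_alpha is a hypothetical helper name, not something from the thread.
# printf '%s' writes the string without a trailing newline, so wc only
# sees the characters that survive the sed filter (assuming GNU sed).
count_alpha() {
    printf '%s' "$1" | sed 's/[^[:alpha:]]//g' | wc -c | tr -d ' '
}

count_alpha "abc12"     # prints 3 with GNU sed
count_alpha "abc123"    # prints 3 with GNU sed
```

The trailing tr -d ' ' only strips the padding that some wc implementations put in front of the number.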

Last edited by jschiwal; 04-25-2012 at 07:20 PM.
 
Old 04-25-2012, 07:40 PM   #3
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,776

Rep: Reputation: 2212
The problem is with the regex.
Code:
sed 's/[^a-z][^A-Z]//g'
eliminates 2-character strings, the first of which is not a lower-case letter and the second of which is not an upper-case letter. When a run containing an odd number of digits is processed, the final digit has no partner to complete a 2-character match, so it remains in the output.
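
The pairing can be made visible by printing what sed leaves behind instead of counting it. This is only a sketch: strip_pairs is an illustrative name, and LC_ALL=C is set as an assumption so the letter ranges behave the same in every locale:

```shell
# strip_pairs is a hypothetical wrapper around the original pattern.
# LC_ALL=C pins the meaning of a-z and A-Z to plain ASCII ranges.
strip_pairs() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^a-z][^A-Z]//g'
}

for s in abc1 abc12 abc123 abc1234 a1b; do
    printf '%s -> %s\n' "$s" "$(strip_pairs "$s")"
done
# abc1 -> abc1
# abc12 -> abc
# abc123 -> abc3
# abc1234 -> abc
# a1b -> a
```

The last case shows a second problem with the pattern: a lower-case letter also counts as "not an upper-case letter", so a digit followed by a lower-case letter is deleted as a pair.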
 
Old 04-26-2012, 04:33 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
See if this enlightens you at all: swap wc -m for od -c.
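
For example, a sketch using one of the strings from the first post (show_bytes is an illustrative name only):

```shell
# show_bytes is a hypothetical helper: it runs the original pattern and
# dumps the result byte by byte instead of counting it.
show_bytes() {
    echo "$1" | sed 's/[^a-z][^A-Z]//g' | od -c
}

show_bytes "abc123"
```

The dump lists a, b, c, the leftover 3 and a final \n from echo, which accounts for the count of 5.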
 
Old 04-26-2012, 04:16 PM   #5
standard_output
LQ Newbie
 
Registered: Apr 2012
Posts: 16

Original Poster
Rep: Reputation: Disabled
The improved regex pattern fixed the issue. I had not realized that the trailing newline was what was adding the extra character, so thanks for pointing that out as well. The script runs very nicely at this point.
 
Old 04-27-2012, 12:13 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
Well, I would mention another option:
Code:
sed 's/[^[:alpha:]]//g'
 
Old 04-27-2012, 08:15 AM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
sed is overkill here; use tr. You might also try printf instead of echo, since it does not append a newline unless you ask for one.

Code:
printf '%s' 'abcde12345' | tr -cd '[:alpha:]' | wc -c
And if you store the text in a shell variable first, you can do everything in bash.

Code:
text='abcde12345'
text="${text//[^[:alpha:]]}"
echo "$text:${#text}"
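
The tr filter and the ${#var} length expansion can also be combined into a small reusable function. This is only a sketch: count_letters and the sample inputs are illustrative names, not part of the original script:

```shell
# count_letters is a hypothetical helper name.
# tr -cd deletes every character that is NOT alphabetic, and ${#letters}
# gives the length of what is left, so wc is not needed at all.
count_letters() {
    letters=$(printf '%s' "$1" | tr -cd '[:alpha:]')
    printf '%s\n' "${#letters}"
}

for id in root 1001 user42; do
    printf '%s: %s\n' "$id" "$(count_letters "$id")"
done
# root: 4
# 1001: 0
# user42: 4
```

A count of 0 for a non-empty argument means the input contained no letters, which is one way to tell a numeric UID or GID from a name.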

Last edited by David the H.; 04-27-2012 at 08:19 AM.
 
  


All times are GMT -5.
