[SOLVED] Character counting with sed/wc not working as expected.

standard_output · 04-25-2012, 06:28 PM

Hello.

I am trying to count the characters in a string containing a mix of letters and digits (part of a program that parses usernames/UIDs/groupnames/GIDs). My requirements are that it must be able to handle arbitrary input (UID vs. username) and must use Bash.

No problem, I thought: sed with a regex to strip out the non-alphabetic characters, piped to wc to count either characters or bytes (I went with characters, but it doesn't change my output).

Perhaps my machine is hosed (I just restored the VM to its initial build; kernel is 2.6.18-128.el5, running RHEL 5.3).

Here is the weirdness:

Code:

echo "abc12" | sed 's/[^a-z][^A-Z]//g' | wc -m
4
echo "abc123" | sed 's/[^a-z][^A-Z]//g' | wc -m
5
echo "abc1234" | sed 's/[^a-z][^A-Z]//g' | wc -m
4
echo "abc12345" | sed 's/[^a-z][^A-Z]//g' | wc -m
5

For whatever reason, the result is always at least 1 higher than it should be. No problem there; it is easy to subtract one. The weirdness is this: if the number of digits is odd, wc finds an additional character. If the number of digits is even, wc doesn't find the additional character. I don't *see* anything wrong with the regex, but I am frankly baffled as to what is going on at this point. Any ideas? Does anyone else see the same behavior with sed/wc?

Thanks,

SO

jschiwal · 04-25-2012, 07:08 PM

wc -m counts characters and wc -c counts bytes; for plain ASCII input like this the two give the same number.
The echo command appends a newline, and wc counts it. Use echo's -n option to suppress it.

The regex should be [^a-zA-Z] or [^[:alpha:]]
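
To check both points together, here is a minimal sketch. The helper name count_letters is made up for illustration and is not part of the original script. It uses printf rather than echo -n and strips the newline explicitly, so the count does not depend on how a particular sed treats the end of the input.

```shell
# count_letters is a hypothetical helper name used only for this example.
count_letters() {
    # printf sends the argument followed by exactly one newline,
    # sed deletes every single character that is not a letter,
    # tr removes the newline so that wc sees only the letters.
    printf '%s\n' "$1" | sed 's/[^[:alpha:]]//g' | tr -d '\n' | wc -m
}

count_letters "abc123"
count_letters "abc1234"
```

Both calls should report 3, whether the number of digits is odd or even. Some wc implementations pad the number with leading spaces.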

rknichols · 04-25-2012, 07:40 PM

The problem is with the regex.

Code:

sed 's/[^a-z][^A-Z]//g'

eliminates 2-character strings, the first of which is not a lower case letter and the second of which is not an upper-case letter. When a string containing an odd number of non-matching characters is processed, that final character cannot match a 2-character string and will remain in the output.
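
The pairing can be seen by printing what sed leaves behind for a few inputs. This is only an illustrative sketch: strip_pairs is a hypothetical name, and LC_ALL=C is set as an assumption so that the a-z and A-Z ranges mean plain ASCII letters regardless of the system locale.

```shell
# strip_pairs is a hypothetical helper that applies the original regex.
strip_pairs() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^a-z][^A-Z]//g'
}

strip_pairs "abc12"    # "12" is removed as one pair
strip_pairs "abc123"   # "12" is removed, "3" has no partner and stays
strip_pairs "abc1234"  # "12" and "34" are both removed
strip_pairs "ab1cd"    # "1c" is removed: "c" is "not upper case" too
```

The last call shows a second side effect: because a lower-case letter satisfies [^A-Z], a letter that follows a digit is deleted along with it.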

grail · 04-26-2012, 04:33 AM

See if this enlightens you at all: replace wc -m with od -c in your pipeline.
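
As a rough sketch of that suggestion, the original pipeline can be wrapped in a small function (show_bytes is a hypothetical name) so that od -c prints every character that would have reached wc:

```shell
# show_bytes is a hypothetical helper: same pipeline as the original
# post, but dumping the characters instead of counting them.
show_bytes() {
    echo "$1" | sed 's/[^a-z][^A-Z]//g' | od -c
}

show_bytes "abc123"
```

The dump should list a, b, c, the leftover 3, and a \n at the end, which accounts for both the stray digit and the extra character added by echo.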

standard_output · 04-26-2012, 04:16 PM

The improved regex pattern fixed the issue. I had not realized that the trailing newline was what added the extra character, so thanks for pointing that out as well. The script runs very nicely at this point.

grail · 04-27-2012, 12:13 AM

Well, I would mention another option:

Code:

sed 's/[^[:alpha:]]//g'

David the H. · 04-27-2012, 08:15 AM

sed is overkill here; use tr. You might also try printf instead of echo, since it does not add a newline unless you ask for one.

Code:

printf '%s' 'abcde12345' | tr -cd '[:alpha:]' | wc -c

And if you store the text in a shell variable first, you can do everything in bash.

Code:

text='abcde12345'
text="${text//[^[:alpha:]]}"
echo "$text:${#text}"
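
The same parameter expansion can be wrapped in a function for reuse in a script. The sketch below requires bash. The name letters_in and the rule "no letters means a numeric ID" are assumptions made for illustration, not something taken from the original program.

```shell
# letters_in is a hypothetical helper: prints how many letters "$1" contains.
letters_in() {
    local only_letters="${1//[^[:alpha:]]}"   # delete every non-letter
    printf '%s\n' "${#only_letters}"
}

# Assumption: an entry with no letters at all is treated as a numeric ID.
for entry in root 1000 abc123; do
    count=$(letters_in "$entry")
    if [ "$count" -eq 0 ]; then
        echo "$entry: looks numeric"
    else
        echo "$entry: contains $count letters"
    fi
done
```

Because nothing is piped through an external command, there is no newline to account for and no subprocess per string beyond the command substitution itself.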