LinuxQuestions.org

standard_output 04-25-2012 06:28 PM

Character counting with sed/wc not working as expected.
 
Hello.

I am trying to count the alphabetic characters in a string containing a mix of letters and digits (part of a program that parses usernames/UIDs/groupnames/GIDs). My requirements are that it must be able to handle arbitrary input (UID vs. username) and must use Bash.

No problem, I thought - sed with a regex to strip out non-alpha chars, pipe to wc and count either characters or bytes (I went with characters, but it doesn't change my output).

Perhaps my machine is hosed (I just restored the VM to its initial build; kernel is 2.6.18-128.el5, running RHEL 5.3).

Here is the weirdness:

Code:

echo "abc12" | sed 's/[^a-z][^A-Z]//g' | wc -m
4
echo "abc123" | sed 's/[^a-z][^A-Z]//g' | wc -m
5
echo "abc1234" | sed 's/[^a-z][^A-Z]//g' | wc -m
4
echo "abc12345" | sed 's/[^a-z][^A-Z]//g' | wc -m
5

For whatever reason, the result is always at least 1 higher than it should be. No problem there, it is easy to subtract one. The weirdness is this: if the number of digits is odd, wc finds an additional character; if the number of digits is even, it doesn't. I don't *see* anything wrong with the regex, but am frankly baffled as to what is going on at this point. Any ideas? Does anyone else see the same behavior with sed/wc?

Thanks,

SO

jschiwal 04-25-2012 07:08 PM

wc's -m option counts characters and -c counts bytes; for plain ASCII input like this, both give the same result.
The echo command adds a trailing newline, which wc counts as well. Use echo's -n option to suppress it.

The regex should be [^a-zA-Z] or [^[:alpha:]]
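
A minimal sketch combining both fixes, assuming GNU sed (which does not add a newline that was missing from its input); `count_alpha` is a hypothetical helper name, not something from the original script:

```shell
# count_alpha (hypothetical helper): print how many letters the first argument contains.
count_alpha() {
    # printf '%s' writes the string without a trailing newline, so wc has nothing extra to count.
    # The single bracket expression [^[:alpha:]] deletes each non-letter on its own.
    # tr -d ' ' strips the padding that some wc implementations put around the number.
    printf '%s' "$1" | sed 's/[^[:alpha:]]//g' | wc -m | tr -d ' '
}

count_alpha "abc12"    # 3
count_alpha "abc123"   # 3
```

With this version an odd or even number of digits makes no difference to the count.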

rknichols 04-25-2012 07:40 PM

The problem is with the regex.
Code:

sed 's/[^a-z][^A-Z]//g'
eliminates 2-character strings, the first of which is not a lowercase letter and the second of which is not an uppercase letter. When a string containing an odd number of non-matching characters is processed, that final character cannot match a 2-character string and will remain in the output.
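
To make the pairing visible, here is a small sketch that wraps each match in angle brackets instead of deleting it. `mark_pairs` is a hypothetical name and the `<>` markers are only for illustration:

```shell
# mark_pairs (hypothetical helper): show what the original pattern consumes.
# In the replacement, & stands for the matched text.
mark_pairs() {
    printf '%s\n' "$1" | sed 's/[^a-z][^A-Z]/<&>/g'
}

mark_pairs "abc12"     # abc<12>
mark_pairs "abc123"    # abc<12>3
mark_pairs "abc1234"   # abc<12><34>
mark_pairs "1abc"      # <1a>bc
```

The last call shows a second hazard: a lowercase letter satisfies `[^A-Z]`, so the pattern can also swallow a letter that directly follows a digit.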

grail 04-26-2012 04:33 AM

See if this enlightens you at all: change wc -m to od -c.
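
For instance, a sketch along these lines dumps every byte that reaches the end of the original pipeline (`show_bytes` is a hypothetical name):

```shell
# show_bytes (hypothetical helper): feed the original pipeline into od -c instead of wc.
# od -c prints one column per byte and shows the newline as \n.
show_bytes() {
    echo "$1" | sed 's/[^a-z][^A-Z]//g' | od -c
}

show_bytes "abc123"   # the dump lists a, b, c, 3 and \n
```

The leftover digit and the newline added by echo both show up in the dump, which accounts for the count of 5.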

standard_output 04-26-2012 04:16 PM

The improved regex pattern fixed the issue. I had not previously thought that the return character was what was adding the extra char, but thanks for pointing that out as well. Script runs very nicely at this point.

grail 04-27-2012 12:13 AM

Well, I would mention another option:
Code:

sed 's/[^[:alpha:]]//g'

David the H. 04-27-2012 08:15 AM

sed is overkill here; use tr. You might also try printf instead of echo, since printf '%s' does not append a newline.

Code:

printf '%s' 'abcde12345' | tr -cd '[:alpha:]' | wc -c
And if you store the text in a shell variable first, you can do everything in bash.

Code:

text='abcde12345'
text="${text//[^[:alpha:]]}"
echo "$text:${#text}"
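
Wrapped in a function, the same parameter expansion might look like the sketch below. `alpha_len` is a hypothetical name, and the numeric-UID call is just an assumed example of the "arbitrary input" mentioned in the question:

```shell
# alpha_len (hypothetical helper): print the number of letters in the first argument
# using only Bash parameter expansion (no sed, tr, or wc).
alpha_len() {
    local letters="${1//[^[:alpha:]]/}"   # delete every non-letter
    echo "${#letters}"                    # length of what is left
}

alpha_len "abcde12345"   # 5
alpha_len "1000"         # 0, e.g. a purely numeric UID
```

A result of 0 would then be one way to tell a numeric UID/GID from a name.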


