[SOLVED] Matching two character strings for membership

danielbmartin · 08-23-2017, 09:56 PM

With awk ...
the need is to determine whether a character string (TestString) is composed entirely of characters found in a second character string (ReferenceString).

Examples:
If ReferenceString="gacbagjk", then
TestString="cabba" would result in 1.
TestString="aaaaaaaa" would result in 1.
TestString="cGbba" would result in 0.
TestString="RRR" would result in 0.

I've done this by testing each character in TestString individually, but that seems like a brute-force method.
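
For reference, the per-character approach described above might look roughly like the sketch below. This is not the actual code from the original post; the function name is_member and the variable names ref and str are made up for illustration, and treating an empty TestString as 0 is an assumption.

```shell
# is_member is a hypothetical helper: prints 1 if every character of the
# second argument occurs in the first argument (the reference string), else 0.
is_member() {
  awk -v ref="$1" -v str="$2" 'BEGIN {
    ok = (length(str) > 0)          # assumption: an empty string counts as a non-match
    for (i = 1; ok && i <= length(str); i++)
      if (index(ref, substr(str, i, 1)) == 0)   # index() returns 0 if the character is absent
        ok = 0
    print ok
  }'
}

is_member gacbagjk cabba   # prints 1
is_member gacbagjk cGbba   # prints 0
```

The loop stops at the first character that is missing from the reference string, so it does no more work than necessary, but it is still one index() call per character.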

Is there a slick loopless way to do it?

Daniel B. Martin

syg00 · 08-23-2017, 10:34 PM

Character class ?

Turbocapitalist · 08-23-2017, 11:00 PM

A bracket expression would do that. It matches any member of the set enclosed by [ and ]. So [abc]+ would match any string that includes one or more of 'a', 'b', or 'c', such as 'bat', 'acre', or 'beer', but not 'dog', and so on.

Code:

awk '/[abc]+/ { print "yes", $0 }' inputfile.txt

See "man 7 regex"
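
A small sketch of the difference between "includes one of these characters" and "consists only of these characters". The sample words come from the post above, except 'cab', which is added here for illustration. Without anchors the bracket expression only has to match somewhere in the line; with ^ and $ the whole line has to be drawn from the set.

```shell
# Unanchored: matches any line that contains at least one of a, b, c.
printf '%s\n' bat acre dog cab | awk '/[abc]/ { print "contains:", $0 }'
# prints: contains: bat, contains: acre, contains: cab (one per line)

# Anchored: matches only lines made up entirely of a, b, c.
printf '%s\n' bat acre dog cab | awk '/^[abc]+$/ { print "only:", $0 }'
# prints: only: cab
```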

astrogeek · 08-23-2017, 11:35 PM

Character class seems the right approach, maybe something like this:

Code:

cat ./strings
cabba
aaaaaaaa
cGbba
RRR

awk '/[gacbagjk]/ && !/[^gacbagjk]/{print $0"=1";next}{print $0"=0"}' strings
cabba=1
aaaaaaaa=1
cGbba=0
RRR=0
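
If the reference string has to come from a variable rather than being typed into the pattern, the same two tests can probably be built as dynamic regexes. This is only a sketch: classify, ref, in_set and not_in_set are made-up names, and it assumes the reference string contains no characters that are special inside a bracket expression (such as ], ^, - or \).

```shell
# classify is a hypothetical helper: reads strings on stdin, one per line,
# and prints string=1 or string=0 depending on membership in the set $1.
classify() {
  awk -v ref="$1" '
    BEGIN { in_set = "[" ref "]"; not_in_set = "[^" ref "]" }  # build both bracket expressions once
    $0 ~ in_set && $0 !~ not_in_set { print $0 "=1"; next }
    { print $0 "=0" }'
}

printf '%s\n' cabba aaaaaaaa cGbba RRR | classify gacbagjk
# prints cabba=1, aaaaaaaa=1, cGbba=0, RRR=0 (one per line)
```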

Turbocapitalist · 08-24-2017, 12:29 AM

Just nitpicking about terminology. Maybe we all mean "bracket expression", as that refers to a set of characters. Because "character class" is something else, I think the words got crossed there. The demo is spot on, but the actual term "character class" refers to things like alnum, digit, punct, alpha, graph, space, blank, lower, upper, cntrl, print, and xdigit, as in [:digit:] to represent all digits.
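
To illustrate the distinction: a POSIX character class such as [:digit:] is only valid inside a bracket expression, which is why it ends up written with doubled brackets. A minimal sketch with made-up sample input, assuming an awk that supports POSIX classes (gawk does; some older mawk builds reportedly do not):

```shell
# [[:digit:]] = a bracket expression whose only member is the class [:digit:]
printf '%s\n' 12345 12a45 abc |
awk '/^[[:digit:]]+$/ { print $0 "=1"; next } { print $0 "=0" }'
# prints 12345=1, 12a45=0, abc=0 (one per line)
```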

syg00 · 08-24-2017, 01:05 AM

And to my mind, it should be do-able with only one test on a bracket-thingy.
Left as an exercise for the OP.
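
One possible answer to the exercise, offered as a sketch rather than as what was necessarily intended: anchor the bracket expression at both ends so that a single test covers the whole line. In awk a match expression evaluates to 1 or 0, so the result can be printed directly.

```shell
# One test: the whole line, from ^ to $, must consist of characters in the set.
printf '%s\n' cabba aaaaaaaa cGbba RRR |
awk '{ print $0 "=" ($0 ~ /^[gacbagjk]+$/) }'
# prints cabba=1, aaaaaaaa=1, cGbba=0, RRR=0 (one per line)
```

The + means an empty line gives 0, which matches the behaviour of the two-test version posted earlier.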

astrogeek · 08-24-2017, 01:34 AM

Quote:

Originally Posted by Turbocapitalist

Just nitpicking about terminology. Maybe we all mean "bracket expression", as that refers to a set of characters. Because "character class" is something else, I think the words got crossed there. The demo is spot on, but the actual term "character class" refers to things like alnum, digit, punct, alpha, graph, space, blank, lower, upper, cntrl, print, and xdigit, as in [:digit:] to represent all digits.

You are, of course, POSIX-ly correct.

I recall learning early that [abc] is a character class, and then later discovering what I now know to be POSIX character classes, [:alnum:], etc. I admit I never gave the difference much thought and have continued the common usage, referring to both forms as character classes.

Your comment has prompted me to review that usage, so I pulled a few books from the shelf, looked into man regex(7) and man awk, and did a few online searches.

The books are a mixed bag on quick review. Some make the distinction, others seem not to mention it at all.

The man pages are clear and get it POSIX-ly right, at least in recent Slackware selections.

My most frequently suggested online regular expression resource seems to miss it, or at least to propagate the common usage (they may have more to say about it on other pages).

O'Reilly's sed & awk (2nd Ed., pp. 34-38) seems to clearly explain the confusion. They begin by plainly teaching that [abc] is a character class, without comment, and show the common usage.

Later, they add the POSIX definitions and importantly, the historical view:

Quote:

3.2.4.3. POSIX character class additions

The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions, which are the kind used by egrep and awk.

In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.

POSIX also changed what had been common terminology. What we've been calling a "character class" is called a "bracket expression" in the POSIX standard.

(Italics are mine).

I think it is a difference worth keeping in mind. Thanks for the nitpick!

Turbocapitalist · 08-24-2017, 03:07 AM

@astrogeek, thanks for posting what you found on the terminology. I myself wonder why these things are not called some kind of set, like character set for example.

pan64 · 08-24-2017, 04:13 AM

What is "two character strings" in the title?