[SOLVED] Matching two character strings for membership
With awk, I need to determine whether a character string (TestString) is composed entirely of characters found in a second character string (ReferenceString).
Examples:
If ReferenceString="gacbagjk", then
TestString="cabba" would result in 1.
TestString="aaaaaaaa" would result in 1.
TestString="cGbba" would result in 0.
TestString="RRR" would result in 0.
I've done this by testing each character in TestString individually, but that seems like a brute-force method.
A bracket expression would do that. It matches any member of the set enclosed by [ and ]. So [abc]+ would match any string that includes one or more of 'a', 'b', or 'c', such as 'bat', 'acre', or 'beer', but not 'dog', and so on.
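As a minimal sketch of how that applies to the original question: unanchored, [abc]+ only has to find one matching run somewhere in the string, so the expression needs ^ and $ around it to require that the whole TestString come from the set. The function name is_member is made up for illustration, and the sketch assumes ReferenceString holds only ordinary characters (no ], ^, - or \, which need special placement or escaping inside brackets).

```shell
# is_member REF TEST: prints 1 if every character of TEST occurs in REF, else 0.
# Hypothetical helper; assumes REF contains no ']', '^', '-' or '\'.
is_member() {
    awk -v ref="$1" -v test="$2" 'BEGIN {
        # Build an anchored bracket expression, e.g. ^[gacbagjk]+$
        re = "^[" ref "]+$"
        # The match operator yields 1 on a match and 0 otherwise
        result = (test ~ re)
        print result
    }'
}

# The four examples from the question
is_member gacbagjk cabba
is_member gacbagjk aaaaaaaa
is_member gacbagjk cGbba
is_member gacbagjk RRR
```

An empty TestString gives 0 here because + needs at least one character; use * instead if an empty string should count as a match.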
Just nitpicking about terminology. Maybe we all mean "bracket expression", as that refers to a set of characters. "Character class" is something else; I think the words got crossed there. The demo is spot on, but the actual term "character class" refers to things like alnum, digit, punct, alpha, graph, space, blank, lower, upper, cntrl, print, and xdigit, as in [:digit:] to represent all digits.
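To make the distinction concrete, here is a small sketch. A POSIX character class name such as [:digit:] is written inside a bracket expression, so the full form is [[:digit:]]. The function name all_digits is made up for illustration, and this assumes an awk that supports POSIX character classes (gawk and recent mawk do; some older awks do not).

```shell
# all_digits STR: prints 1 if STR consists of one or more digits, else 0.
# Hypothetical helper; assumes the awk in use supports POSIX character classes.
all_digits() {
    awk -v s="$1" 'BEGIN {
        # Outer [ ] is the bracket expression; [:digit:] is the class inside it
        result = (s ~ /^[[:digit:]]+$/)
        print result
    }'
}

all_digits 2024
all_digits 20a4
```

A class can also sit alongside ordinary characters in the same bracket expression, for example [[:alpha:]_] for a letter or an underscore.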
You are, of course, POSIX-ly correct.
I recall learning early that [abc] is a character class, and then later discovering what I now know to be POSIX character classes, [:alnum:], etc. I admit I never gave the difference much thought and have continued the common usage, referring to both forms as character classes.
Your comment has prompted me to review that usage, so I pulled a few books from the shelf, looked into man regex(7) and man awk, and ran a few online searches.
The books are a mixed bag on quick review. Some make the distinction, others seem not to mention it at all.
The man pages are clear and get it POSIX-ly right, at least in recent Slackware selections.
My most frequently suggested online regular expression resource seems to miss it, or at least to propagate the common usage (they may have more to say about it on other pages).
O'Reilly Sed & Awk (2nd Ed. pg. 34-38) seems to clearly explain the confusion. They begin by plainly teaching that [abc] is a character class, without comment, and show the common usage.
Later, they add the POSIX definitions and importantly, the historical view:
Quote:
3.2.4.3. POSIX character class additions
The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions, which are the kind used by egrep and awk.
In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.
POSIX also changed what had been common terminology. What we've been calling a "character class" is called a "bracket expression" in the POSIX standard.
(Italics are mine).
I think it is a difference worth keeping in mind. Thanks for the nitpick!
@astrogeek, thanks for posting what you found on the terminology. I myself wonder why these things are not called some kind of set, like character set for example.