LinuxQuestions.org
Old 08-23-2017, 09:56 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Matching two character strings for membership


With awk, I need to determine whether a character string (TestString) is composed entirely of characters found in a second character string (ReferenceString).

Examples:
If ReferenceString="gacbagjk", then
TestString="cabba" would result in 1.
TestString="aaaaaaaa" would result in 1.
TestString="cGbba" would result in 0.
TestString="RRR" would result in 0.

I've done this by testing each character in TestString individually, but that seems like a brute-force method.

Is there a slick loopless way to do it?

Daniel B. Martin
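
For reference, the per-character approach described above might look something like the sketch below. The original code was not posted, so the function name is_member_loop and the handling of an empty TestString (treated here as a non-match) are assumptions.

```shell
# Hypothetical helper: prints 1 if every character of $1 occurs in $2, else 0.
is_member_loop() {
    awk -v test="$1" -v ref="$2" 'BEGIN {
        ok = (length(test) > 0)        # assumption: an empty TestString gives 0
        for (i = 1; i <= length(test) && ok; i++)
            if (index(ref, substr(test, i, 1)) == 0)   # character not in ReferenceString
                ok = 0
        print ok
    }'
}

is_member_loop cabba gacbagjk    # prints 1
is_member_loop cGbba gacbagjk    # prints 0
```

The loop stops at the first character that index() cannot find in the reference string.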
 
Old 08-23-2017, 10:34 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
Character class ?
 
1 member found this post helpful.
Old 08-23-2017, 11:00 PM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,330
Blog Entries: 3

Rep: Reputation: 3726
A bracket expression would do that. It matches any member of the set enclosed by [ and ]. So [abc]+ would match any string that includes one or more of 'a', 'b', or 'c', such as 'bat', 'acre', or 'beer', but not 'dog', and so on.

Code:
awk '/[abc]+/ { print "yes", $0 }' inputfile.txt
See "man 7 regex"
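
Worth noting: without anchors, a bracket expression matches as soon as any one character of the line is in the set, so on its own it tests "contains", not "is composed entirely of". A small sketch using the reference string from the first post (the helper name contains_any is made up for illustration):

```shell
# Hypothetical helper: label each input line by whether the unanchored
# bracket expression matches anywhere in it.
contains_any() {
    awk '/[gacbagjk]+/ { print $0 "=match"; next } { print $0 "=no match" }'
}

printf '%s\n' cabba cGbba RRR | contains_any
# cabba=match
# cGbba=match      (contains c, b, a even though G is not in the set)
# RRR=no match
```

So a second condition, or anchoring, is still needed to reject strings such as cGbba.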
 
1 member found this post helpful.
Old 08-23-2017, 11:35 PM   #4
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
Character class seems the right approach, maybe something like this:

Code:
cat ./strings
cabba
aaaaaaaa
cGbba
RRR

awk '/[gacbagjk]/ && !/[^gacbagjk]/{print $0"=1";next}{print $0"=0"}' strings
cabba=1
aaaaaaaa=1
cGbba=0
RRR=0

Last edited by astrogeek; 08-23-2017 at 11:47 PM. Reason: Shortened with next, no + reqd...
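
The same two tests can probably be written with the set passed in as a variable instead of being typed twice. This is only a sketch: the function name membership_two_tests and the awk variable ref are made up, and it assumes the reference string holds only letters and digits (characters such as ], ^, - or \ would need escaping inside the brackets).

```shell
# Hypothetical helper: same two tests, with the set supplied in the awk
# variable "ref" (assumed to contain only letters and digits).
membership_two_tests() {
    awk -v ref="$1" '
        # first test: at least one character from the set (rejects empty lines)
        # second test: no character from outside the set
        $0 ~ ("[" ref "]") && $0 !~ ("[^" ref "]") { print $0 "=1"; next }
        { print $0 "=0" }'
}

printf '%s\n' cabba aaaaaaaa cGbba RRR | membership_two_tests gacbagjk
# cabba=1
# aaaaaaaa=1
# cGbba=0
# RRR=0
```

The negated set [^...] does the real work; the first test is only there so that an empty line is reported as 0 rather than 1.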
 
1 member found this post helpful.
Old 08-24-2017, 12:29 AM   #5
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,330
Blog Entries: 3

Rep: Reputation: 3726
Just nitpicking about terminology. Maybe we all mean "bracket expression", as that refers to a set of characters. "Character class" is something else, so I think the words got crossed there. The demo is spot on, but the actual term "character class" refers to things like alnum, digit, punct, alpha, graph, space, blank, lower, upper, cntrl, print, and xdigit, as in [:digit:] to represent all digits.
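
To illustrate the distinction: a POSIX character class such as [:digit:] is written inside a bracket expression, which is why it shows up with doubled brackets. A minimal sketch (the helper name all_digits is made up, and some older awk implementations may not support POSIX classes at all):

```shell
# Hypothetical helper: prints 1 for lines made up only of digits, else 0.
# The outer [ ] is the bracket expression, [:digit:] is the character class.
all_digits() {
    awk '{ print $0 "=" ($0 ~ /^[[:digit:]]+$/) }'
}

printf '%s\n' 2017 08-23 | all_digits
# 2017=1
# 08-23=0
```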
 
2 members found this post helpful.
Old 08-24-2017, 01:05 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
And to my mind, it should be do-able with only one test on a bracket-thingy.
Left as an exercise for the OP.
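
One way a single test could look is to anchor the bracket expression at both ends, so the whole line has to consist of characters from the set. Again just a sketch: is_member and ref are made-up names, and the reference string is assumed to contain only letters and digits.

```shell
# Hypothetical helper: one anchored test; the match result (1 or 0) is
# printed directly after the input line.
is_member() {
    awk -v ref="$1" '{ print $0 "=" ($0 ~ ("^[" ref "]+$")) }'
}

printf '%s\n' cabba aaaaaaaa cGbba RRR | is_member gacbagjk
# cabba=1
# aaaaaaaa=1
# cGbba=0
# RRR=0
```

The + requires at least one character, so an empty line comes out as 0.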
 
1 member found this post helpful.
Old 08-24-2017, 01:34 AM   #7
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
Quote:
Originally Posted by Turbocapitalist
Just nitpicking about terminology. Maybe we all mean "bracket expression", as that refers to a set of characters. "Character class" is something else, so I think the words got crossed there. The demo is spot on, but the actual term "character class" refers to things like alnum, digit, punct, alpha, graph, space, blank, lower, upper, cntrl, print, and xdigit, as in [:digit:] to represent all digits.
You are, of course, POSIX-ly correct.

I recall learning early that [abc] is a character class, and then later discovering what I now know to be POSIX character classes, [:alnum:], etc. I admit I never gave the difference much thought and have continued the common usage, referring to both forms as character classes.

Your comment has prompted me to review that usage, so I pulled a few books from the shelf, looked into man regex(7) and man awk, and ran a few online searches.

The books are a mixed bag on quick review. Some make the distinction, others seem not to mention it at all.

The man pages are clear and get it POSIX-ly right, at least in recent Slackware selections.

My most frequently suggested online regular expression resource seems to miss it, or at least to propagate the common usage (they may have more to say about it on other pages).

O'Reilly Sed & Awk (2nd Ed. pg. 34-38) seems to clearly explain the confusion. They begin by plainly teaching that [abc] is a character class, without comment, and show the common usage.

Later, they add the POSIX definitions and importantly, the historical view:

Quote:
3.2.4.3. POSIX character class additions

The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions, which are the kind used by egrep and awk.

In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.

POSIX also changed what had been common terminology. What we've been calling a "character class" is called a "bracket expression" in the POSIX standard.
(Italics are mine).

I think it is a difference worth keeping in mind. Thanks for the nitpick!
 
1 member found this post helpful.
Old 08-24-2017, 03:07 AM   #8
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,330
Blog Entries: 3

Rep: Reputation: 3726
@astrogeek, thanks for posting what you found on the terminology. I myself wonder why these things are not called some kind of set, like character set for example.
 
Old 08-24-2017, 04:13 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,930

Rep: Reputation: 7321
What is "two character strings" in the title?
 
  



Tags
awk, membership, strings


