Regular expression to detect words

lucmove · 03-26-2023, 07:43 PM

For example, [0-9A-Za-zÀ-ÿ]+

That should cover all words including words that have numbers such as "B52" (but not "B-52").

The problem is, I want to consider a word:

- a string of letters, no spaces
- a string of letters and numbers, no spaces
- no string of numbers only. A word can only be a word if it contains at least one letter. It may contain numbers but must not contain only numbers.

My regular expression above doesn't filter out strings of numbers only.

This one seems to be better:

[0-9]*[A-Za-zÀ-ÿ]+[0-9]*

This one seems to be even safer:

[A-Za-zÀ-ÿ]*[0-9]*[A-Za-zÀ-ÿ]+[0-9]*[A-Za-zÀ-ÿ]*

What do you think?
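
The three patterns above can be compared with a rough sketch in Python's re module. Python is an assumption here, since the post names no tool, and the sample strings are made up for illustration. Each pattern is tested against the whole string, and the letter range À-ÿ is written with escapes (U+00C0 to U+00FF). That range also contains the × and ÷ signs, which are not letters.

```python
import re

# The post's letter class A-Za-zÀ-ÿ, written with escapes for U+00C0..U+00FF.
LETTER = "A-Za-z\u00C0-\u00FF"

patterns = {
    "first": re.compile(f"[0-9{LETTER}]+"),
    "second": re.compile(f"[0-9]*[{LETTER}]+[0-9]*"),
    "third": re.compile(f"[{LETTER}]*[0-9]*[{LETTER}]+[0-9]*[{LETTER}]*"),
}

# Hypothetical sample tokens, not taken from the post.
samples = ["B52", "777", "a1b2c3d"]

# For each pattern, keep the samples it matches from start to end.
results = {
    name: [s for s in samples if pattern.fullmatch(s)]
    for name, pattern in patterns.items()
}

print(results["first"])   # ['B52', '777', 'a1b2c3d']
print(results["second"])  # ['B52']
print(results["third"])   # ['B52']

# The range U+00C0..U+00FF also admits the multiplication sign (U+00D7).
print(bool(re.fullmatch(f"[{LETTER}]", "\u00D7")))  # True
```

The second and third patterns do reject "777", but they also reject a token such as "a1b2c3d" that alternates letters and digits more often than the pattern allows for.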

Turbocapitalist · 03-26-2023, 09:11 PM

In which context? If you are talking about POSIX Basic or Extended Regular Expressions, then there will still be a lot more letters to add, depending on how many languages you are going to cover. Perl and PCRE have the \w special escape sequence for any word character from Unicode.

Or, you might look at just grabbing what is between word boundaries. In some styles of regular expression, those are marked as \b for both the start and end. In others, they are represented as \< and \> for start and end.
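
As a small illustration of \w and \b, here is a sketch using Python's re module, chosen only as a convenient stand-in for Perl or PCRE. For Python 3 string patterns \w is Unicode-aware, but it also matches digits and the underscore, so a digits-only token still has to be filtered out separately. The sample sentence is made up.

```python
import re

# Hypothetical sample text; "Łódź" contains letters outside the À-ÿ range.
text = "The B52 flew 777 km to Łódź"

# \b marks a word boundary, \w+ is a run of word characters.
tokens = re.findall(r"\b\w+\b", text)

# \w also matches digits, so drop tokens that contain no letter at all.
words = [t for t in tokens if any(ch.isalpha() for ch in t)]

print(tokens)  # ['The', 'B52', 'flew', '777', 'km', 'to', 'Łódź']
print(words)   # ['The', 'B52', 'flew', 'km', 'to', 'Łódź']
```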

astrogeek · 03-26-2023, 09:24 PM

Character classes would make it more readable and respect the locale in effect for many applications.

This should identify words according to your definition, one per line, in many common contexts:

Code:

/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/

Results may also depend on the application or language being used so a more complete description of the problem might be helpful.

syg00 · 03-26-2023, 09:39 PM

I would imagine that the OP has already discovered that the generic classes like [:alnum:] include the minus character. The word-bounding constructs suffer similarly. And I'd imagine lines would comprise more than one candidate.

PCRE would have to be the best bet if available.

pan64 · 03-27-2023, 01:20 AM

www.regex101.com is a very good page to check your regexps. Otherwise it depends on your locale (language), your tool (if you use awk, perl, java, grep or ??).
You might need to solve it in more than one step. For example, "no string of numbers only" can be a second check, which will definitely simplify the regex
(using PCRE: \w+ to check whether it is a word, then \d+ to check whether it contains only numbers).
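
The two-step idea could look roughly like this in Python, whose re module supports \w and \d much as PCRE does. The function name is_word is hypothetical. Note that \w also accepts the underscore, so a stricter definition of "word" would need an extra check.

```python
import re

def is_word(token: str) -> bool:
    """Two-step check: only word characters, and not digits only."""
    # Step 1: the whole token must consist of word characters.
    if re.fullmatch(r"\w+", token) is None:
        return False
    # Step 2: reject tokens made of digits only.
    return re.fullmatch(r"\d+", token) is None

print(is_word("B52"))   # True
print(is_word("777"))   # False
print(is_word("B-52"))  # False, the hyphen is not a word character
```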

Michael Uplawski · 03-27-2023, 02:08 AM

Jeffrey Friedl knows it all.

I am afraid of reacting in a precise way to a blurry situation, mostly because it will not help the OP durably. Would it not be sufficient to state that non-white-space and white-space are rather specific and can be attacked individually?

Otherwise... take Jeffrey Friedl and roll your own.

lucmove · 03-27-2023, 11:09 AM

Quote:

Originally Posted by Turbocapitalist

In which context? If you are talking about POSIX Basic or Extended Regular Expressions, then there will still be a lot more letters to add, depending on how many languages you are going to cover. Perl and PCRE have the \w special escape sequence for any word character from Unicode.

Any one will do. It just has to enforce the desired rules.

Quote:

Originally Posted by astrogeek

Character classes would make it more readable and respect the locale in effect for many applications.
This should identify words according to your definition, one per line, in many common contexts:

Code:

/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/

That doesn't work. It won't filter out strings of numbers only.

Quote:

Originally Posted by pan64

www.regex101.com is a very good page to check your regexps.

I prefer Laurent Riesterer's Visual Regexp application.

No great ideas so far. I'm probably just going to use my regex [A-Za-zÀ-ÿ]*[0-9]*[A-Za-zÀ-ÿ]+[0-9]*[A-Za-zÀ-ÿ]*

boughtonp · 03-27-2023, 11:59 AM

Quote:

Originally Posted by lucmove

No great ideas so far

That's rude. The great idea everyone is re-stating is that the answer so far is "it depends", because you've not clarified sufficiently.

Many of us don't want to hand you a literal answer that we know will introduce bugs and require further support/explaining, only for the real solution to be something else entirely.

https://mywiki.wooledge.org/XyProblem

astrogeek · 03-27-2023, 12:08 PM

Quote:

Originally Posted by lucmove

That doesn't work. It won't filter out strings of numbers only.

Sure it does.

Code:

$ cat infile
12335
67abc
def
abc123fgh
jj7j
kkk
777
rrr
nothing

$ sed -n '/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/p' infile
67abc
def
abc123fgh
jj7j
kkk
rrr
nothing

But as noted above it will also allow numeric sign and decimal characters.

metaed · 03-27-2023, 12:57 PM

The definition of "word" given by OP can be restated as:
A "word" MUST consist of at least one letter.
A "letter" is defined as a member of the character class [A-Za-zÀ-ÿ].
The letter MAY be preceded or followed by one or more numbers and letters.
A "number" is defined as a member of the character class [0-9].
This can be expressed as a regular expression:

[0-9A-Za-zÀ-ÿ]*[A-Za-zÀ-ÿ][0-9A-Za-zÀ-ÿ]*

But if the intended use is to read a file of text and determine whether each line, as a whole, is or is not a word, then the start-of-line and end-of-line zero-width assertions would also be needed:

^[0-9A-Za-zÀ-ÿ]*[A-Za-zÀ-ÿ][0-9A-Za-zÀ-ÿ]*$
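
To check the anchored expression against the sample input posted earlier in the thread, a minimal Python sketch follows. Python's re is an assumption, the À-ÿ range is written with escapes, and the name WORD_LINE is made up.

```python
import re

# The anchored expression; \u00C0-\u00FF is the À-ÿ range.
WORD_LINE = re.compile(
    "^[0-9A-Za-z\u00C0-\u00FF]*[A-Za-z\u00C0-\u00FF][0-9A-Za-z\u00C0-\u00FF]*$"
)

# The contents of the sample file from earlier in the thread, one entry per line.
lines = ["12335", "67abc", "def", "abc123fgh", "jj7j", "kkk", "777", "rrr", "nothing"]

# Keep only the lines that are a word as a whole.
kept = [line for line in lines if WORD_LINE.search(line)]

print(kept)  # ['67abc', 'def', 'abc123fgh', 'jj7j', 'kkk', 'rrr', 'nothing']
```

The digits-only lines "12335" and "777" are dropped, which matches the sed output shown earlier.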

pan64 · 03-27-2023, 01:34 PM

Quote:

Originally Posted by lucmove

I prefer Laurent Riesterer's Visual Regexp application.

You ought to post a link to it. Maybe this one: http://laurent.riesterer.free.fr/regexp/ ?
It is more than 15 years old. It was probably good in its day, but regex engines are more powerful now, so it is obsolete. Anyway, you can use that too.

Quote:

Originally Posted by lucmove

No great ideas so far

What goes around comes around.

lucmove · 03-27-2023, 01:52 PM

Quote:

Originally Posted by metaed

The definition of "word" given by OP can be restated as:
A "word" MUST consist of at least one letter.
A "letter" is defined as a member of the character class [A-Za-zÀ-ÿ].
The letter MAY be preceded or followed by one or more numbers and letters.
A "number" is defined as a member of the character class [0-9].
This can be expressed as a regular expression:

[0-9A-Za-zÀ-ÿ]*[A-Za-zÀ-ÿ][0-9A-Za-zÀ-ÿ]*

But if the intended use is to read a file of text and determine whether each line, as a whole, is or is not a word, then the start-of-line and end-of-line zero-width assertions would also be needed:

^[0-9A-Za-zÀ-ÿ]*[A-Za-zÀ-ÿ][0-9A-Za-zÀ-ÿ]*$

Good breakdown. But I don't want to determine if each line is a word. I have many lines, each line has at least one word and I want to add every individual word to a control list.

Quote:

Originally Posted by lucmove

No great ideas so far,

This is irrelevant. Some good comments but no great ideas so far, which is true and is normal. Nothing to get offended about.

EDIT:

Quote:

Originally Posted by pan64

You ought to post a link to it. Maybe this one: http://laurent.riesterer.free.fr/regexp/ ?
It is more than 15 years old. It was probably good in its day, but regex engines are more powerful now, so it is obsolete. Anyway, you can use that too.

It's more than 20 years old, and it's awesome. I never knew that regular expressions had changed at all in the last 20 years. What new developments happened since PCRE, which already existed 20 years ago?

dugan · 03-27-2023, 02:10 PM

I take it you already know about \b?

metaed · 03-27-2023, 02:43 PM

Quote:

Originally Posted by lucmove

I have many lines, each line has at least one word and I want to add every individual word to a control list.

Then today grep -o might be your best friend. You would just run:

grep -o '[0-9A-Za-zÀ-ÿ]*[A-Za-zÀ-ÿ][0-9A-Za-zÀ-ÿ]*'

This is, of course, how the command looks before you add stdin redirection, or before you list input files on the grep command line.

The function of grep -o is to "print only the matched (non-empty) parts of a matching line, with each such part on a separate output line".

I should add that the -o option is a GNU extension. Most Linux distros will have it. If you are limited to the basic Unix command line, you'll need a different solution.
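
Where grep -o is not available, the same extraction can be sketched in Python with re.findall, which returns every non-overlapping match in a line. The function name extract_words, the sample lines, and the de-duplication step are assumptions. grep -o itself does not remove duplicates, and the thread does not say whether the control list should.

```python
import re

# The same expression as in the grep command; \u00C0-\u00FF is the À-ÿ range.
WORD = re.compile(
    "[0-9A-Za-z\u00C0-\u00FF]*[A-Za-z\u00C0-\u00FF][0-9A-Za-z\u00C0-\u00FF]*"
)

def extract_words(lines):
    """Collect each distinct word once, in order of first appearance."""
    words = {}
    for line in lines:
        for word in WORD.findall(line):
            words.setdefault(word, None)
    return list(words)

# Hypothetical input lines.
sample = ["The B52 flew 777 km", "km 42 later, the B52 landed"]
control_list = extract_words(sample)

print(control_list)  # ['The', 'B52', 'flew', 'km', 'later', 'the', 'landed']
```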

syg00 · 03-27-2023, 08:20 PM

How would that handle the aforementioned B-52?
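
One way to explore that question is the Python sketch below, where re stands in for grep. With the expression from the grep command, "B-52" yields only "B": the hyphen ends the match, and "52" on its own contains no letter. The second pattern, HYPHENATED, is a hypothetical variant that keeps hyphen-joined tokens together. Nothing in the thread settles whether "B-52" should count as one word.

```python
import re

L = "A-Za-z\u00C0-\u00FF"  # the À-ÿ letter class, written with escapes

# The expression from the grep -o command.
WORD = re.compile(f"[0-9{L}]*[{L}][0-9{L}]*")

# Hypothetical variant: alphanumeric runs joined by single hyphens.
HYPHENATED = re.compile(f"[0-9{L}]+(?:-[0-9{L}]+)*")
HAS_LETTER = re.compile(f"[{L}]")

text = "B-52 and B52 and 7-7"

plain = WORD.findall(text)
joined = [t for t in HYPHENATED.findall(text) if HAS_LETTER.search(t)]

print(plain)   # ['B', 'and', 'B52', 'and']
print(joined)  # ['B-52', 'and', 'B52', 'and']
```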