LinuxQuestions.org
Old 03-26-2023, 07:43 PM   #1
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,432

Rep: Reputation: 110
Regular expression to detect words


For example, [0-9A-Za-zĄ-’]+

That should cover all words including words that have numbers such as "B52" (but not "B-52").

The problem is that I want a word to be:

- a string of letters, no spaces
- a string of letters and numbers, no spaces
- never a string of numbers only. A word must contain at least one letter. It may contain numbers, but it must not consist of numbers alone.

My regular expression above doesn't filter out strings of numbers only.

This one seems to be better:

[0-9]*[A-Za-zĄ-’]+[0-9]*

This one seems to be even safer:

[A-Za-zĄ-’]*[0-9]*[A-Za-zĄ-’]+[0-9]*[A-Za-zĄ-’]*

What do you think?
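One way to compare the three candidates is to run each against a few sample strings. The sketch below is in Python and assumes ASCII-only stand-ins for the letter class (the extended range from the post is left out so it runs anywhere); the names `patterns`, `samples` and `whole_match_table` are made up for illustration.

```python
import re

# ASCII-only stand-ins for the classes in the post (assumption: the
# extended letter range is omitted here).
LETTER = "A-Za-z"
DIGIT = "0-9"

patterns = {
    "first": f"[{DIGIT}{LETTER}]+",
    "second": f"[{DIGIT}]*[{LETTER}]+[{DIGIT}]*",
    "third": f"[{LETTER}]*[{DIGIT}]*[{LETTER}]+[{DIGIT}]*[{LETTER}]*",
}

samples = ["B52", "777", "a1b2c3"]


def whole_match_table(patterns, samples):
    """Return {pattern name: {sample: bool}} using whole-string matching."""
    return {
        name: {s: re.fullmatch(rx, s) is not None for s in samples}
        for name, rx in patterns.items()
    }


results = whole_match_table(patterns, samples)
for name, row in results.items():
    print(name, row)
```

Under these assumptions the first pattern accepts "777", while the second and third reject it. Both of them also reject "a1b2c3" as a whole, because each allows only a fixed number of alternations between letters and digits.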

Last edited by lucmove; 03-26-2023 at 07:45 PM.
 
Old 03-26-2023, 09:11 PM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
In which context? If you are talking about POSIX or Extended Regular expressions, then there will still be a lot more letters to add, depending on how many languages you are going to cover. Perl and PCRE have the \w special escape character for any word character from Unicode.

Or, you might look at just grabbing what is between word boundaries. In some styles of regular expression, those are marked as \b for both the start and end. In others, they are represented as \< and \> for start and end.
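As a rough illustration of the word-boundary idea, here is a sketch using Python's `re` module, where `\b` marks a boundary and `\w` is Unicode-aware for text strings. The sample sentence is invented.

```python
import re

text = "The B52 and the B-52 flew 777 miles"

# \b is a word boundary; \w covers letters, digits and the underscore.
tokens = re.findall(r"\b\w+\b", text)
print(tokens)
```

In this sketch the hyphen splits "B-52" into "B" and "52", and digit-only runs such as "777" are still returned, so boundaries alone do not enforce the "at least one letter" rule.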
 
1 members found this post helpful.
Old 03-26-2023, 09:24 PM   #3
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,263
Blog Entries: 24

Rep: Reputation: 4194
Character classes would make it more readable and respect the locale in effect for many applications.

This should identify words according to your definition, one per line, in many common contexts:
Code:
/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/
Results may also depend on the application or language being used so a more complete description of the problem might be helpful.
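For readers working in a language without POSIX bracket classes, a similar pattern can be approximated. Python's `re` module does not support `[[:alnum:]]` or `[[:alpha:]]`, so the sketch below substitutes `[^\W_]` and `[^\W\d_]`. These are approximations rather than exact equivalents, and `is_word` is a hypothetical helper name.

```python
import re

# [^\W_]   ~ letters and digits (word characters minus the underscore)
# [^\W\d_] ~ letters only (word characters minus digits and underscore)
# Approximations of [[:alnum:]] and [[:alpha:]], not exact equivalents.
WORD_LINE = re.compile(r"^[^\W_]*[^\W\d_][^\W_]*$")


def is_word(line):
    """True if the line is alphanumeric and contains at least one letter."""
    return WORD_LINE.match(line) is not None


for candidate in ["67abc", "777", "caf\u00e9", "B-52"]:
    print(candidate, is_word(candidate))
```

Because `\w` is Unicode-aware for Python text strings, accented letters are accepted here without listing them explicitly.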

Last edited by astrogeek; 03-26-2023 at 09:26 PM.
 
2 members found this post helpful.
Old 03-26-2023, 09:39 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
I would imagine that the OP has already discovered that the generic classes like [:alnum:] include the minus character. The word-bounding constructs suffer similarly. And I'd imagine lines would comprise more than one candidate.

PCRE would have to be the best bet if available.
 
1 members found this post helpful.
Old 03-27-2023, 01:20 AM   #5
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,840

Rep: Reputation: 7308
www.regex101.com is a very good page to check your regexps. Otherwise it depends on your locale (language) and your tool (awk, perl, java, grep or ??).
You might need to solve it in several steps. For example, "no string of numbers only" can be a second check, which will definitely simplify the regex
(using PCRE: \w+ to check whether it is a word, \d+ to check whether it contains only numbers).
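The two-step idea can be sketched as follows, using Python's `re` module in place of PCRE. The function name `words_two_step` and the sample sentence are made up for illustration.

```python
import re


def words_two_step(text):
    """Step 1: collect \\w+ runs. Step 2: drop runs made only of digits."""
    candidates = re.findall(r"\w+", text)
    return [c for c in candidates if not re.fullmatch(r"\d+", c)]


print(words_two_step("B52 flew 777 miles on 2023-03-26"))
```

One caveat: `\w` also matches the underscore, so a token such as `_` would pass both steps unless it is filtered separately.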
 
2 members found this post helpful.
Old 03-27-2023, 02:08 AM   #6
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,622
Blog Entries: 40

Rep: Reputation: Disabled
Jeffrey Friedl knows it all.

I am afraid of reacting in a precise way to a blurry situation, mostly because it will not help the OP durably. Would it not be sufficient to state that non-white-space and white-space are rather specific and can be attacked individually?

Otherwise... take Jeffrey Friedl and roll your own.
 
Old 03-27-2023, 11:09 AM   #7
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,432

Original Poster
Rep: Reputation: 110
Quote:
Originally Posted by Turbocapitalist View Post
In which context? If you are talking about POSIX or Extended Regular expressions, then there will still be a lot more letters to add, depending on how many languages you are going to cover. Perl and PCRE have the \w special escape character for any word character from Unicode.
Any one will do. It just has to enforce the desired rules.

Quote:
Originally Posted by astrogeek View Post
Character classes would make it more readable and respect the locale in effect for many applications.
This should identify words according to your definition, one per line, in many common contexts:
Code:
/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/
That doesn't work. It won't filter out strings of numbers only.

Quote:
Originally Posted by pan64 View Post
www.regex101.com is a very good page to check your regexps.
I prefer Laurent Riesterer's Visual Regexp application.

No great ideas so far, I'm probably just going to use my regex [A-Za-zĄ-’]*[0-9]*[A-Za-zĄ-’]+[0-9]*[A-Za-zĄ-’]*
 
Old 03-27-2023, 11:59 AM   #8
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,599

Rep: Reputation: 2546
Quote:
Originally Posted by lucmove View Post
No great ideas so far
That's rude. The great idea everyone is re-stating is that the answer so far is "it depends", because you've not clarified sufficiently.

Many of us don't want to hand you a literal answer that we know will introduce bugs and require further support/explaining, only for the real solution to be something else entirely.

https://mywiki.wooledge.org/XyProblem

 
2 members found this post helpful.
Old 03-27-2023, 12:08 PM   #9
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,263
Blog Entries: 24

Rep: Reputation: 4194
Quote:
Originally Posted by lucmove View Post
That doesn't work. It won't filter out strings of numbers only.
Sure it does.

Code:
$ cat infile
12335
67abc
def
abc123fgh
jj7j
kkk
777
rrr
nothing

$ sed -n '/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/p' infile
67abc
def
abc123fgh
jj7j
kkk
rrr
nothing
But as noted above it will also allow numeric sign and decimal characters.
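The same test, including lines with a sign, a decimal point or a hyphen, can be tried outside sed. The sketch below assumes ASCII approximations of the two classes (roughly what they mean in the C/POSIX locale). The `extra` inputs are hypothetical and not part of the original infile.

```python
import re

# ASCII approximation of /^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/
# (assumption: C/POSIX locale, where alnum is [0-9A-Za-z]).
LINE_IS_WORD = re.compile(r"[0-9A-Za-z]*[A-Za-z][0-9A-Za-z]*")

infile = ["12335", "67abc", "def", "abc123fgh", "jj7j", "kkk", "777", "rrr", "nothing"]
extra = ["-5", "3.14", "B-52"]  # hypothetical lines with sign, decimal point, hyphen


def keep_words(lines):
    """Return the lines that match the pattern as a whole."""
    return [ln for ln in lines if LINE_IS_WORD.fullmatch(ln)]


print(keep_words(infile))
print(keep_words(extra))
```

In this approximation "-" and "." belong to neither class, so the three extra lines are rejected. Whether a particular tool and locale behave the same way is worth checking directly with input like this.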
 
1 members found this post helpful.
Old 03-27-2023, 12:57 PM   #10
metaed
Member
 
Registered: Apr 2022
Location: US
Distribution: Slackware64 15.0
Posts: 363

Rep: Reputation: 170
The definition of "word" given by OP can be restated as:
A "word" MUST consist of at least one letter.
A "letter" is defined as a member of the character class [A-Za-zĄ-’].
The letter MAY be preceded or followed by one or more numbers and letters.
A "number" is defined as a member of the character class [0-9].
This can be expressed as a regular expression:

[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*

But if the intended use is to read a file of text and determine whether each line, as a whole, is or is not a word, then the start and end of line 0-length assertions would also be needed:

^[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*$
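The difference between the unanchored and anchored forms can be seen in a small sketch. It assumes an ASCII-only letter class in place of the extended one, and `classify` is a hypothetical helper name.

```python
import re

# ASCII-only stand-in for the letter class above (assumption).
WORD = r"[0-9A-Za-z]*[A-Za-z][0-9A-Za-z]*"

unanchored = re.compile(WORD)
anchored = re.compile("^" + WORD + "$")


def classify(line):
    """Return (a word occurs somewhere in the line, the whole line is a word)."""
    return (unanchored.search(line) is not None,
            anchored.search(line) is not None)


for line in ["B52", "two words", "777", "B-52"]:
    print(repr(line), classify(line))
```

The unanchored form answers "does this line contain a word?", while the anchored form answers "is this line exactly one word?".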
 
1 members found this post helpful.
Old 03-27-2023, 01:34 PM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,840

Rep: Reputation: 7308
Quote:
Originally Posted by lucmove View Post
I prefer Laurent Riesterer's Visual Regexp application.
You ought to post a link to it. Maybe this one: http://laurent.riesterer.free.fr/regexp/ ?
This is more than 15 years old. It was probably good, but regex engines are more powerful now, so it is obsolete. Anyway, you can use that too.
Quote:
Originally Posted by lucmove View Post
No great ideas so far
What goes around comes around.
 
Old 03-27-2023, 01:52 PM   #12
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,432

Original Poster
Rep: Reputation: 110
Quote:
Originally Posted by metaed View Post
The definition of "word" given by OP can be restated as:
A "word" MUST consist of at least one letter.
A "letter" is defined as a member of the character class [A-Za-zĄ-’].
The letter MAY be preceded or followed by one or more numbers and letters.
A "number" is defined as a member of the character class [0-9].
This can be expressed as a regular expression:

[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*

But if the intended use is to read a file of text and determine whether each line, as a whole, is or is not a word, then the start and end of line 0-length assertions would also be needed:

^[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*$
Good breakdown. But I don't want to determine if each line is a word. I have many lines, each line has at least one word and I want to add every individual word to a control list.

Quote:
Originally Posted by lucmove View Post
No great ideas so far,
This is irrelevant. Some good comments but no great ideas so far, which is true and is normal. Nothing to get offended about.

EDIT:

Quote:
Originally Posted by pan64 View Post
You ought to post a link to it. Maybe this one: http://laurent.riesterer.free.fr/regexp/ ?
This is more than 15 years old. It was probably good, but regex engines are more powerful now, so it is obsolete. Anyway, you can use that too.
It's more than 20 years old, and it's awesome. I never knew that regular expressions had changed at all in the last 20 years. What new developments happened since PCRE, which already existed 20 years ago?

Last edited by lucmove; 03-27-2023 at 01:57 PM.
 
Old 03-27-2023, 02:10 PM   #13
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,223

Rep: Reputation: 5320
I take it you already know about \b?
 
Old 03-27-2023, 02:43 PM   #14
metaed
Member
 
Registered: Apr 2022
Location: US
Distribution: Slackware64 15.0
Posts: 363

Rep: Reputation: 170
Quote:
Originally Posted by lucmove View Post
I have many lines, each line has at least one word and I want to add every individual word to a control list.
Then today grep -o might be your best friend. You would just run:

grep -o '[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*'

This is, of course, how the command looks before you add stdin redirection, or before you list input files on the grep command line.

The function of grep -o is to "print only the matched (non-empty) parts of a matching line, with each such part on a separate output line".

I should add that the -o option is a GNU extension. Most Linux distros will have it. If you are limited to the basic Unix command line, you'll need a different solution.
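Where grep -o is not available, the same extraction can be sketched in a scripting language. The Python example below assumes an ASCII-only letter class in place of the extended one. `collect_words` and `sample_lines` are hypothetical names, and the deduplicated, sorted list is only one possible reading of "control list".

```python
import re

# ASCII-only stand-in for the pattern given to grep -o (assumption).
WORD = re.compile(r"[0-9A-Za-z]*[A-Za-z][0-9A-Za-z]*")


def collect_words(lines):
    """Gather every match from every line, deduplicated and sorted."""
    found = set()
    for line in lines:
        found.update(WORD.findall(line))
    return sorted(found)


sample_lines = ["B52 flew 777 miles", "the B52 landed at 0930h"]
control_list = collect_words(sample_lines)
print(control_list)
```

To read real input, the lines could come from an open file object or `sys.stdin` instead of the sample list.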
 
Old 03-27-2023, 08:20 PM   #15
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
How would that handle the aforementioned B-52?
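A quick way to find out is to run the pattern against both spellings. This sketch assumes an ASCII-only letter class in place of the extended one and uses Python's `re.findall` as a rough analogue of grep -o.

```python
import re

# ASCII-only stand-in for the pattern from the grep -o post (assumption).
WORD = re.compile(r"[0-9A-Za-z]*[A-Za-z][0-9A-Za-z]*")

plain = WORD.findall("B52")
hyphenated = WORD.findall("B-52")
print(plain)
print(hyphenated)
```

Under these assumptions "B52" comes back whole, while "B-52" yields only "B": the hyphen ends the match and the digits-only remainder "52" is dropped. Whether that stray "B" belongs in the control list is a question the thread leaves open.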
 
  


All times are GMT -5. The time now is 02:09 AM.
