That should cover all words including words that have numbers such as "B52" (but not "B-52").
The problem is that I want to define a word as:
- a string of letters, no spaces
- a string of letters and numbers, no spaces
- no string of numbers only. A word can only be a word if it contains at least one letter. It may contain numbers but must not contain only numbers.
My regular expression above doesn't filter out strings of numbers only.
In which context? If you are talking about POSIX or Extended Regular Expressions, then there will still be a lot more letters to add, depending on how many languages you are going to cover. Perl and PCRE have the \w special escape character for any word character from Unicode.
Or, you might look at just grabbing what is between word boundaries. In some styles of regular expression, those are marked as \b for both the start and end. In others, they are represented as \< and \> for start and end.
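As a rough illustration of the word-boundary approach, here is a minimal Python sketch. Python's re module uses the \b style, and the sample sentence is made up for the example. Note that \w also matches digits and the underscore, so digit-only tokens still come through and "B-52" is split at the hyphen.

```python
import re

# Hypothetical sample text, not taken from the thread.
text = "The B52 and the B-52 flew 1200 miles"

# Grab every run of word characters that sits between word boundaries.
tokens = re.findall(r"\b\w+\b", text)

print(tokens)
# ['The', 'B52', 'and', 'the', 'B', '52', 'flew', '1200', 'miles']
```

So word boundaries alone do not enforce the "at least one letter" rule; some extra filtering would still be needed.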
I would imagine that the OP has already discovered that the generic classes like [:alnum:] include the minus character. The word-bounding constructs suffer similarly. And I'd imagine lines would contain more than one candidate.
www.regex101.com is a very good site for checking your regexps. Otherwise it depends on your locale (language) and your tool (awk, perl, java, grep, or something else).
You might need to solve it in more than one step. For example, "no string of numbers only" can be a second check, which will definitely simplify the regex
(using PCRE: \w+ to check whether it is a word, \d+ to check whether it contains only digits).
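The two-step idea could look something like this in Python. This is only a sketch: the function name is_word is made up here, and Python's \w (like PCRE's) also accepts the underscore, which the OP's definition does not mention.

```python
import re

def is_word(token: str) -> bool:
    """Two-step check: word characters only, and not digits only."""
    # Step 1: the whole token must consist of word characters.
    if re.fullmatch(r"\w+", token) is None:
        return False
    # Step 2: reject tokens made up of digits only.
    return re.fullmatch(r"\d+", token) is None

for candidate in ["B52", "abc", "1234", "B-52"]:
    print(candidate, is_word(candidate))
# B52 True
# abc True
# 1234 False
# B-52 False
```

Caveat: a token such as "_" passes both steps, so underscores would need a third check if they matter.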
I am afraid of reacting in a precise way to a blurry situation, mostly because it will not help the OP durably. Would it not be sufficient to state that non-white-space and white-space are rather specific and can be attacked individually?
Otherwise... take Jeffrey Friedl and roll your own.
Quote:
In which context?
Any one will do. It just has to enforce the desired rules.
Quote:
Originally Posted by astrogeek
Character classes would make it more readable and respect the locale in effect for many applications.
This should identify words according to your definition, one per line, in many common contexts:
Code:
/^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$/
That doesn't work. It won't filter out strings of numbers only.
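For anyone who wants to try the quoted pattern against sample lines, here is a quick Python sketch. Python's re module has no POSIX bracket classes, so [[:alnum:]] and [[:alpha:]] are replaced by ASCII ranges here, which is an assumption (it corresponds roughly to the C locale). The sample strings are made up.

```python
import re

# ASCII stand-ins for the POSIX classes (assumption: C/POSIX locale).
ALNUM = "[0-9A-Za-z]"   # stands in for [[:alnum:]]
ALPHA = "[A-Za-z]"      # stands in for [[:alpha:]]

# Same shape as the quoted pattern: anchored to the whole line.
LINE_IS_WORD = re.compile(f"^{ALNUM}*{ALPHA}{ALNUM}*$")

samples = ["B52", "12345", "abc", "B-52", "two words", "9lives"]
results = {s: bool(LINE_IS_WORD.search(s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

In this sketch the digits-only line "12345" is rejected, since the pattern requires one letter somewhere. A line holding several words ("two words") is rejected as well, because the ^ and $ anchors make the pattern test the line as a whole rather than pick out individual words.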
That's rude. The great idea everyone is re-stating is that the answer so far is "it depends", because you've not clarified sufficiently.
Many of us don't want to hand you a literal answer that we know will introduce bugs and require further support/explaining, only for the real solution to be something else entirely.
I prefer Laurent Riesterer's Visual Regexp application.
You ought to post a link to it. Maybe this one: http://laurent.riesterer.free.fr/regexp/ ?
This is more than 15 years old. It was probably good in its day, but regex engines are more powerful now, so it is obsolete. Anyway, you can use that too.
The definition of "word" given by OP can be restated as:
A "word" MUST consist of at least one letter.
A "letter" is defined as a member of the character class [A-Za-zĄ-’].
The letter MAY be preceded or followed by one or more numbers and letters.
A "number" is defined as a member of the character class [0-9].
This can be expressed as a regular expression:
[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*
But if the intended use is to read a file of text and determine whether each line, as a whole, is or is not a word, then the start and end of line 0-length assertions would also be needed:
^[0-9A-Za-zĄ-’]*[A-Za-zĄ-’][0-9A-Za-zĄ-’]*$
Good breakdown. But I don't want to determine if each line is a word. I have many lines, each line has at least one word and I want to add every individual word to a control list.
Quote:
Originally Posted by lucmove
No great ideas so far,
This is irrelevant. Some good comments but no great ideas so far, which is true and is normal. Nothing to get offended about.
EDIT:
Quote:
Originally Posted by pan64
You ought to post a link to it. Maybe this one: http://laurent.riesterer.free.fr/regexp/ ?
This is more than 15 years old. It was probably good in its day, but regex engines are more powerful now, so it is obsolete. Anyway, you can use that too.
It's more than 20 years old, and it's awesome. I never knew that regular expressions had changed at all in the last 20 years. What new developments happened since PCRE, which already existed 20 years ago?
This is, of course, how the command looks before you add stdin redirection, or before you list input files on the grep command line.
The function of grep -o is to "print only the matched (non-empty) parts of a matching line, with each such part on a separate output line".
I should add that the -o option is a GNU extension. Most Linux distros will have it. If you are limited to the basic Unix command line, you'll need a different solution.
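If grep -o is not available, the same "one match per output line" behaviour can be approximated in a few lines of Python. This is a sketch under assumptions: the function name grep_o and the sample lines are made up, and "letter" is taken to mean any Unicode word character that is neither a digit nor an underscore, which may be broader than the character class used earlier in the thread.

```python
import re

LETTER = r"[^\W\d_]"   # a word character that is not a digit or underscore
ALNUM = r"[^\W_]"      # a letter or a digit

# Unanchored: zero or more letters/digits, one letter, zero or more letters/digits.
WORD = re.compile(f"{ALNUM}*{LETTER}{ALNUM}*")

def grep_o(lines):
    """Yield every match on every line, one at a time, like grep -o."""
    for line in lines:
        for match in WORD.finditer(line):
            yield match.group()

# Hypothetical input lines.
sample = ["B52 and B-52", "1234 x9 café", "42"]

words = list(grep_o(sample))
control_list = set(words)   # every distinct word, for the control list

print(words)
# ['B52', 'and', 'B', 'x9', 'café']
```

Digits-only runs such as "1234" and "42" never match because the pattern needs one letter, and "B-52" only contributes "B" because the hyphen ends the word.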