LinuxQuestions.org - [SOLVED] Regex in Linux does not work

frodobag

11-09-2015 10:00 PM

Regex in Linux does not work

Hi All,

Here's the script I was testing. On Linux my shell environment is bash.
The objective is to test whether my input is a whole number like 1, 52 or 1000. If it is, the script should say nothing, as expected. For any other input that doesn't match the criteria, it should say "error: Not a number" and quit.

#!/bin/bash
re='^[0-9]+$'
printf 'Enter a number or anything to test: \n'
read char
if ! [[ "$char" =~ "$re" ]]; then
echo "error: Not a number !!!" >&2; exit 1
fi

FYI the above script works fine on one Linux PC in a bash shell.

But when the same script is used on another Linux PC, which uses the Bourne shell (sh), it does not work :(
For all the whole numbers and everything else, it gives the error message. Can somebody please help shed some light?

thanks,
frodobag

frankbell

11-09-2015 10:16 PM

What distro is on that other PC?

frodobag

11-09-2015 10:17 PM

Ubuntu

berndbausch

11-09-2015 10:35 PM

Quote:

Originally Posted by frodobag (Post 5447301)

FYI the above script works fine on one Linux PC in a bash shell.

But when the same script is used on another Linux PC, which uses the Bourne shell (sh), it does not work :(
For all the whole numbers and everything else, it gives the error message. Can somebody please help shed some light?

Bash is an extension of the Bourne shell. This means that Bash understands Bourne shell syntax, but has syntax elements that the Bourne shell doesn't understand.

In particular, I think the [[ ... ]] construct doesn't exist in the Bourne shell.

What is the error message?

EDIT: Your program doesn't work because of the following (from the bash reference guide):

Quote:

If the pattern is stored in a shell variable, quoting the variable expansion forces the entire pattern to be matched as a string

That is, since you put quotes around $re, you are testing "if $char matches the string $re", not the intended "if $char matches the regular expression $re".
If char has a value of, say, 'fdg^[0-9]+$lk', the expression [[ $char =~ "$re" ]] will be true.

Thus, to check if char is a number, remove the quotes around $re. You can also remove the quotes around $char, since they are not needed inside [[ ... ]].
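
To make the quoting point concrete, here is a minimal sketch for bash 3.2 or later. The function names matches_literal and matches_regex are made up for this illustration and are not part of the original script; only the re value and the test inputs come from the thread.

```bash
#!/bin/bash
# Compare a quoted and an unquoted right-hand side of =~ inside [[ ... ]].
re='^[0-9]+$'

# Quoted: "$re" is treated as a literal string that must occur in the input.
matches_literal() {
    [[ $1 =~ "$re" ]]
}

# Unquoted: $re is interpreted as an extended regular expression.
matches_regex() {
    [[ $1 =~ $re ]]
}

for input in 52 'fdg^[0-9]+$lk' abc; do
    if matches_literal "$input"; then lit=yes; else lit=no; fi
    if matches_regex "$input"; then rx=yes; else rx=no; fi
    printf 'input=%-18s quoted=%-3s unquoted=%s\n' "'$input'" "$lit" "$rx"
done
```

With the quoted form, only the input that literally contains ^[0-9]+$ should report yes; with the unquoted form, only 52 should.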

frodobag

11-09-2015 10:48 PM

no error message. Just that after I run the script and when I type in 1, 52, or 1000 it erroneously outputs "error: Not a number !!" instead of outputting nothing and quietly exiting as would be expected.

berndbausch

11-09-2015 10:53 PM

Quote:

Originally Posted by frodobag (Post 5447311)

See my updated post above. For giggles, try entering fdg^[0-9]+$lk and see what happens.

frodobag

11-09-2015 11:07 PM

Nope that didn't work. But oddly enough I was trying some variations and with re='^[0-9]+$' and now it works! But thanks all, at least it jogs my thoughts a bit.

frodobag

11-09-2015 11:25 PM

A side note: I am using the above test script in a more complex script that reads a log file. When I type whole numbers on the keyboard, I guess the regex recognises them as actual numbers, so it is correct. But when my more complex script extracts a value from the log file using grep, cut and sed, I see that value as a number, but is it possible that the regex comparison above "sees" it as text? Maybe that's why it says "not a number"?

syg00

11-10-2015 12:54 AM

I don't use Bourne shell, but the bash abs guide notes that both the [[ ... ]] extended test and regex match aren't supported in Bourne and are portability issues.
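
If the script really has to run under plain sh, one portable option is a case pattern instead of [[ ... ]] and =~. This is only a sketch; the function name is_whole_number is an assumption for the example, not something from the original script.

```sh
#!/bin/sh
# Portable whole-number test: uses neither [[ ... ]] nor =~.
is_whole_number() {
    case $1 in
        '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *)             return 0 ;;   # one or more digits and nothing else
    esac
}

for input in 1 52 1000 abc 12-34 ''; do
    if is_whole_number "$input"; then
        echo "'$input': whole number"
    else
        echo "'$input': error: Not a number !!!"
    fi
done
```

The case patterns are shell globs, not regular expressions, so *[!0-9]* means "any non-digit anywhere in the string".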

chrism01

11-10-2015 01:08 AM

Yeah, the original sh shell is less capable than bash (hence the name ;) ).
I'd stick with the latter, unless of course you want to move up to eg Perl (which is red hot on regexes...)

berndbausch

11-10-2015 01:21 AM

Quote:

Originally Posted by frodobag (Post 5447323)

The =~ operator matches text. It has no notion of numbers.

Your bash fragment above says "not a number" because when your $re is surrounded by quotes, you match $char against a mere string, not a regular expression. Since 12345 doesn't contain the string ^[0-9]+$, the test fails. When you remove the quotes, $re is interpreted as a regexp and the test succeeds.

You say "nope it doesn't work", but I wonder what it is that doesn't work? I am curious to see your code, your input and your output.

Edit: I participate here in part because I learn. I didn't know about these details of =~ and would like to gain an even deeper understanding. I don't insist for insistence's sake.

pan64

11-10-2015 01:39 AM

looks like you need to use:

Code:

# instead of "$re"

if ! [[ "$char" =~ $re ]]; then

chrism01

11-10-2015 04:39 PM

There's a couple of good explanations/HOWTOs here
http://www.tldp.org/LDP/abs/html/regexp.html
http://www.itworld.com/article/26933...pressions.html - this one has an example of your problem :)

frodobag

11-10-2015 07:21 PM

Thanks guys for the hints.

Here's a snippet of the other script:

#!/bin/bash
#re='^\d(\d)?(\d)?(\d)?(\d)?$ '
re='^[0-9]+$'
char=` ...just grepping some whole numbers from a log file here, like 1234 or 56, etc...`
if ! [[ $char =~ $re ]] ; then
echo "error: Not a number !!!"
else
echo " Whole number - good"
fi

I've tried... if ! [[ "$char" =~ $re ]] ; then
...as well, along with other regexes, but the output was "error: Not a number" even when the char value was something like 1234, when I expected it to say " Whole number - good" instead. Only when the char value is, let's say, text like THISTEXT, or has special characters like 1234-456:7:8, should it say "error: Not a number". But as of now it says "error" for all 3 of these examples, so at this point it still doesn't work.

I probably need to find an alternative to the =~ operator and [[..]]

frodobag

11-10-2015 08:46 PM

I think I should elaborate on char=` ...just grepping some whole numbers from a log file here, like 1234 or 56, etc...`
The way I use grep might be the problem.

I use... tac logfile | grep "(1.)" |grep -E '[0-9]{1,4}' | head -1

So the line in the logfile gets selected for example.... (1.) This is a line 1234 and that's it. Date: 12-12-2015
So it will pick out 1234 correctly, but the extra grep -E command still doesn't help. While the grep result of 1234 is correct, the result of the comparison operator =~ is not: it always says "error: Not a number".

rknichols

11-10-2015 10:12 PM

Quote:

Originally Posted by frodobag (Post 5447850)

grep "(1.)" |grep -E '[0-9]{1,4}'

That grep is returning the whole line that contains the match, not just the part that matches. Maybe you have a line that looks like it contains just a number, but any trailing white space will be included. You need to include the "-o" (--only-matching) option to ensure that just the part of the line that matches is returned.

It would help if your "Not a number" message included the string (wrapped in quotes) that was rejected.
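
A small sketch of the difference, using the sample line posted earlier in the thread. The pattern [0-9]{4} is an assumption chosen for this illustration (the original {1,4} would also match the 1 in "(1.)"), and the variable names whole and only are made up.

```bash
#!/bin/bash
line="(1.) This is a line 1234 and that's it. Date: 12-12-2015"

# Without -o, grep prints the whole line that contains a match.
whole=$(printf '%s\n' "$line" | grep -E '[0-9]{4}')

# With -o, grep prints only the matched text, one match per output line;
# head keeps the first of them.
only=$(printf '%s\n' "$line" | grep -oE '[0-9]{4}' | head -n 1)

# Wrapping the values in quotes makes stray whitespace visible.
printf 'without -o: "%s"\n' "$whole"
printf 'with -o:    "%s"\n' "$only"
```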

Diantre

11-10-2015 10:33 PM

Quote:

Originally Posted by frodobag (Post 5447850)

The way I use to grep might be the problem.

I use... tac logfile | grep "(1.)" |grep -E '[0-9]{1,4}' | head -1

So the line in the logfile gets selected for example.... (1.) This is a line 1234 and that's it. Date: 12-12-2015

Could you post a sample of your log file? I agree that the grep line could be causing the problems. If your input line is

Code:

(1.) This is a line 1234 and that's it. Date: 12-12-2015

and you pipe it to that grep command, you'll get the whole line, not only the integers.

Try echoing the $char variable just before entering the if:

Code:

...

echo "char: $char"

if ! [[ "$char" =~ $re ]] ; then

...

This way you can verify what exactly is in $char. Another debugging method is to use the "set -x" command at the beginning of the script.

Edit: just noticed that rknichols already suggested the same, oh well...

berndbausch

11-10-2015 10:44 PM

A remark in addition to rknichols' comment about grep: To debug your program, make ample use of echo, writing the contents of the variables in question to the screen. You can also switch debugging on and off using set -x and set +x, so that you see what happens in crucial parts of your program.

Had you done that, you would have seen that $char is indeed not a number and not wasted your time writing a post.

Edit: Looks like this is at least the 3rd time this suggestion is made...
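
For what it's worth, the set -x / set +x toggling can be limited to just the test, roughly like this. It is a sketch only; the function name check_traced is hypothetical, and the second call feeds it the kind of value with a hidden trailing byte that is easy to miss with a plain echo.

```bash
#!/bin/bash
re='^[0-9]+$'

check_traced() {
    local value=$1
    set -x                       # from here on, each command is echoed to stderr
    if [[ $value =~ $re ]]; then
        echo "Whole number - good"
    else
        echo "error: Not a number !!!"
    fi
    { set +x; } 2>/dev/null      # stop tracing without logging the set +x itself
}

check_traced 25064
check_traced $'25064\001'
```

The trace lines go to stderr, so they can be separated from the normal output of the script with a redirect such as 2>trace.log.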

frodobag

11-10-2015 10:59 PM

Thanks guys,

I would love to post a sample of the log file, but unfortunately it's "classified". Hence I gave the example above just to illustrate. All I can say is it's a bunch of text, numbers and special characters bunched up together. Besides the tac, grep and head commands, I actually had to use sed and cut to "pluck out" the numbers I wanted. But unfortunately, depending on the time, it also plucks out text and special characters. Hence the need to compare using a regex to see if it's a number, which I then want to print out. If it's text or something else, then discard it.

Oh, thanks berndbausch for the set -x suggestion, I was wondering about ways to debug. I will give that a try.

frodobag

11-10-2015 11:15 PM

here are the results of the debug

+ char=$'25064\001'
+ echo $'25064\001'
25064
+ [[ 25064 =~ ^[0-9]+$ ]]
+ echo 'error: Not a number !!!'
error: Not a number !!!
+ exit 1

So the odd thing is what is that \001 doing there ?
Otherwise it seems to give out 25064 as a number

berndbausch

11-10-2015 11:33 PM

Quote:

Originally Posted by frodobag (Post 5447894)

How did you initialize $char?

frodobag

11-10-2015 11:40 PM

I did not initialise $char.
Its basically used as something like below to grep the number from a logfile:
char=`tac logfile | grep "(1.)" |grep -E '[0-9]{1,4}' | head -1`

Diantre

11-10-2015 11:58 PM

Quote:

Originally Posted by frodobag (Post 5447894)

So the odd thing is what is that \001 doing there ?
Otherwise it seems to give out 25064 as a number

"\001" is octal 1, ascii character #1, SOH (start of heading). Perhaps it's in your log file?

berndbausch

11-11-2015 12:36 AM

Quote:

Originally Posted by frodobag (Post 5447905)

I did not initialise $char.
Its basically used as something like below to grep the number from a logfile:
char=`tac logfile | grep "(1.)" |grep -E '[0-9]{1,4}' | head -1`

That's initialization.

As rknichols said, grep returns the whole line, not just the number.
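
Since the set -x trace showed a trailing \001 in $char, one way to cope is to strip control characters before the regex test. This is a sketch under the assumption that deleting every control character is acceptable for this data; the variable name clean is made up, and the value of char is simply copied from the trace.

```bash
#!/bin/bash
re='^[0-9]+$'
char=$'25064\001'             # same value the set -x trace reported

# printf %q shows non-printing characters that a plain echo hides.
printf 'raw value:     %q\n' "$char"

# Delete all control characters (which includes \001) before testing.
clean=$(printf '%s' "$char" | tr -d '[:cntrl:]')
printf 'cleaned value: %q\n' "$clean"

if [[ $clean =~ $re ]]; then
    echo "Whole number - good"
else
    echo "error: Not a number !!!" >&2
fi
```

Extracting only the digits in the first place, rather than cleaning up afterwards, is probably the more robust fix.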

pan64

11-11-2015 02:13 AM

You should give us a usable sample so that we can help you construct a usable solution.
Personally I suggest you do the following:

Code:

char=$(awk ' /^\(1\.\).*[0-9]{1,4}/ { parse lines, fetch relevant data } END { print that value } ' logfile)

No sed, no grep, no tac, no head or cut and a lot of different tricks, just a single awk (or perl/python/whatever).

On the other hand, the error message is correct: you tried to compare a string which contained not only digits, but something else too.
awk will also ensure you have a valid number.

frodobag

11-11-2015 05:42 PM

Thanks, I was thinking of a similar sample to the one I am facing for such log lines

(123456:789.123)-{ABCDE.12345=456789:1234:ABCD.FGHJK:1111-CVBN543TGYH10:4564611:12312:5645=POIJKJH}

(123456:890.456)-{POIU.12345=456711:567834:ABCD.FGHJK9:2223-YYN543TGYH10:46646PPOUY^^&%:5775=POIJKJH}

(123456:990.888)-{POIU.12775=456709:1234:ABCD.FGHJK8:223-YYN543TGUU%10:466PPOUY%^%11:10777975=POIJKJH}

I am trying to pick out the numbers after "11:". So in the first line I want the numbers 123 from "11:123", in the second line I want 5678 from "11:5678", and in the third I want 107779 from "11:107779". As you can see, the "11:" jumps between various positions as the log progresses, for example randomly among these same 3 positions. I'm not sure how I can achieve that with just awk. That's why I was using a bunch of tools like tac, sed, cut, etc. to pick out the latest one whenever I run the script. But I'm happy to explore any options.

rknichols

11-11-2015 06:52 PM

You can do that quite easily with sed:

Code:

sed -r -n 's/.*11:([0-9]+).*/\1/p'

The "-r" option says to use extended regular expressions. The "-n" inhibits the default printing of all lines. The expression looks for lines that contain "11:" followed by a string of one or more digits and possibly followed by more characters, replaces the entire line with just that string of digits from the parenthesized sub-expression, and prints the result.
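
As a quick way to try this, the command can be wrapped in a function and fed the three sample lines from the previous post. The function name extract_after_marker is an assumption for the example; the sed expression and the input lines are taken from the thread unchanged.

```bash
#!/bin/bash
# Print the digits that follow the last "11:" on each input line;
# lines without "11:" followed by a digit produce no output.
extract_after_marker() {
    sed -r -n 's/.*11:([0-9]+).*/\1/p'
}

extract_after_marker <<'EOF'
(123456:789.123)-{ABCDE.12345=456789:1234:ABCD.FGHJK:1111-CVBN543TGYH10:4564611:12312:5645=POIJKJH}
(123456:890.456)-{POIU.12345=456711:567834:ABCD.FGHJK9:2223-YYN543TGYH10:46646PPOUY^^&%:5775=POIJKJH}
(123456:990.888)-{POIU.12775=456709:1234:ABCD.FGHJK8:223-YYN543TGUU%10:466PPOUY%^%11:10777975=POIJKJH}
EOF
```

In the real script the function would read from the log file instead of the here-document, for example extract_after_marker < logfile | tail -n 1 to keep only the latest value.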

frodobag

11-11-2015 08:33 PM

Many thanks rknichols! It worked! :) I tested many times and it has worked flawlessly so far; when the numbers moved from 4 digits to 5 digits it was correct each time. Thanks, now I can proceed with the rest of the script.

chrism01

11-11-2015 08:50 PM

I thought I'd have a look at what would happen if an '11:' also appeared near the start of the first record

Code:

(123456:789.123)-{ABCDE.12345=456711:1234:ABCD.FGHJK:1111-CVBN543TGYH10:4564611:12312:5645=POIJKJH}

but when I ran your sed it still picks out the 2nd occurrence. I can't seem to figure out why ;)
Could you elucidate?
Thx

syg00

11-11-2015 09:31 PM

Greediness,
Pretty dodgy to rely on it tho' ...

chrism01

11-11-2015 09:49 PM

Hey syg00,
how are you ? ;)

So you're saying the first '.*' effectively causes it to match the last occurrence?
I did a quick couple of tests and that seems to be the case.
How would you specify the first match then?
(this is worrying; I used to be ok at regexes...)

syg00

11-11-2015 09:55 PM

Yep - it matches the entire text. Then the regex engine starts working backwards until the next regex element matches.
Rinse, shake, repeat.

Of course, regex ain't regex. :p

chrism01

11-11-2015 09:58 PM

True; I need to re-read Friedl's book. Funnily enough I was planning to revise that today, hence the interest.
So (as per my last qn) how would you force it to match 1st occurrence instead?

EDIT: oh yeah; just started doing it in Perl : with the front & back wildcards you get the last one; without wildcards you get the first.

Still working on the sed version (I always was a bit iffy with sed...)

syg00

11-11-2015 10:25 PM

Sorry - forgot that q.
You can't in sed. You need non-greedy quantifiers, which, last I looked, sed doesn't support. perlre is the simplest solution.
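
If Perl isn't an option, awk's match() is another way to get the first occurrence, since it reports the leftmost match in RSTART and RLENGTH. The sketch below puts it next to the greedy sed for comparison; the function names first_after_marker and last_after_marker are invented for this example, and the test line is the one with two '11:' occurrences posted above.

```bash
#!/bin/bash
# Last occurrence: the leading greedy .* pushes the match as far right as it can.
last_after_marker() {
    printf '%s\n' "$1" | sed -r -n 's/.*11:([0-9]+).*/\1/p'
}

# First occurrence: match() finds the leftmost "11:" followed by digits;
# the +3 / -3 skip over the three characters of "11:" itself.
first_after_marker() {
    printf '%s\n' "$1" |
        awk 'match($0, /11:[0-9]+/) { print substr($0, RSTART + 3, RLENGTH - 3) }'
}

line='(123456:789.123)-{ABCDE.12345=456711:1234:ABCD.FGHJK:1111-CVBN543TGYH10:4564611:12312:5645=POIJKJH}'
echo "first: $(first_after_marker "$line")"
echo "last:  $(last_after_marker "$line")"
```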

chrism01

11-11-2015 11:10 PM

That explains why I'm banging my head on the desk.... its so nice to stop :)

pan64

11-12-2015 02:01 AM

what about this?
sed -r -n 's/11:([0-9]+).*/11:\1/;s/.*11://p'

syg00

11-12-2015 02:04 AM

You can't generalise it for an indeterminate number of occurrences of the required string (in sed)

pan64

11-12-2015 02:23 AM

obviously not, but it can find the first occurrence

rknichols

11-12-2015 10:36 AM

Quote:

Originally Posted by syg00 (Post 5448439)

You can't generalise it for an indeterminate number of occurrences of the required string (in sed)

Never say, "It can't," with sed. Its language is, after all, Turing complete. Given enough time and resources, it can do anything any computer could ever do. Whether it is a practical candidate for a given application is a different question.

berndbausch

11-12-2015 11:38 AM

Quote:

Originally Posted by rknichols (Post 5448636)

Your assignment: Rewrite the Linux kernel in sed.

chrism01

11-12-2015 05:27 PM

Now that would impress even Linus
