LinuxQuestions.org


Madhu Desai 12-10-2013 12:07 PM

Would like to share some common regex patterns...
 
2 Attachment(s)
Hi All

Recently, as part of my job, I had to search some large text files for commonly occurring strings. That proved harder than expected. After googling for a day and searching many books for good regular expressions, I came up with these patterns. Many experts may already know or use them, but I thought they might be of some help to newbies.

Also, since I am a novice at scripting and regex, the scripts I write may not be the best, but they will get the job done. I would really appreciate any tips on doing this in a better way.

This script matches almost all valid MAC, IP, URL and Email addresses.

~/.bashrc
Code:

# Regular Expressions
if [ -f ~/.MyRegex.sh ]; then
    . ~/.MyRegex.sh
fi

~/.MyRegex.sh
Code:

#!/bin/bash
#####################################################
# RegEx patterns for complicated searches.
# Source: Mostly from google and other books
# Shell: Bash
#####################################################

## Alias grep with color
alias grep='grep --color=auto'

## +++++++++++++++++++++++++++++ PATTERNS +++++++++++++++++++++++++++++++++++

## Pattern for valid MAC address:
MACGREP=$(cat <<_MACGREP
\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b
_MACGREP
)

## Pattern for valid IP4 address:
IPGREP=$(cat <<_IPGREP
\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
_IPGREP
)

## Pattern for valid URL address.
URLGREP=$(cat <<_URLGREP
((?i)(http|ftp|www)(\S+)|(\S+) (\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz))
_URLGREP
)

## Pattern for valid EMail address:
## (heredoc delimiter is quoted so that every backslash, including \\, is kept literally)
EMAILGREP=$(cat <<'_EMAILGREP'
(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
_EMAILGREP
)

## +++++++++++++++++++++++++++++ FUNCTIONS +++++++++++++++++++++++++++++++++++

## MAC Address: Use with -Ei. Example 'ifconfig eth0 | grep -Ei "$(MAC)"'
MAC() {
    echo "$MACGREP"
}

## IP Address: Use with -E. Example 'ifconfig eth0 | grep -E "$(IP4)"'
IP4() {
    echo "$IPGREP"
}

## URL Address: Use with -Pi. Example 'rpm -qi $(rpm -qf $(which --skip-alias grep)) | grep -Pi "$(URL)"'
URL() {
    echo "$URLGREP"
}

## EMAIL Address: Use with -Pi. Example 'man mailto.conf | col -b | grep -Pi "$(EML)"'
EML() {
    echo "$EMAILGREP"
}


Testing:
Code:

$ grep -Ei "$(MAC)" testfile
90:2C:84:09:76:B3 = Correct
52:54:00:b3:DC:95 = Correct
52:54:00:1A:DB:bc = Correct
00:50:56:C0:00:01 = Correct


$ grep -E "$(IP4)" testfile
192.168.1.74 = Correct
10.10.10.122 = Correct
1.1.1.1 = Correct
http://192.168.1.1 = Correct


$ grep -Pi "$(URL)" testfile
https://www.mybank.co.us = Correct
www.youtube.com = Correct
http://localhost = Correct
http://192.168.1.1 = Correct
ftp://subdomain.domain.org = Correct
http://MyBusiness.biz = Correct
https://www.google.co.in/?gws_rd=cr&ei=FDynUqnCJ877rAeawoHQCg = Correct


$ grep -Pi "$(EML)" testfile
madhu.abcd@gmail.com = Correct
hi_howdy@yahoo.co.in = Correct
test@sifymail.com = Correct
test-33@hotmail.co.uk = Correct
hostmaster@subdomain.domain.local = Correct
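
The test runs above only show lines that should match. A minimal sketch of a check that also covers lines that should not match is given here, assuming GNU grep (for \b in -E mode). The helper name probe is hypothetical and not part of the original script; the MAC and IP4 patterns are copied from it.

```shell
#!/bin/bash
# Patterns copied from the script above.
MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'
IP4='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

# probe PATTERN TEXT: hypothetical helper that prints "match" or "nomatch"
# for a single line of input, using case-insensitive extended regex.
probe() {
    if printf '%s\n' "$2" | grep -Eiq -- "$1"; then
        echo match
    else
        echo nomatch
    fi
}

probe "$MAC" '52:54:00:b3:DC:95'   # six hex pairs
probe "$MAC" '52:54:00:b3:DC'      # only five pairs
probe "$IP4" 'inet 10.10.10.122'   # address embedded in other text
probe "$IP4" '192.168.1.256'       # last octet out of range
```

With GNU grep this should print match, nomatch, match, nomatch. Note that the patterns only require a valid address somewhere in the line, so longer strings such as 1.2.3.4.5 are still reported as matches.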

Hope it helps.

Thanks.

Note: Please remove .txt extension from attachments.

grail 12-10-2013 02:54 PM

I am a little confused as to why you need functions. Could you not just set the variables equal to your regexes?

Also, why is the email pattern so in-depth, while the URL pattern seems to have very few restrictions?

Madhu Desai 12-10-2013 03:25 PM

Quote:

Originally Posted by grail (Post 5078581)
I am a little confused as to why you need functions. Could you not just set the variables equal to your regexes?

Trust me, even I am confused. At first I tried variables for the simple regexes like MAC and IP address. But when a regex got complicated, with many single/double quotes and braces, the shell started throwing errors. In the end, I used heredocs to store the values.
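
One detail worth knowing when storing a regex in a heredoc: with an unquoted delimiter the shell still processes some backslashes, so \\ is reduced to \. Quoting the delimiter keeps the body literal. The sketch below uses a shortened fragment of the email pattern; the variable names unquoted and quoted are only for illustration.

```shell
# Unquoted delimiter: backslash is still special before \, $ and `,
# so the \\ in the body is reduced to a single \.
unquoted=$(cat <<EOF
"(?:[\x5d-\x7f]|\\[\x01-\x09])*"
EOF
)

# Quoted delimiter: the body is taken exactly as written.
quoted=$(cat <<'EOF'
"(?:[\x5d-\x7f]|\\[\x01-\x09])*"
EOF
)

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

Only the second variable still contains \\[, which is what the regex needs to mean "a literal backslash followed by one character from the class".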

Also, I thought that by putting them in functions, I could use them with any command, like:

Code:

$ ifconfig eth0 | grep -E "$(IP4)"
$ route -n | grep -E "$(IP4)"
$ ifconfig | sed -rn "/$(IP4)/p"

etc.

Quote:

Originally Posted by grail (Post 5078581)
Also, why is the email pattern so in-depth, while the URL pattern seems to have very few restrictions?

In the beginning, I used a simple regex for email, something like this:

'\b[a-z0-9]{1,}@*\.(com|net|uk|mil|gov|edu)\b'

But it did not capture complex emails with multiple dots or numbers in between. After googling for a while, I found a link describing a regex based on the official standard for email addresses, RFC 5322, so I used that.
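
For what it is worth, the simple pattern falls short partly because @* means "zero or more @ characters" and nothing matches the domain name between the @ and the final dot. The sketch below compares it with a slightly extended pattern using grep -o, which prints only the matched text. The variable names old and new, and the extended pattern itself, are illustrative assumptions and not from the original post; GNU grep is assumed.

```shell
# The simple pattern from the post.
old='\b[a-z0-9]{1,}@*\.(com|net|uk|mil|gov|edu)\b'

# Hypothetical extension: local part, one @, one or more dot-terminated
# domain labels, then one of the listed endings.
new='\b[a-z0-9._-]+@([a-z0-9-]+\.)+(com|net|uk|mil|gov|edu)\b'

samples='test@sifymail.com
test-33@hotmail.co.uk'

echo "old pattern:"
printf '%s\n' "$samples" | grep -Eio -- "$old"
echo "new pattern:"
printf '%s\n' "$samples" | grep -Eio -- "$new"
```

The old pattern should report only sifymail.com and co.uk, while the extended one reports both full addresses. It is still far less thorough than the RFC 5322 based regex and is limited to the listed endings.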

grail 12-10-2013 06:47 PM

Quote:

But the point is, at first I tried variables for simple regex like MAC and IP address. But when regex got complicated like too many single/double quotes and braces, it started throwing errors. In the end, I used heredoc to store values.
Seems to work just fine for me? The only adjustment I had to make was for EML, where I used double quotes and escaped the two double quotes used inside the regex.

Quote:

Also, I thought by putting them in functions, I can use them on any commands like,
As far as I can tell there is no difference between using variables or functions.

I would probably use functions if I wanted to use them in place of calling something like grep, e.g. gmac to grep for MAC addresses.

I would also add that, as you are unlikely to extend this file and run it as an actual script, there is no need to place the interpreter line at the top; it will always be used as a sourced file.

So maybe something like:
Code:

## Alias grep with color
alias grep='grep --color=auto'

## +++++++++++++++++++++++++++++ PATTERNS +++++++++++++++++++++++++++++++++++

## Pattern for valid MAC address:
MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'

## Pattern for valid IP4 address:
IP='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

## Pattern for valid URL address.
URL='((?i)(http|ftp|www)(\S+)|(\S+) (\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz))'

## Pattern for valid EMail address:
## (inside double quotes, \" yields " and \\\\ yields the \\ that the regex needs)
EML="(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

## +++++++++++++++++++++++++++++ FUNCTIONS +++++++++++++++++++++++++++++++++++

gmac()
{
    grep -Ei "$MAC" "$@"
}

# or instead of one for each
g_reg()
{
    regex=$1
    shift

    grep -Pi "${!regex}" "$@"
}

I'll leave you to validate the usage for the second function, but you get the idea :)
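
As a rough way to try out the second function, the sketch below looks a pattern up by variable name through Bash indirect expansion (${!name}). It is a hypothetical variant, named g_ere here, that uses grep -E instead of -P and rejects names that do not refer to a non-empty variable. It assumes Bash and GNU grep, with the MAC and IP4 patterns as given elsewhere in the thread.

```shell
#!/bin/bash
# Pattern variables as given in the thread.
MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'
IP4='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

# g_ere NAME [FILE...]: hypothetical variant of g_reg. NAME is the name of
# a variable holding an extended regex; remaining arguments go to grep.
g_ere() {
    local name=$1
    shift
    # ${!name} expands to the value of the variable whose name is in $name.
    if [ -z "${!name:-}" ]; then
        echo "g_ere: no pattern variable named '$name'" >&2
        return 2
    fi
    grep -Ei -- "${!name}" "$@"
}

printf '%s\n' 'ether 52:54:00:b3:DC:95' 'inet 10.10.10.122' | g_ere MAC
printf '%s\n' 'ether 52:54:00:b3:DC:95' 'inet 10.10.10.122' | g_ere IP4
g_ere NOSUCH </dev/null || echo "exit status: $?"
```

The first call should print only the ether line, the second only the inet line, and the third should report exit status 2 because no variable called NOSUCH is set.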

Madhu Desai 12-11-2013 12:19 AM

@grail

Thanks for responding. You're right, variables can be used instead. It's really funny that I didn't think of that. Probably because I had been working for a 13-hour stretch and it was 3 in the morning, or simply because this regex got on my nerves, I don't know.

Anyway, I just booted my PC, I'm fresh, and now I know what mistakes I made. Sorry for confusing you and others and for dragging the thread in another direction. I'm glad I posted this on LQ before posting it on my company's intranet. As you correctly suggested, all I needed was to put each pattern in a variable and export it.

This was all I needed... just put them in .bashrc

~/.bashrc
Code:

alias grep='grep --color=auto'

export MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'

export IP4='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

export URL='\b((?i)(http|ftp|www)(\S+)|(\S+) (\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz))\b'

export EML='\b(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*'\
'|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")'\
'@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4]'\
'[0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:'\
'[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\b'

And it's working like a charm..
Code:

$ . ~/.bashrc
$ ifconfig | grep -Ei "$MAC"
$ ifconfig | grep -E "$IP4"
$ rpm -qi $(rpm -qf $(which --skip-alias grep)) | grep -Pi "$URL"
$ man mailto.conf | col -b | grep -Pi "$EML"
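
One caveat about single-quoted assignments like the EML export above: an apostrophe inside a single-quoted string ends the string, so the apostrophes in the character class are silently dropped from the stored value. The sketch below shows the effect on a shortened character class and one common way around it, writing each literal apostrophe as '\''. The variable names naive and kept are only for illustration.

```shell
# Each inner ' closes or reopens the quoted string, so both apostrophes
# vanish from the value (the unquoted *+/=?^_ is not globbed in an assignment).
naive='[a-z&'*+/=?^_'{|}~-]+'

# Writing each literal apostrophe as '\'' keeps it in the value.
kept='[a-z&'\''*+/=?^_'\''{|}~-]+'

printf 'naive: %s\n' "$naive"
printf 'kept:  %s\n' "$kept"

# Only the second class matches a whole line containing an apostrophe.
printf "o'brien\n" | grep -Ex -- "$naive" || echo "naive: no match"
printf "o'brien\n" | grep -Ex -- "$kept"
```

The pattern still works for most addresses either way; the difference only shows up for local parts that contain an apostrophe.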

Thank you very much.

