LinuxQuestions.org
Old 12-10-2013, 01:07 PM   #1
mddnix
Member
 
Registered: Mar 2013
Location: Bengaluru, India
Distribution: Redhat, Arch, Ubuntu
Posts: 498

Rep: Reputation: 137
Would like to share some common regex patterns...


Hi All

Recently, as part of my job, I had to search some large text files for commonly known strings. That proved to be harder than expected. After googling for a day and searching many books for good regular expressions, I came up with these patterns. Many experts may already know or use them, but I thought they might be of some help to newbies.

Also, being a novice in scripting and regex, the scripts I write may not be the best, but they will surely get the job done. So I would really appreciate any tips on doing this in a better way.

This script matches almost all valid MAC, IP, URL and Email addresses.

~/.bashrc
Code:
# Regular Expressions
if [ -f ~/.MyRegex.sh ]; then
    . ~/.MyRegex.sh
fi
~/.MyRegex.sh
Code:
#!/bin/bash
#####################################################
# RegEx patterns for complicated searches.
# Source: Mostly from google and other books
# Shell: Bash
#####################################################

## Alias grep with color
alias grep='grep --color=auto'

## +++++++++++++++++++++++++++++ PATTERNS +++++++++++++++++++++++++++++++++++

## Pattern for valid MAC address:
MACGREP=$(cat <<_MACGREP
\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b
_MACGREP
)

## Pattern for valid IP4 address:
IPGREP=$(cat <<_IPGREP
\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
_IPGREP
)

## Pattern for valid URL address.
URLGREP=$(cat <<_URLGREP
((?i)(http|ftp|www)(\S+)|(\S+) (\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz))
_URLGREP
)

## Pattern for valid EMail address:
# Quoted delimiter: the heredoc body is stored verbatim, so \\ stays \\.
EMAILGREP=$(cat <<'_EMAILGREP'
(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
_EMAILGREP
)

## +++++++++++++++++++++++++++++ FUNCTIONS +++++++++++++++++++++++++++++++++++

## MAC Address: Use with -Ei. Example 'ifconfig eth0 | grep -Ei "$(MAC)"'
MAC() {
    echo "$MACGREP"
}

## IP Address: Use with -E. Example 'ifconfig eth0 | grep -E "$(IP4)"'
IP4() {
    echo "$IPGREP"
}

## URL Address: Use with -Pi. Example 'rpm -qi $(rpm -qf $(which --skip-alias grep)) | grep -Pi "$(URL)"'
URL() {
    echo "$URLGREP"
}

## EMAIL Address: Use with -Pi. Example 'man mailto.conf | col -b | grep -Pi "$(EML)"'
EML() {
    echo "$EMAILGREP"
}

Testing:
Code:
$ grep -Ei "$(MAC)" testfile
90:2C:84:09:76:B3 = Correct 
52:54:00:b3:DC:95 = Correct
52:54:00:1A:DB:bc = Correct 
00:50:56:C0:00:01 = Correct

$ grep -E "$(IP4)" testfile
192.168.1.74 = Correct
10.10.10.122 = Correct
1.1.1.1 = Correct
http://192.168.1.1 = Correct

$ grep -Pi "$(URL)" testfile
https://www.mybank.co.us = Correct
www.youtube.com = Correct
http://localhost = Correct
http://192.168.1.1 = Correct
ftp://subdomain.domain.org = Correct
http://MyBusiness.biz = Correct
https://www.google.co.in/?gws_rd=cr&ei=FDynUqnCJ877rAeawoHQCg = Correct

$ grep -Pi "$(EML)" testfile
madhu.abcd@gmail.com = Correct
hi_howdy@yahoo.co.in = Correct
test@sifymail.com = Correct
test-33@hotmail.co.uk = Correct
hostmaster@subdomain.domain.local = Correct
Hope it helps.

Thanks.

Note: Please remove .txt extension from attachments.
Attached Files
File Type: txt MyRegex.sh.txt (1.8 KB, 2 views)
File Type: txt testfile.txt (947 Bytes, 2 views)

Last edited by mddnix; 12-10-2013 at 01:23 PM.
 
Old 12-10-2013, 03:54 PM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
I am a little confused as to why you need functions. Could you not just set the variables equal to your regexes?

Also, why is the email pattern so in-depth while the URL pattern seems to have very few restrictions?
 
1 members found this post helpful.
Old 12-10-2013, 04:25 PM   #3
mddnix
Member
 
Registered: Mar 2013
Location: Bengaluru, India
Distribution: Redhat, Arch, Ubuntu
Posts: 498

Original Poster
Rep: Reputation: 137
Quote:
Originally Posted by grail View Post
I am a little confused as to why you need functions. Could you not just set the variables equal to your regexes?
Trust me, even I am confused. At first I tried plain variables for the simple regexes, like the MAC and IP address ones. But when the regexes got complicated, with many single/double quotes and braces, the assignments started throwing errors. In the end, I used heredocs to store the values.
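
For anyone hitting the same quoting errors: one option is to quote the heredoc delimiter, which makes the shell store the body verbatim (no backslash, $ or backtick processing). This is only a minimal sketch; SAMPLE_RE and its pattern are made up for illustration and are not part of the script in this thread.

```bash
# Quoting the delimiter ('_RE') makes the heredoc body literal:
# backslashes, $, backticks and both kinds of quotes are kept as typed.
SAMPLE_RE=$(cat <<'_RE'
["'][a-z]+\\[0-9]+\$["']
_RE
)

# Print the stored pattern to check that the shell changed nothing.
printf '%s\n' "$SAMPLE_RE"
```

With an unquoted delimiter, the \\ and \$ in that body would be reduced to \ and $ before the pattern ever reaches grep.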

Also, I thought that by putting them in functions, I could use them with any command, like:

Code:
$ ifconfig eth0 | grep -E "$(IP4)"
$ route -n | grep -E "$(IP4)"
$ ifconfig | sed -rn "/$(IP4)/p"
etc.

Quote:
Originally Posted by grail View Post
Also, why is the email pattern so in-depth while the URL pattern seems to have very few restrictions?
In the beginning, I used a simple regex for email, something like this:

'\b[a-z0-9]{1,}@*\.(com|net|uk|mil|gov|edu)\b'

But it did not capture complex emails with multiple dots or numbers in them. After googling for a while, I found a link describing a regex based on the official standard for emails, RFC 5322, so I used that.
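
A side note on the simple attempt: `@*` means "zero or more @ characters", not "@ followed by anything", which is probably why dotted addresses were missed. Below is a rough sketch of a still-simple pattern that handles dots in both the local part and the domain. SIMPLE_EML is a hypothetical name, it assumes GNU grep (for \b), and it is nowhere near RFC 5322 coverage.

```bash
# Deliberately simple e-mail pattern (ERE, for grep -Ei):
# local part of letters, digits and . _ % + -, then @,
# a dotted domain, and a final label of two or more letters.
SIMPLE_EML='\b[a-z0-9._%+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}\b'

# -o prints only the matched text, one match per line.
printf '%s\n' 'madhu.abcd@gmail.com' 'hi_howdy@yahoo.co.in' 'not-an-address' |
    grep -Eio "$SIMPLE_EML"
```

It runs with plain -E, so it does not need a grep built with PCRE support.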

Last edited by mddnix; 12-10-2013 at 04:31 PM.
 
Old 12-10-2013, 07:47 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
Quote:
But the point is, at first I tried variables for simple regexes like the MAC and IP address ones. But when the regexes got complicated, with many single/double quotes and braces, it started throwing errors. In the end, I used heredocs to store the values.
They seem to work just fine for me. The only adjustment I had to make was for EML, where I used double quotes and escaped the two double quotes inside the regex.

Quote:
Also, I thought that by putting them in functions, I could use them with any command, like:
As far as I can tell, there is no difference between using variables and functions.

I would probably use functions only if I wanted them to stand in for a grep call, e.g. gmac to grep for MAC addresses.

I would also add that, since you are unlikely to extend this file and run it as an actual script, there is no need to put the interpreter line at the top: it will always be used as a sourced file.

So maybe something like:
Code:
## Alias grep with color
alias grep='grep --color=auto'

## +++++++++++++++++++++++++++++ PATTERNS +++++++++++++++++++++++++++++++++++

## Pattern for valid MAC address:
MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'

## Pattern for valid IP4 address:
IP='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

## Pattern for valid URL address.
URL='((?i)(http|ftp|www)(\S+)|(\S+) (\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz))'

## Pattern for valid EMail address:
## (double-quoted, so the inner " is written \" and a literal \\ is written \\\\)
EML="(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

## +++++++++++++++++++++++++++++ FUNCTIONS +++++++++++++++++++++++++++++++++++

gmac()
{
    grep -Ei "$MAC" "$@"
}

# or instead of one for each
g_reg()
{
    regex=$1
    shift

    grep -Pi "${!regex}" "$@"
}
I'll leave you to validate the usage of the second function, but you get the idea.
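
As a rough check of how the second function could be called, here is a self-contained sketch. It assumes bash (for the ${!regex} indirect expansion) and GNU grep. Note that it swaps -Pi for -Ei so it also runs where grep lacks PCRE support, which means it only suits the MAC and IP4 patterns; the sample input lines are invented.

```bash
# Patterns as plain variables (copied from the first post).
MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'
IP4='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

# g_reg NAME [grep args...]: look up the variable called NAME and grep for it.
g_reg()
{
    local regex=$1      # name of the pattern variable, e.g. MAC or IP4
    shift
    # ${!regex} expands to the value of the variable whose name is in $regex.
    grep -Ei "${!regex}" "$@"
}

# Usage: pattern name first, then files (or nothing, to read stdin).
printf '%s\n' 'eth0 HWaddr 52:54:00:b3:DC:95' 'inet addr:192.168.1.74' | g_reg MAC
```

Passing the variable name rather than its value keeps the call short (g_reg IP4 somefile), at the cost of tying the function to bash.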
 
1 members found this post helpful.
Old 12-11-2013, 01:19 AM   #5
mddnix
Member
 
Registered: Mar 2013
Location: Bengaluru, India
Distribution: Redhat, Arch, Ubuntu
Posts: 498

Original Poster
Rep: Reputation: 137
@grail

Thanks for responding. You're right, variables can be used instead. It's really funny that I didn't think of that. Probably because I had been working for a 13-hour stretch and it was 3 AM, or simply because this regex got on my nerves, I don't know.

Anyway, I just booted my PC, I'm fresh, and now I see the mistakes I made. Sorry for confusing you and others and for dragging the thread in another direction. I'm glad I posted this on LQ before posting it on my company's intranet. As you correctly suggested, all I needed was to put each pattern in a variable and export it.

This was all I needed... just put them in .bashrc

~/.bashrc
Code:
alias grep='grep --color=auto'

export MAC='\b[0-9a-f]{2}(:[0-9a-f]{2}){5}\b'

export IP4='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

export URL='\b((?i)(http|ftp|www)(\S+)|(\S+) (\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz))\b'

export EML='\b(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*'\
'|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")'\
'@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4]'\
'[0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:'\
'[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\b'
And it's working like a charm..
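
One caveat that may be worth checking: inside a single-quoted string an apostrophe ends the string, so the two ' characters in the EML character class are likely consumed by the shell instead of being stored in the pattern. Here is a small sketch of the effect and of the usual '\'' workaround; CLASS_A and CLASS_B are illustrative names only, not part of the .bashrc above.

```bash
# The inner apostrophes close and reopen the quoted string, so they vanish
# from the stored value.
CLASS_A='[a-z&'*+/=?^_'{|}~-]'

# '\'' = close the quote, add an escaped apostrophe, reopen the quote.
CLASS_B='[a-z&'\''*+/=?^_'\''{|}~-]'

# Compare what actually ended up in each variable.
printf 'A: %s\nB: %s\n' "$CLASS_A" "$CLASS_B"
```

For typical addresses the difference does not matter, since an apostrophe in the local part is rare.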
Code:
$ . ~/.bashrc
$ ifconfig | grep -Ei "$MAC"
$ ifconfig | grep -E "$IP4"
$ rpm -qi $(rpm -qf $(which --skip-alias grep)) | grep -Pi "$URL"
$ man mailto.conf | col -b | grep -Pi "$EML"
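
For a quick sanity check of the octet ranges, grep's -o option prints only the matched text instead of the whole line. This is a sketch assuming GNU grep; extract_ip4 is a hypothetical helper and the input lines are invented.

```bash
# IPv4 pattern copied from the thread (ERE, for grep -E).
IP4='\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

# Hypothetical helper: read stdin and print each match on its own line.
extract_ip4() {
    grep -Eo "$IP4"
}

# Mix of valid addresses, an out-of-range octet and a too-short address.
printf '%s\n' 'addr 192.168.1.74 up' '256.1.1.1' 'http://10.0.0.1/index' '1.2.3' |
    extract_ip4
```

Only the two in-range addresses should come out; the \b anchors keep the pattern from matching a tail such as 56.1.1.1 inside 256.1.1.1.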
Thank you very much

Last edited by mddnix; 12-11-2013 at 02:33 AM.
 
  


All times are GMT -5.