LinuxQuestions.org
Old 10-15-2012, 04:34 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,165

Rep: Reputation: 306
Seeking a clever RegEx for text processing


Eleven days ago LQ newbie r_clark2 initiated a thread titled Need help writing a simple sed script.

Subsequently, moderator acid_kewpie recognized the post as homework, a violation of forum rules, and locked the thread. No complaint there.

Enough time has elapsed to make that homework overdue, so I'd like to exhume the problem.

Have a file of this nature:
Code:
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Dennis T. Morgan:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400
Want a file of this nature:
Code:
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400
Two transformations are required:
1) Change "FirstName LastName" to "LastName, FirstName".
2) Remove salary number if it ends in 500.

I tackled this problem as a learning exercise and developed this sed solution:
Code:
sed 's/\([^:]*\) \([^:]*\):/\2, \1:/' $InFile  \
|sed 's/\(.*\)\(:[0-9]*500$\)/\1/'
I think the whole job could be done with only one sed but cannot devise the RegEx to do it. sed gurus, please advise.

Daniel B. Martin
 
Old 10-15-2012, 04:48 PM   #2
alinas
Member
 
Registered: Apr 2002
Location: UK, Sywell, EGBK
Distribution: RHEL, SuSE, CentOS, Debian, Ubuntu
Posts: 60

Rep: Reputation: 20
You can supply more than one editing command to sed: sed -e 'cmd1' -e 'cmd2' $InFile
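Applied to the two substitutions from the first post, that suggestion might look like the sketch below. The two sample records come from the original post; the `sample` and `result` variable names are only illustrative.

```shell
# Two sample records from the first post (one salary ends in 500, one does not).
sample='Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500'

# One sed process, two editing commands, applied in order to every line:
#   1) swap "First Last" into "Last, First" in the first field
#   2) drop a trailing salary field that ends in 500
result=$(printf '%s\n' "$sample" |
  sed -e 's/\([^:]*\) \([^:]*\):/\2, \1:/' \
      -e 's/:[0-9]*500$//')

printf '%s\n' "$result"
```

Because the second command runs on the output of the first, no pipe between two sed processes is needed.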
 
Old 10-16-2012, 11:22 AM   #3
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,455

Rep: Reputation: 1172
I personally would write a short Perl script to do it. Or perhaps I would use "awk."

Basically, I think that you run into unnecessary problems very quickly when you insist either that (a) "I can name-that-tune in one magical (but entirely unmaintainable...) sed script," or that (b) "I can do <<anything at all>> in Bash, so there!!"

What you categorically need to do, instead, is to locate the one tool that will, in one step and in a maintainable way, get you from start to finish. The solution needs to be readable, and, when (inevitably...) a change to the requirement surfaces, it needs to be possible to add support for that change quickly and reliably without having to reconstruct the whole thing. The solution should not be "chicken scratches," but so many in-production cases are exactly that. ("It works, but don't you dare touch it, or even look at it sideways!")

awk is a tool that was designed for this sort of thing, and Perl borrowed heavily from it. An awk script, in its simplest form, consists of a series of patterns (typically regular expressions) paired with actions, but it has a programming-language element to it also. Perl has rightly been called the Swiss Army® Knife of data processing. Both of these power tools are no doubt at your beck and call right now, and will be anywhere your solution might need to be deployed.
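To make the pattern-and-action idea concrete, here is a minimal sketch run against two of the sample records from the first post. The awk program and the `sample` and `report` names are illustrative assumptions, not something posted in this thread.

```shell
sample='Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500'

# Each awk rule is "pattern { action }"; the action runs for every input
# line on which the pattern is true. -F: splits each line on colons.
report=$(printf '%s\n' "$sample" | awk -F: '
  $NF ~ /500$/  { print $1 ": salary ends in 500" }  # regex test on the last field
  $NF !~ /500$/ { print $1 ": salary kept" }         # the complementary test
  END           { print NR " records read" }         # special pattern: runs after all input
')

printf '%s\n' "$report"
```

Adding a new requirement then usually means adding one more rule rather than reworking a single long regular expression.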

Last edited by sundialsvcs; 10-16-2012 at 11:24 AM.
 
Old 10-16-2012, 12:32 PM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,165

Original Poster
Rep: Reputation: 306
Quote:
Originally Posted by sundialsvcs
awk is a tool that was designed for this sort of thing ...
I should have mentioned that I solved this problem with awk before even attempting to do it with sed. This is my awk solution...
Code:
# Split the name at its last space so a middle initial stays with the first name
sed 's/ \([^ :]*:\)/:\1/' "$InFile" \
|awk -F ":" '{print $2", "$1":"$3":"$4":"$5":"$6}' \
|rev \
|awk -F ":"  \
  '{if (substr($1,1,3)=="005") \
    print $2":"$3":"$4":"$5;   \
    else print $0}'            \
|rev
... not elegant but it works.

I am learning awk and sed and Linux programming in general. I tackled this problem as a learning exercise -- that's the reason for working on two solutions. Learn by doing.

So the question remains on the table: is there a clever RegEx which can do the whole job?

Daniel B. Martin
 
Old 10-16-2012, 01:18 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,541

Rep: Reputation: 878
Quote:
Originally Posted by danielbmartin
So the question remains on the table: is there a clever RegEx which can do the whole job?
Yes. I use GNU sed's -r option because this is ugly enough without backslashes covering everything.
Code:
# FirstName = \1, LastName = \2, MiddleFields = \3, 
# SalaryAll = \4, SalaryEndingIn500 = \5, SalaryOthers = \6
sed -r 's/([^:]*) ([^:]*):(.*)((:[0-9]*500)|(:[0-9]*))$/\2, \1:\3\6/' $InFile
Also, in the same vein as alinas's suggestion of using multiple -e arguments, you can sequence commands using ";" in sed:
Code:
sed 's/\([^:]*\) \([^:]*\):/\2, \1:/; s/\(.*\)\(:[0-9]*500$\)/\1/' $InFile
Here's a simpler awk solution:
Code:
awk -F: -vOFS=: '{
  last_name_index = match($1, / [^ ]+$/);
  $1 = substr($1, last_name_index+1) ", " substr($1, 1, last_name_index-1);
  if ($NF ~ /500$/) NF--;
  print
}' "$InFile"
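The gawk manual cautions that some awk versions do not rebuild the record when NF is decremented. A variant that avoids touching NF could look like the sketch below; it is an assumption-laden rewrite rather than anything posted in this thread, and the names `cut`, `last`, `out`, `sample` and `result` are made up for illustration.

```shell
sample='Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Dennis T. Morgan:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400'

result=$(printf '%s\n' "$sample" | awk -F: '{
  # Position of the last space in the name field; the surname follows it.
  cut = match($1, / [^ ]+$/)
  out = substr($1, cut + 1) ", " substr($1, 1, cut - 1)

  # Keep every field except a final salary that ends in 500.
  last = ($NF ~ /500$/) ? NF - 1 : NF
  for (i = 2; i <= last; i++) out = out ":" $i
  print out
}')

printf '%s\n' "$result"
```

Building the output line by hand costs a loop, but it does not depend on how a particular awk reacts to assignments to NF.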

Last edited by ntubski; 10-16-2012 at 01:20 PM. Reason: reword
 
1 members found this post helpful.
Old 10-16-2012, 01:36 PM   #6
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Hi.

This looks a bit shorter:
Code:
$ sed -r 's/([^:]*) ([^: ]*):/\2, \1:/; s/:[0-9]*500$//' infile
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400
 
1 members found this post helpful.
Old 10-16-2012, 02:21 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,165

Original Poster
Rep: Reputation: 306
Thank you, ntubski and firstfire, for your valued input. We still don't have a solution using a single RegEx... but there's no sense in beating one's brains out to construct one complex RegEx when two simpler ones do the job nicely. This thread is solved!

Daniel B. Martin
 
Old 10-16-2012, 02:38 PM   #8
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,541

Rep: Reputation: 878
Quote:
Originally Posted by danielbmartin
We still don't have a solution using a single RegEx... but there's no sense in beating one's brains out to construct one complex RegEx when two simpler ones do the job nicely.
Please reread my post, the first sed command is a single regex solution. Admittedly it's longer than the 2 regex solution...
 
1 members found this post helpful.
Old 10-16-2012, 02:46 PM   #9
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Yes, ntubski already gave a single-regex solution. Here is another one:
Code:
$ sed -r 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500)?$=\2, \1\3=' in
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Dennis T. Morgan:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400
It relies on the fact that the substring in front of the salary looks like '/..'

EDIT: Now I see that it did not work: the three lines whose salary does not end in 500 come out unchanged.
EDIT: Using Perl's non-greedy (*?) pattern:
Code:
$ perl -pe 's/([^:]*) ([^: ]*):(.*?)(:[0-9]*500)?$/\2, \1:\3/' infile
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400
One may strip a few more characters:
Code:
$ perl -pe 's/(.*?) (\w*):(.*?)(:[0-9]*500)?$/\2, \1:\3/' in

Last edited by firstfire; 10-16-2012 at 03:21 PM.
 
1 members found this post helpful.
Old 10-16-2012, 03:19 PM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,541

Rep: Reputation: 878
Quote:
Originally Posted by firstfire
EDIT: Now I see, it did not work.
Ah, but the tiniest of changes can make it work:
Code:
% sed -r 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500$)?=\2, \1\3=' people.txt
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400
Much shorter than my attempt, but still a bit longer than the 2 regex.
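The difference between the two versions comes down to where the `$` anchor sits. Here is a small sketch, assuming a sed that accepts `-E` for extended regular expressions (current GNU and BSD seds do; `-r`, as used above, is the older GNU spelling). The variable names are illustrative.

```shell
line='Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001'

# "$" outside the optional group: the whole match must reach end of line,
# so a salary that does not end in 500 makes the match fail and the line
# passes through unchanged.
outside=$(printf '%s\n' "$line" |
  sed -E 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500)?$=\2, \1\3=')

# "$" inside the optional group: the anchor is only required when the
# salary group takes part, so the name swap still applies.
inside=$(printf '%s\n' "$line" |
  sed -E 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500$)?=\2, \1\3=')

printf '%s\n%s\n' "$outside" "$inside"
```

In the second form, text after the matched part (here the salary) is simply left in place, which is why the record keeps its last field.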
 
2 members found this post helpful.
Old 10-16-2012, 03:42 PM   #11
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Quote:
Originally Posted by ntubski
Ah, but the tiniest of changes can make it work:
Excellent! You beat me on my own field! Why didn't I figure it out myself?
 
Old 10-16-2012, 04:26 PM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,165

Original Poster
Rep: Reputation: 306
Quote:
Originally Posted by ntubski
Please reread my post, the first sed command is a single regex solution.
You are right... my apologies!

Daniel B. Martin
 
Old 10-17-2012, 12:32 PM   #13
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,692

Rep: Reputation: 1987
Awk:
Code:
awk -F: '{$1 = gensub(/(.*) (.*)/,"\\2, \\1","1",$1)}$NF ~ /500$/{NF = NF - 1}1' OFS=":" file
Ruby:
Code:
ruby -F: -ape '$_ = [$F[0].scan(/(.*) (.*)/)[0].reverse.join(", "),$F[1..($F[-1]=~/500$/?-2:-1)]].join(":").chomp + "\n"' file
 
1 members found this post helpful.
  



Tags
regex, sed

