LinuxQuestions.org


danielbmartin 10-15-2012 04:34 PM

Seeking a clever RegEx for text processing
 
Eleven days ago LQ newbie r_clark2 initiated a thread titled "Need help writing a simple sed script".

Subsequently moderator acid_kewpie recognized the post as homework, a violation of forum rules, and locked the thread. No complaint there.

Enough time has elapsed to make that homework overdue, so I'd like to exhume the problem.

Have a file of this nature:
Code:

Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Dennis T. Morgan:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400

Want a file of this nature:
Code:

Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400

Two transformations are required:
1) Change "FirstName LastName" to "LastName, FirstName".
2) Remove salary number if it ends in 500.

I tackled this problem as a learning exercise and developed this sed solution:
Code:

sed 's/\([^:]*\) \([^:]*\):/\2, \1:/' $InFile  \
|sed 's/\(.*\)\(:[0-9]*500$\)/\1/'

I think the whole job could be done with only one sed but cannot devise the RegEx to do it. sed gurus, please advise.

Daniel B. Martin

alinas 10-15-2012 04:48 PM

You can supply more than one editing command to sed: sed -e 'cmd1' -e 'cmd2' $InFile
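
As a rough sketch of that form, here are the two substitutions from the first post passed to a single sed process. The fix_records function name and the two sample records fed in through printf are only for illustration; with a real file you would pass $InFile to sed instead.

```shell
# Hypothetical wrapper: one sed process, two editing commands,
# applied in order to every input line.
fix_records() {
  sed -e 's/\([^:]*\) \([^:]*\):/\2, \1:/' \
      -e 's/\(.*\)\(:[0-9]*500$\)/\1/'
}

# Two records from the first post, used as sample input.
printf '%s\n' \
  'Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300' \
  'Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500' |
fix_records
# Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
# Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
```

The second command sees the line as the first command left it, so the order of the -e options matters in general, even though these two happen not to interfere with each other.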

sundialsvcs 10-16-2012 11:22 AM

I personally would write a short Perl script to do it. Or perhaps I would use "awk."

Basically, I think that you run into unnecessary problems very quickly when you take the attitude of either (a) "I can name-that-tune in one magical (but entirely unmaintainable...) sed script," and/or (b) "I can do <<anything at all>> in Bash, so there!!"

What you categorically need to do, instead, is to locate the tool that will, in one step and with one tool and in a maintainable way, get you from start to finish. The solution needs to be readable, and, when (inevitably...) a change to the requirement surfaces, it needs to be possible to very quickly and reliably add support for that change without having to reconstruct it. The solution should not be "chicken scratches," but so many in-production cases are exactly that. ("It works, but you don't dare touch it, or even look at it sideways!")

awk is a tool that was designed for this sort of thing, and the entire Perl language was originally an off-shoot of that. An awk script, in its simplest form, simply consists of a series of regular-expressions, but it has a programming-language element to it also. Perl has rightly been called the Swiss Army® Knife of data processing. And both of these power-tools are no doubt right now at your beck-and-call, and will be, anywhere your solution might need to be deployed.

danielbmartin 10-16-2012 12:32 PM

Quote:

Originally Posted by sundialsvcs (Post 4807268)
awk is a tool that was designed for this sort of thing ...

I should have mentioned that I solved this problem with awk before even attempting to do it with sed. This is my awk solution...
Code:

sed 's/ /\:/' $InFile \
|awk -F ":" '{print $2", "$1":"$3":"$4":"$5":"$6}' \
|rev \
|awk -F ":"  \
  '{if (substr($1,1,3)=="005") \
    print $2":"$3":"$4":"$5;  \
    else print $0}'            \
|rev

... not elegant but it works.

I am learning awk and sed and Linux programming in general. I tackled this problem as a learning exercise -- that's the reason for working on two solutions. Learn by doing.

So the question remains on the table: is there a clever RegEx which can do the whole job?

Daniel B. Martin

ntubski 10-16-2012 01:18 PM

Quote:

Originally Posted by danielbmartin (Post 4807340)
So the question remains on the table: is there a clever RegEx which can do the whole job?

Yes. I use GNU sed's -r option because this is ugly enough without backslashes covering everything.
Code:

# FirstName = \1, LastName = \2, MiddleFields = \3,
# SalaryAll = \4, SalaryEndingIn500 = \5, SalaryOthers = \6
sed -r 's/([^:]*) ([^:]*):(.*)((:[0-9]*500)|(:[0-9]*))$/\2, \1:\3\6/' $InFile
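
To see which group captures what, one option is to swap the replacement for a debugging string that prints the groups. This is only a sketch: show_groups is a made-up helper name, and -E is used as the portable spelling of GNU sed's -r.

```shell
# Hypothetical helper: same regex as above, but the replacement
# prints groups 1, 2, 5 and 6 so the alternation is visible.
show_groups() {
  sed -E 's/([^:]*) ([^:]*):(.*)((:[0-9]*500)|(:[0-9]*))$/first=[\1] last=[\2] drop=[\5] keep=[\6]/'
}

printf '%s\n' \
  'Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500' \
  'Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001' |
show_groups
# first=[Betty] last=[Boop] drop=[:14500] keep=[]
# first=[Igor] last=[Chevsky] drop=[] keep=[:45001]
```

Only one branch of the alternation can match on a given line, so exactly one of \5 and \6 is non-empty. Emitting \6 and never \5 in the real replacement is what deletes a salary ending in 500.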

Also, in the same vein as alinas's suggestion of using multiple -e arguments, you can sequence commands using ";" in sed:
Code:

sed 's/\([^:]*\) \([^:]*\):/\2, \1:/; s/\(.*\)\(:[0-9]*500$\)/\1/' $InFile

Here's a simpler awk solution:
Code:

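awk -F: -vOFS=: '{
  # position of the space in front of the last word of the name field
  last_name_index = match($1, / [^ ]+$/);
  # rebuild the name as "LastName, everything before it"
  $1 = substr($1, last_name_index+1) ", " substr($1, 1, last_name_index-1);
  # drop the last field (the salary) when it ends in 500
  if ($NF ~ /500$/) NF--;
  print
}' $InFile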


firstfire 10-16-2012 01:36 PM

Hi.

This looks a bit shorter:
Code:

$ sed -r 's/([^:]*) ([^: ]*):/\2, \1:/; s/:[0-9]*500$//' infile
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400


danielbmartin 10-16-2012 02:21 PM

Thank you, ntubski and firstfire, for your valued input. We still don't have a solution using a single RegEx... but there's no sense in beating one's brains out to construct one complex RegEx when two simpler ones do the job nicely. This thread is solved!

Daniel B. Martin

ntubski 10-16-2012 02:38 PM

Quote:

Originally Posted by danielbmartin (Post 4807424)
We still don't have a solution using a single RegEx... but there's no sense in beating one's brains out to construct one complex RegEx when two simpler ones do the job nicely.

Please reread my post, the first sed command is a single regex solution. Admittedly it's longer than the 2 regex solution...

firstfire 10-16-2012 02:46 PM

Yes, ntubski already gave a single-regex solution. Here is another one:
Code:

$ sed -r 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500)?$=\2, \1\3=' in
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Dennis T. Morgan:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400

It depends on the fact that the substring just before the salary (the end of the birth date) looks like '/..'.

EDIT: Now I see, it did not work: only the second line was transformed (the unchanged lines were marked in red in the original post).
EDIT: Using perl's non-greedy (*?) pattern:
Code:

$ perl -pe 's/([^:]*) ([^: ]*):(.*?)(:[0-9]*500)?$/\2, \1:\3/' infile
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400

One may strip a few more characters:
Code:

$ perl -pe 's/(.*?) (\w*):(.*?)(:[0-9]*500)?$/\2, \1:\3/' in

ntubski 10-16-2012 03:19 PM

Quote:

Originally Posted by firstfire (Post 4807443)
EDIT: Now I see, it did not work.

Ah, but the tiniest of changes can make it work:
Code:

% sed -r 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500$)?=\2, \1\3=' people.txt
Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Boop, Betty:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23
Chevsky, Igor:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:45001
Morgan, Dennis T.:500-462-6542:500 Lynn Road, Troy, NY 12180:3/31/52:41400

Much shorter than my attempt, but still a bit longer than the 2 regex.
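
A quick way to see what moving the "$" changes is to run both variants on a record whose salary does not end in 500. The function names anchor_outside and anchor_inside are hypothetical, and -E stands in for -r.

```shell
# "$" outside the optional group: every match must reach the end of
# the line, so a line with a non-500 salary does not match at all.
anchor_outside() { sed -E 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500)?$=\2, \1\3='; }

# "$" inside the optional group: the match may stop right after the
# date, leaving a non-500 salary untouched behind it.
anchor_inside() { sed -E 's=([^:]*) ([^: ]*)(:.*/..)(:[0-9]*500$)?=\2, \1\3='; }

line='Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300'
printf '%s\n' "$line" | anchor_outside
# Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
printf '%s\n' "$line" | anchor_inside
# Blenheim, Steve:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
```

With the anchor outside, the name swap and the salary removal succeed or fail together, which is why only the Betty Boop line changed in firstfire's first attempt.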

firstfire 10-16-2012 03:42 PM

Quote:

Originally Posted by ntubski (Post 4807477)
Ah, but the tiniest of changes can make it work:

Excellent! You beat me on my own field! :) Why didn't I figure it out myself?..

danielbmartin 10-16-2012 04:26 PM

Quote:

Originally Posted by ntubski (Post 4807435)
Please reread my post, the first sed command is a single regex solution.

You are right... my apologies!

Daniel B. Martin

grail 10-17-2012 12:32 PM

Awk:
Code:

awk -F: '{$1 = gensub(/(.*) (.*)/,"\\2, \\1","1",$1)}$NF ~ /500$/{NF = NF - 1}1' OFS=":" file

Ruby:
Code:

ruby -F: -ape '$_ = [$F[0].scan(/(.*) (.*)/)[0].reverse.join(", "),$F[1..($F[-1]=~/500$/?-2:-1)]].join(":").chomp + "\n"' file


All times are GMT -5.