
nemobluesix 02-16-2009 12:27 AM

awk regexp for one character match

I'm working on an awk script, and one of the rules should change strings like:

%some text %param_id% some text%

into:

%some text 33 some text%

I tried with this expression

and it works great but it has a problem. If the original string ends with %param_id% like this

%some text %param_id%%
(which may happen very often) then the string matched is %param_id%% and not %param_id% and, of course, prm[1] becomes "id%" instead of "id".

I read the manual and it says that the expression should look like

but that doesn't work :(

Any ideas?
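
The expressions themselves did not survive in the post above, so the following is only a sketch of the likely cause. It assumes a greedy pattern of the form %param_.+% (an assumption, not the poster's actual expression) and compares it with a negated character class. It uses only POSIX match(), RSTART and RLENGTH, so it should run in any awk; the helper name show_match is made up for this illustration.

```shell
# Hypothetical helper: print the text that a regex matches inside a string.
show_match() {
    # $1 = regex (assumed pattern, see the note above), $2 = input string
    awk -v re="$1" -v s="$2" 'BEGIN {
        if (match(s, re)) print substr(s, RSTART, RLENGTH)
    }'
}

# ".+" also accepts "%", so the match runs on to the last "%" in the string.
greedy=$(show_match '%param_.+%' '%some text %param_id%%')
# "[^%]+" cannot cross a "%", so the match stops at the first closing "%".
bounded=$(show_match '%param_[^%]+%' '%some text %param_id%%')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

The only difference between the two calls is what is allowed between the delimiters, which is what decides whether the trailing % gets swallowed.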

Tinkster 02-16-2009 02:51 AM

I'm not 100% convinced that the tool (match) you're using
is the right tool for your job. What exactly do you want
to do with what's returned in your array prm?


[tink@tink:~]$ echo '%some text %param_id%%'|awk '{print gensub(/(.*)(%param_[^%]+%)(.*)/, "\\133\\3", "1")}'
%some text 33%
[tink@tink:~]$ echo '%some text %param_id% and now what%'|awk '{print gensub(/(.*)(%param_[^%]+%)(.*)/, "\\133\\3", "1")}'
%some text 33 and now what%

nemobluesix 02-16-2009 11:39 AM

Hi Tinkster,
Thanks for your reply.
You are right, gensub is more suited here than match. I was using match to save the name of the param to use it later in a sub function. It looks silly now :). I didn't know that gensub can do that too.
Based on your code I reached this solution:


$ echo "%text text %param_id% text text text%param_name%%" | awk '
> BEGIN {
> for(i=0;i<ARGC;i++){
> if(match(ARGV[i],/param_(.+)=(.+)/,p)) param[p[1]]=p[2];
> }
> }
> {
> print gensub(/%param_([^%]+)%/, param["\\1"], "g");
> print "debug: param_id - " param["id"];
> }' param_id=100 param_name=abc
%text text  text text text%
debug: param_id - 100

This is the best I could achieve. The output should have been:

%text text 100 text text textabc%
Why is param["\\1"] empty?

If you are curious, the whole process looks like this:
1) I call the script with an unknown number of arguments and with unknown names (before runtime) like this:

$ ./test.awk param_id=12 param_name=abc ... data_file
2) the BEGIN section reads all the param_* pairs and saves each value in an array, say param, with the names used as indexes like this:
3) I parse the data_file and replace each param_* with its value:
%param_id% becomes 12
%param_name% becomes abc

The idea is that I want to replace "words" I don't know before calling the script. Maybe you have a better solution.

Tinkster 02-16-2009 12:42 PM

The problem here is that awk's regexes are greedy, and that this behaviour
(as far as I know) can not be modified. So the first expression "(.*)"
matches everything including %param_id% and only picks up the last bit.

The only work-around within awk will be two independent statements.

gensub(/(.*)(%param_id[^%]*%)(.*)/, "\\133\\3", "1")
gensub(/(.*)(%param_name[^%]*%)(.*)/, "\\133\\3", "1")

Regarding the flexible number of arguments - how are you handling
multiple param_names or ids on the command-line in terms of assignment
to variables and then arrays? Without testing, my gut says that only the
last thing on the command-line will be valid within the BEGIN section.
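
A quick way to check that hunch is to print ARGV from a BEGIN block. The sketch below uses the argument names from this thread; the wrapper name list_args is hypothetical. In POSIX awk, ARGV is filled in before BEGIN runs, while var=value operands are only applied later, at the point where awk would otherwise open them as input files.

```shell
# Hypothetical wrapper: show what a BEGIN block can see of the command line.
list_args() {
    awk 'BEGIN {
        # Every operand is already present in ARGV at this point.
        for (i = 1; i < ARGC; i++) print i ": " ARGV[i]
        # The var=value assignments have not been applied yet,
        # so the variable itself is still empty here.
        print "param_id is [" param_id "]"
    }' "$@"
}

args_seen=$(list_args param_id=12 param_name=abc)
echo "$args_seen"
```

So all operands are readable through ARGV in BEGIN; it is only the variables they would assign that are not set yet.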

Tinkster 02-16-2009 12:56 PM

And on second thought ... maybe awk isn't quite what you're after in
the first place ... have you considered using m4?

nemobluesix 02-16-2009 01:20 PM

I should have posted a new reply instead of editing my post; maybe you didn't notice the change.

I'm ok now with the regexp, it works. The problem is that I can't use my array inside gensub. As you see above, inside gensub param["\\1"] evaluates to "" and outside, as expected, to the correct value.

I have never used m4. Based on the problem described in my previous post, do you think m4 would be better? And if yes, what's the learning curve? My awk script only needs this issue fixed :) and it's done.
Thanks again.
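
On why param["\\1"] comes out empty: awk evaluates a function's arguments before the function is called, so the array lookup happens before gensub() has matched anything. A minimal sketch (plain POSIX awk, no gensub needed) of what that lookup actually uses as its key:

```shell
probe=$(awk 'BEGIN {
    param["id"] = 100
    # This is the string awk builds from "\\1" before gensub() would run:
    # a backslash followed by the digit 1, not the captured text "id".
    key = "\\1"
    printf "key=%s length=%d known=%d\n", key, length(key), (key in param)
}')
echo "$probe"
```

The key is the two-character string \1, which is not an index of param, so the replacement passed to gensub() is already the empty string.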

nemobluesix 02-16-2009 01:28 PM

I'm not using the arguments from the command line as variables, as they were meant to be. I'm using both param_anything and value as values. param_id and param_name were just examples; they can be anything the user thinks of.
I couldn't find a better way to pass these things inside the script...

nemobluesix 02-16-2009 10:50 PM

working code
well... this combination might not be the best but it works

$ cat test.awk
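#! /bin/awk -f
BEGIN {
        # Store each param_NAME=VALUE argument as param[NAME] (3-argument match() is a gawk extension).
        for(i=0;i<ARGC;i++){
                if(match(ARGV[i],/param_(.+)=(.+)/,p)) param[p[1]]=p[2];
        }
}
{
        while(match($0,/%param_([^%]+)%/,pa)){ if(!sub(/%param_[^%]+%/, param[pa[1]],$0)) break; }
}
{ print $0; }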

$ echo %text text %param_one% text text%param_two% text text text text%param_n%% | ./test.awk param_one=111 param_two=222 param_n=xxx
%text text 111 text text222 text text text textxxx%

I could not get gensub to read param["\\1"] so I, again, used match to save the "\\1" piece.
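
For an awk without the gawk extensions (gensub() and the three-argument match()), roughly the same result can be had with POSIX match(), RSTART, RLENGTH and substr(). This is only a sketch under that assumption, not the script from this thread; the wrapper name expand_params and the variable names are made up.

```shell
# Hypothetical wrapper: expand %param_NAME% placeholders read from stdin,
# taking the values from param_NAME=VALUE arguments.
expand_params() {
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++) {
            eq = index(ARGV[i], "=")
            # "param_" is 6 characters long, so the name starts at position 7.
            if (ARGV[i] ~ /^param_/ && eq > 7)
                param[substr(ARGV[i], 7, eq - 7)] = substr(ARGV[i], eq + 1)
        }
    }
    {
        out = ""
        rest = $0
        while (match(rest, /%param_[^%]+%/)) {
            # Skip the 7 characters of "%param_" and drop the closing "%".
            name = substr(rest, RSTART + 7, RLENGTH - 8)
            out = out substr(rest, 1, RSTART - 1) param[name]
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}

result=$(printf '%s\n' '%text text %param_one% text text%param_two% text text text text%param_n%%' |
    expand_params param_one=111 param_two=222 param_n=xxx)
echo "$result"
```

Splicing with substr() instead of calling sub() means a value containing & or a backslash is inserted literally, and because the scan only moves forward, a value that itself looks like a placeholder is not expanded a second time. A placeholder with no matching argument expands to an empty string.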

All times are GMT -5.