[SOLVED] awk

webhope · 05-03-2010, 09:09 AM

I found this simple awk example:

Code:

awk '
BEGIN { a = "1abc 2def"
     b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
      print b     }'

However, I don't understand why the regular expression in parentheses doesn't work the way I would expect. If I delete the .+, it seems to do the same thing.

Code:

awk '
BEGIN { a = "1abc 2def"
     b = gensub(/() ()/, "\\2 \\1", "g", a)
      print b     }'

I wanted to use a regular expression in parentheses like ([/+[]()]) to match specific characters.

Code:

awk '
BEGIN { a = "1abc 2def"
     b = gensub(/([/+[]()])/, "//\\1", "g", a)
      print b     }'

But why isn't the regexp in () working?

colucix · 05-03-2010, 12:23 PM

To me the first two examples are not the same:

Code:

$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
>       print b     }'
2def 1abc

This matches two strings of one or more characters separated by a space. OK.

Code:

$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/() ()/, "\\2 \\1", "g", a)
>       print b     }'
1abc 2def

This matches a space surrounded by "null strings" and replaces it with a space surrounded by the two matched null strings. In other words, it does nothing. To see it better, consider the following two examples:

Code:

$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/() ()/, "XXX", "g", a)
>       print b     }'
1abcXXX2def

Here the space is replaced by the string constant "XXX". And

Code:

$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/()/, "X", "g", a)
>       print b     }'
X1XaXbXcX X2XdXeXfX

Here every "null string" is replaced by "X". I am not sure about your last example. Could you explain a bit more what you want to achieve?

webhope · 05-03-2010, 01:43 PM

Quote:

Originally Posted by colucix

To me the first two examples are not the same...
I am not sure about your last example. Could you explain a bit more what you want to achieve?

Thanks for answering. I have read about four awk manuals, but they contain little information about parentheses and how to work with them. I need to escape some characters. The characters are in a block of text (a variable). I want to replace / * + ( ) [ ] with \/ \* \+ \( \) \[ \]

Code:

awk 'BEGIN { a = " / * + ( ) [ ] "
b = gensub(/([/*+()[]])/, "\\/1", "g", a)
print b     }'

I have not seen a character class used inside parentheses anywhere, so I don't know whether it is correct. I would expect the command to find any one of the characters in the class and add a backslash before it.

Edit:
You surprised me that I can work with a "null string" in this way. Your last example is clever.

colucix · 05-03-2010, 05:12 PM

Well... first we have to analyze the replacement string. Take a look at the caveats about literal backslashes and ampersands explained in the GNU awk user's guide. In particular, look at "Table 8.5 Escape Sequence Processing for gensub". It states that you can obtain a literal backslash followed by the matched text using \\\\&.

Let's try a simple example: we want to match a literal asterisk:

Code:

$ awk 'BEGIN { a = "This is an asterisk *"
> b = gensub(/(*)/, "\\\\&", "g", a)
> print b }'
This is an asterisk \*

It works. Now we want to match an asterisk and a plus sign. We use a character list now:

Code:

$ awk 'BEGIN { a = "These are an asterisk * and a plus +"
> b = gensub(/([*+])/, "\\\\&", "g", a)
> print b }'
These are an asterisk \* and a plus \+

Ok. Let's add a slash:

Code:

$ awk 'BEGIN { a = "Here we go * + /"
> b = gensub(/([*+/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/

This shows us that a slash inside the character list does not act as closing slash for the regular expression. We are lucky.

Now the difficult part: square brackets. We have to be careful because if we put a closing square bracket in the wrong place, awk might think we want to close the character list. Let's try:

Code:

$ awk 'BEGIN { a = "Here we go * + / ]"
> b = gensub(/([*]+/])/, "\\\\&", "g", a)
> print b }'
awk: cmd. line:1: fatal: Unmatched ( or \(: /([*]+/

Naah... wrong place. It thinks we have closed the character list, so that the following slash closes the regular expression and the first open parenthesis ( remains unmatched.

Code:

$ awk 'BEGIN { a = "Here we go * + / ]"
> b = gensub(/([]*+/])/, "\\\\&", "g", a)
> print b }'
awk: cmd. line:1: fatal: Unmatched [ or [^: /([]*+/

This is weird. Placed at the beginning of the character list, the closing bracket should be interpreted literally, instead we have an unmatched [. What if we escape it?

Code:

$ awk 'BEGIN { a = "Here we go * + / ]"
> b = gensub(/([\]*+/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/ \]

Hey... this seems to work! But pay attention to the following. We don't escape the closing square bracket but we add an opening one somewhere inside the character list:

Code:

$ awk 'BEGIN { a = "Here we go * + / ] ["
> b = gensub(/([]*+[/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/ \] \[

This works. The two square brackets inside the character list are interpreted literally and we have matched the opening bracket at the same time. We have fooled awk!

Now the parentheses, but I prefer to leave you the pleasure (?) of finding out the caveats, if you don't mind. Here is one of the working solutions:

Code:

$ awk 'BEGIN { a = "Here we go * + / ] [ ( )"
> b = gensub(/([])(*+[/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/ \] \[ \( \)

In any case, keep in mind that escaping square brackets and parentheses inside a character list is the most straightforward solution. Maybe.
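
For reference, here is a minimal sketch of the whole job with a single character list. It is only an illustration under a few assumptions: it uses the POSIX gsub() instead of gensub() so that it does not depend on gawk, the sample text is made up, and the slash is written as \/ so that no awk lexer can mistake it for the end of the regexp. The closing bracket sits first in the list, where it is taken literally.

```shell
# Sample text (made up for this sketch).
text='a/b * c + (d) [e]'

# ] comes first in the list, so it is literal; \/ keeps the slash from ending the regexp.
# "\\\\&" in the awk string reaches gsub as \\& : a literal backslash plus the matched text.
escaped=$(printf '%s\n' "$text" | awk '{ gsub(/[][()*+\/]/, "\\\\&"); print }')

printf '%s\n' "$escaped"
```

With gawk this should print a\/b \* c \+ \(d\) \[e\]. No capturing parentheses are needed here because the replacement uses & and not \\1.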

webhope · 05-03-2010, 05:57 PM

I am again having some trouble understanding. First, I don't understand the term "caveat(s)". Is there a synonym?

I read it, but I still didn't understand what is written there:
Table 8.4: POSIX 2001 rules for sub

OK, I had a problem understanding the part about \\\\&
I didn't understand your sentence: "It states you can obtain a literal backslash ... \\\\&." But I think I understand now. So the \\\\ produces a backslash and the & is like a reference to the content of the parentheses? To one or more characters in the parentheses?

gensub(/(*)/, "\\\\&", "g"

I will study your next examples tomorrow. I looked at them just now, but I am too tired to understand them. I would use something like ([a-zA-Z[]]), but is that correct, or is the class interpreted as closed by the first closing bracket? Or is your way better: ([[a-zA-Z]])

grail · 05-04-2010, 12:26 AM

Quote:

First, I don't understand the term "caveat(s)". Is there a synonym?

http://dictionary.reference.com/browse/caveat

Quote:

OK, I had a problem understanding the part about \\\\&
I didn't understand your sentence: "It states you can obtain a literal backslash ... \\\\&." But I think I understand now. So the \\\\ produces a backslash and the & is like a reference to the content of the parentheses? To one or more characters in the parentheses?

1. If you are not using the numbered references (for example, "\\1"), you do not need the round brackets in the regex, as in /(*)/; you only need () if you are going to use \\1.
2. & - in sub, gsub or gensub, if & is used in the replacement it stands for whatever the regex matched: with /<whatever is in here>/, & = <the text that matched>.
3. \\\\ - four backslashes are required. Two (\\) would be reduced to a single \, and that single backslash would only escape the next character, in your case the &, so the result would be a literal & and no backslash, which is not what you want. With \\\\ you end up with \\&, that is, an escaped backslash followed by your ampersand, which, as above, stands for the text the regex matched.
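
To see the three cases side by side, here is a small sketch. It assumes the POSIX gsub() in place of gensub() (the gawk manual's tables give the same result for these three replacement strings), and the input a*b is made up.

```shell
s='a*b'

# "&" alone stands for the matched text, here the asterisk.
plain=$(printf '%s\n' "$s" | awk '{ gsub(/[*]/, "[&]"); print }')     # a[*]b

# "\\&" reaches gsub as \& : a literal ampersand, the match is dropped.
amp=$(printf '%s\n' "$s" | awk '{ gsub(/[*]/, "\\&"); print }')       # a&b

# "\\\\&" reaches gsub as \\& : a literal backslash, then the matched text.
bslash=$(printf '%s\n' "$s" | awk '{ gsub(/[*]/, "\\\\&"); print }')  # a\*b

printf '%s\n%s\n%s\n' "$plain" "$amp" "$bslash"
```

The trailing comments show what gawk prints for each line.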

Hope that helps.

colucix · 05-04-2010, 02:56 AM

Quote:

Originally Posted by grail

1. If you are not using the numbered references (for example, "\\1"), you do not need the round brackets in the regex, as in /(*)/; you only need () if you are going to use \\1.

Correct. Thank you for the clarification.

webhope · 05-04-2010, 03:18 AM

Quote:

Originally Posted by grail

http://dictionary.reference.com/browse/caveat
1. If you are not using the numbered references (for example, "\\1"), you do not need the round brackets in the regex, as in /(*)/; you only need () if you are going to use \\1.
2. & - in sub, gsub or gensub, if & is used in the replacement it stands for whatever the regex matched: with /<whatever is in here>/, & = <the text that matched>.

I found this post very useful. Thanks

webhope · 05-04-2010, 03:36 AM

Quote:

Originally Posted by colucix

We don't escape the closing square bracket but we add an opening one somewhere inside the character list:

Code:

b = gensub(/([]*+[/])/, "\\\\&", "g", a)

This works. The two square brackets inside the character list are interpreted literally and we have matched the opening bracket at the same time. We have fooled awk!

Hey, I don't understand how this can work. I would expect the brackets [] to be interpreted as the beginning and end of an empty character class.

Does it mean that if the right bracket ] comes immediately after the left bracket [, it is interpreted as a normal character? Then the second left bracket [ is interpreted as a normal character and the second right bracket ] is interpreted as the end of the character class.

webhope · 05-04-2010, 04:03 AM

I reworked the example, but I can't get the external variable into awk:

Code:

block_new="   /   +  (      ) [ ] "  
block_new=$( echo $block_new |  awk 'BEGIN { a="'$block_new'"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }')
echo $block_new

Code:

awk: BEGIN { a="
awk:           ^ unterminated string

Or

Code:

block_new="   /   +  (      ) [ ] "  
block_new=$( awk 'BEGIN { a="'$block_new'"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }')
echo $block_new

How can I pass the variable $block_new to awk and then get the result back into a bash variable?

grail · 05-04-2010, 06:07 AM

Change the order of your quotes:

Code:

a='"$<variable>"'

There was a thread a little while back on LQ about character classes and how to include square brackets [] in them.
What was discovered is that as long as the first character after the opening square bracket is the closing square bracket, it is treated as a member of the class and not as the closing bracket, whereas all other characters behave as they do in any other character class.
Therefore, as long as it starts with:

Code:

/[]<other stuff here>]/

Then it seems ok.
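
A quick way to check this rule is the sketch below. It uses gsub() and made-up input; the only thing it relies on is that a ] placed first in the list is an ordinary member, while a ] anywhere later closes the list.

```shell
# []x] is one list containing ] and x, so each of those characters is replaced.
first=$(printf '%s\n' 'a]xb' | awk '{ gsub(/[]x]/, "_"); print }')

# [x]] is the list [x] followed by a literal ], so only the pair "x]" matches.
later=$(printf '%s\n' 'a]xb x]' | awk '{ gsub(/[x]]/, "_"); print }')

printf '%s\n%s\n' "$first" "$later"
```

With gawk the first line comes out as a__b and the second as a]xb _, which shows that only the position of the ] changes its meaning.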

webhope · 05-04-2010, 06:12 AM

Thank you for the help, I was almost going crazy.

webhope · 05-04-2010, 06:23 AM

But

Code:

awk 'BEGIN { a='"$block_new"'
> b = gensub(/([\]*+/])/, "\\\\&", "g", a)
> print b }'
awk: BEGIN { a=   /   +  (      ) [ ]
awk:               ^ unterminated regexp
[root@localhost mail]#

Are you sure that '"$<variable>"' is correct? The error output shows no "" around the value.

Or this

Code:

block_new="   /   +  (      ) [ ] "  
awk "BEGIN { a='"$block_new"'
b = gensub(/([\]*+/])/, "\\\\&", "g", a)
print b }"

results in:

Code:

awk "BEGIN { a='"$block_new"'
> b = gensub(/([\]*+/])/, "\\\\&", "g", a)
> print b }"
awk: BEGIN { a='
awk:           ^ invalid char ''' in expression
[1] 30706
bash: , g, a)
print b }: command not found
[1]+  Exit 1                  awk "BEGIN { a='"$block_new"'
b = gensub(/([\]*+/])/, "\\\\
[root@localhost mail]#

This is really crazy!

Code:

block_new="   /   +  (      ) [ ] "  
awk 'BEGIN { a="'$block_new'"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }'
+ awk 'BEGIN { a="' / + '(' ')' '[' ']' '"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }'
awk: BEGIN { a="
awk:           ^ unterminated string

Why is the external variable expanded as ' / + '(' ')' '[' ']' '? That breaks the awk program.

grail · 05-04-2010, 07:54 AM

It turns out the variable needs to be quoted both inside and outside the awk program:

Code:

echo | awk 'BEGIN{a="'"$block"'"}{b=gensub(/[]*+\/]/,"\\\\&","g",a);print b}'
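
For what it is worth, the ' / + '(' ')' '[' ']' ' expansion seen earlier is the shell's word splitting of the unquoted $block_new, which is why the double quotes around the expansion matter. An alternative that was not proposed in this thread is to pass the value through the environment and read it from ENVIRON, so the text never becomes part of the awk source. The sketch below is only that, a sketch: it assumes POSIX gsub() instead of gensub() and a character list that also covers ( ) and [.

```shell
block_new='   /   +  (      ) [ ] '

# Unquoted, the shell splits the value into separate words (globbing off for the demo).
set -f
set -- $block_new
words=$#
set +f

# Quoted and passed via the environment, awk receives the value unchanged.
escaped=$(block_new="$block_new" awk 'BEGIN {
    s = ENVIRON["block_new"]
    gsub(/[][()*+\/]/, "\\\\&", s)
    print s
}')

printf '%s\n' "$words"
printf '[%s]\n' "$escaped"
```

The word count shows how many separate arguments awk would have received from the unquoted form, and the square brackets around the second line make the preserved leading and trailing spaces visible.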

webhope · 05-04-2010, 09:00 AM

Thanks, this was helpful!!