LinuxQuestions.org
Old 05-03-2010, 09:09 AM   #1
webhope
Member
 
Registered: Apr 2010
Posts: 184

Rep: Reputation: 30
awk - simple example


I found this simple awk example:

Code:
awk '
BEGIN { a = "1abc 2def"
     b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
      print b     }'
However, I don't understand why the regular expression in parentheses doesn't work the way I would expect. If I delete the .+ it seems to do the same thing.

Code:
awk '
BEGIN { a = "1abc 2def"
     b = gensub(/() ()/, "\\2 \\1", "g", a)
      print b     }'
I wanted to use a regular expression in parentheses, like ([/+[]()]), to match specific characters.

Code:
awk '
BEGIN { a = "1abc 2def"
     b = gensub(/([/+[]()])/, "//\\1", "g", a)
      print b     }'
But why isn't the regexp in () working?

Last edited by webhope; 05-03-2010 at 09:10 AM.
 
Old 05-03-2010, 12:23 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
To me the first two examples are not the same:
Code:
$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
>       print b     }'
2def 1abc
This matches two strings of one or more characters each, separated by a space. Ok.
Code:
$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/() ()/, "\\2 \\1", "g", a)
>       print b     }'
1abc 2def
This matches a space surrounded by two "null strings" (the empty groups) and replaces it with a space surrounded by the same two matched "null strings". In other words, it does nothing. To see it better, consider the following two examples:
Code:
$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/() ()/, "XXX", "g", a)
>       print b     }'
1abcXXX2def
Here the space is replaced by the string constant "XXX". And:
Code:
$ awk '
> BEGIN { a = "1abc 2def"
>      b = gensub(/()/, "X", "g", a)
>       print b     }'
X1XaXbXcX X2XdXeXfX
Here every "null string" is replaced by "X", so an X appears at every position. I'm not sure about your last example. Can you please explain a bit more what you want to achieve?
 
Old 05-03-2010, 01:43 PM   #3
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by colucix
To me the first two examples are not the same...
Not sure about your last example. Please, can you explain what do you want to achieve a bit more?
Thanks for answering. I have read about four awk manuals, but they say little about parentheses and how to work with them. I need to escape some characters. The characters are in a block of text (a variable). I want to replace / * + ( ) [ ] with \/ \* \+ \( \) \[ \]

Code:
awk 'BEGIN { a = " / * + ( ) [ ] "
b = gensub(/([/*+()[]])/, "\\/1", "g", a)
print b     }'
I have not seen a character class used inside parentheses anywhere, so I don't know whether it is correct. I would expect the command to find any one of the characters in the class and add a backslash before it.

Edit:
You surprised me that I can work with the "null string" in this way. Your last example is clever.

Last edited by webhope; 05-03-2010 at 01:45 PM.
 
Old 05-03-2010, 05:12 PM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Well, first we have to analyze the replacement string. Take a look at the caveats about literal backslashes and ampersands explained in the GNU awk user's guide. In particular, look at "Table 8.5 Escape Sequence Processing for gensub". It states that you can obtain a literal backslash followed by the matched text using \\\\&.

Let's try a simple example: we want to match a literal asterisk:
Code:
$ awk 'BEGIN { a = "This is an asterisk *"
> b = gensub(/(*)/, "\\\\&", "g", a)
> print b }'
This is an asterisk \*
It works. Now we want to match an asterisk and a plus sign. We use a character list now:
Code:
$ awk 'BEGIN { a = "These are an asterisk * and a plus +"
> b = gensub(/([*+])/, "\\\\&", "g", a)
> print b }'
These are an asterisk \* and a plus \+
Ok. Let's add a slash:
Code:
$ awk 'BEGIN { a = "Here we go * + /"
> b = gensub(/([*+/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/
This shows us that a slash inside the character list does not act as closing slash for the regular expression. We are lucky.

Now the difficult part: square brackets. We have to be careful because if we put a closing square bracket in the wrong place, awk might think we want to close the character list. Let's try:
Code:
$ awk 'BEGIN { a = "Here we go * + / ]"
> b = gensub(/([*]+/])/, "\\\\&", "g", a)
> print b }'
awk: cmd. line:1: fatal: Unmatched ( or \(: /([*]+/
Naah... wrong place. It thinks we have closed the character list, so that the following slash closes the regular expression and the first open parenthesis ( remains unmatched.
Code:
$ awk 'BEGIN { a = "Here we go * + / ]"
> b = gensub(/([]*+/])/, "\\\\&", "g", a)
> print b }'
awk: cmd. line:1: fatal: Unmatched [ or [^: /([]*+/
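This is weird. Placed at the beginning of the character list, the closing bracket should be interpreted literally, but instead we get an unmatched [. The error message shows that the regexp was cut off at the slash: it seems the parser looking for the end of the /.../ constant took [] for an already closed list, so the following slash ended the regexp. What if we escape it?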
Code:
$ awk 'BEGIN { a = "Here we go * + / ]"
> b = gensub(/([\]*+/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/ \]
Hey... this seems to work! But pay attention to the following. We don't escape the closing square bracket but we add an opening one somewhere inside the character list:
Code:
$ awk 'BEGIN { a = "Here we go * + / ] ["
> b = gensub(/([]*+[/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/ \] \[
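This works. The two square brackets inside the character list are interpreted literally and we have matched the opening bracket at the same time. Presumably the extra [ makes the parser believe the slash sits inside a second list, so it no longer ends the regexp there. We have fooled awk!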

Now the parentheses, but I prefer to leave you the pleasure (?) of finding out the caveats, if you don't mind. Here is one of the working solutions:
Code:
$ awk 'BEGIN { a = "Here we go * + / ] [ ( )"
> b = gensub(/([])(*+[/])/, "\\\\&", "g", a)
> print b }'
Here we go \* \+ \/ \] \[ \( \)
In any case, keep in mind that escaping square brackets and parentheses inside a character list is the most straightforward solution. Maybe.
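
As a quick way to check a candidate character list, the sketch below runs each of the seven characters through the final list one at a time. It is only an illustration, not part of the original session: the function name check_class is made up here, gsub() is used instead of gensub() so that it also runs in awks other than gawk, and the slash is written as \/ so that it cannot end the regexp early.

```shell
#!/bin/sh
# check_class is a hypothetical helper name used only for this sketch.
check_class() {
    awk 'BEGIN {
        # The seven characters the thread wants to escape.
        n = split("/ * + ( ) [ ]", ch, " ")
        for (i = 1; i <= n; i++) {
            out = ch[i]
            # ] comes first in the list so it is literal; the slash is escaped.
            gsub(/[][()*+\/]/, "\\\\&", out)
            printf "%s -> %s\n", ch[i], out
        }
    }'
}

check_class
# / -> \/
# * -> \*
# + -> \+
# ( -> \(
# ) -> \)
# [ -> \[
# ] -> \]
```

If a character is missing from the list, its line shows the same text on both sides of the arrow, which makes a broken list easy to spot.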
 
Old 05-03-2010, 05:57 PM   #5
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
I am having some trouble understanding again. First, I don't understand the term caveat(s). Any synonym?

I read this, but still didn't understand what is written there:
Table 8.4: POSIX 2001 rules for sub

OK, next I had a problem understanding the part about \\\\&
I didn't understand your sentence: "It states you can obtain a literal backslash ... \\\\&." But I think I understand now. So the \\\\ produces a backslash and the & is like a reference to the content of the parentheses? To one or more characters in the parentheses?

gensub(/(*)/, "\\\\&", "g"

I will study your next examples tomorrow. I looked at them just now, but I am too tired to understand them. I would use something like ([a-zA-Z[]]) , but is that correct, or is the class closed too early by one of the brackets inside it? Or is your way better: ([[a-zA-Z]])
 
Old 05-04-2010, 12:26 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Quote:
First, I don't understand the term caveat(s). Any synonym?
http://dictionary.reference.com/browse/caveat

Quote:
OK, next I had a problem understanding the part about \\\\&
I didn't understand your sentence: "It states you can obtain a literal backslash ... \\\\&." But I think I understand now. So the \\\\ produces a backslash and the & is like a reference to the content of the parentheses? To one or more characters in the parentheses?
1. If you are not using the numbered back-references (for example, "\\1"), you do not need the round brackets in the regex, as in /(*)/. You only need () if you are going to use \\1.
2. & - in sub, gsub or gensub, an & in the replacement stands for whatever text the regex matched, not for the regex itself.
3. \\\\ - four backslashes are required. Two (\\) reach the function as a single \, and that single \ only escapes the next character, in your case the &, so the function sees \& and produces a literal &, which is not what you want. With \\\\ the function sees \\&, which gives one literal backslash followed by the ampersand, and the ampersand, as above, is a copy of the matched text.
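
To see the three cases side by side, here is a small sketch that applies three replacement strings to the same input. The function name demo_replacements and the sample text "a*b" are made up for the illustration, and gsub() is used so that it runs in any POSIX awk; for these three strings gensub() in gawk gives the same results.

```shell
#!/bin/sh
# demo_replacements is a hypothetical name used only for this sketch.
demo_replacements() {
    awk 'BEGIN {
        s = "a*b"
        t = s; gsub(/[*]/, "<&>", t);    print t   # & alone: the matched text
        t = s; gsub(/[*]/, "\\&", t);     print t   # \\& : a literal ampersand
        t = s; gsub(/[*]/, "\\\\&", t);   print t   # \\\\& : backslash, then the matched text
    }'
}

demo_replacements
# a<*>b
# a&b
# a\*b
```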

Hope that helps.
 
Old 05-04-2010, 02:56 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by grail
1. If you are not using the numbered back-references (for example, "\\1"), you do not need the round brackets in the regex, as in /(*)/. You only need () if you are going to use \\1.
Correct. Thank you for the clarification.
 
Old 05-04-2010, 03:18 AM   #8
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by grail
http://dictionary.reference.com/browse/caveat
1. If you are not using the numbered back-references (for example, "\\1"), you do not need the round brackets in the regex, as in /(*)/. You only need () if you are going to use \\1.
2. & - in sub, gsub or gensub, an & in the replacement stands for whatever text the regex matched, not for the regex itself.
I found this post very useful. Thanks

Last edited by webhope; 05-04-2010 at 03:19 AM.
 
Old 05-04-2010, 03:36 AM   #9
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by colucix
We don't escape the closing square bracket but we add an opening one somewhere inside the character list:
Code:
b = gensub(/([]*+[/])/, "\\\\&", "g", a)
This works. The two square brackets inside the character list are interpreted literally and we have matched the opening bracket at the same time. We have fooled awk!
Hey, I don't understand how this can work. I would expect the brackets [] to be interpreted as the beginning and end of an empty character class.

Does it mean that if the right bracket ] comes directly after the left bracket [ , it is interpreted as a normal character? Then the second left bracket [ is interpreted as a normal character and the second right bracket ] is interpreted as the end of the character class?
 
Old 05-04-2010, 04:03 AM   #10
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
I reworked the example, but I can't get the external variable into awk:
Code:
block_new="   /   +  (      ) [ ] "  
block_new=$( echo $block_new |  awk 'BEGIN { a="'$block_new'"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }')
echo $block_new
Code:
awk: BEGIN { a="
awk:           ^ unterminated string
Or

Code:
block_new="   /   +  (      ) [ ] "  
block_new=$( awk 'BEGIN { a="'$block_new'"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }')
echo $block_new
How do I do this if I want to pass the variable $block_new to awk and then get the result back into a bash variable?

Last edited by webhope; 05-04-2010 at 06:00 AM.
 
Old 05-04-2010, 06:07 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Change the order of your quotes:
Code:
a='"$<variable>"'
There was a thread a little while back on LQ about character classes and how to include square brackets [].
What was discovered is that as long as the first character after the opening square bracket is the closing square bracket, it is treated as a member of the class
and not as the closing bracket, whereas all other characters behave the same as in any other character class.
Therefore, as long as it starts with:
Code:
/[]<other stuff here>]/
Then it seems ok.
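
A small sketch of why the position of ] matters, assuming a POSIX-style awk. The function name bracket_position_demo and the test strings are made up for the illustration.

```shell
#!/bin/sh
# bracket_position_demo is a hypothetical name used only for this sketch.
bracket_position_demo() {
    awk 'BEGIN {
        # ] first: it is a literal member, so the class holds ] and a.
        print (("]"  ~ /^[]a]$/) ? "match" : "no match")
        # ] later: it closes the class, leaving [a] followed by a literal ].
        print (("]"  ~ /^[a]]$/) ? "match" : "no match")
        print (("a]" ~ /^[a]]$/) ? "match" : "no match")
    }'
}

bracket_position_demo
# match
# no match
# match
```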
 
Old 05-04-2010, 06:12 AM   #12
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
Thank you for the help, I was almost going crazy.

Last edited by webhope; 05-04-2010 at 06:29 AM.
 
Old 05-04-2010, 06:23 AM   #13
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30
But

Code:
awk 'BEGIN { a='"$block_new"'
> b = gensub(/([\]*+/])/, "\\\\&", "g", a)
> print b }'
awk: BEGIN { a=   /   +  (      ) [ ]
awk:               ^ unterminated regexp
[root@localhost mail]#
Are you sure the '"$<variable>"' is correct? The value reaches awk without the "" around it.

Or this:
Code:
block_new="   /   +  (      ) [ ] "  
awk "BEGIN { a='"$block_new"'
b = gensub(/([\]*+/])/, "\\\\&", "g", a)
print b }"
results in:
Code:
awk "BEGIN { a='"$block_new"'
> b = gensub(/([\]*+/])/, "\\\\&", "g", a)
> print b }"
awk: BEGIN { a='
awk:           ^ invalid char ''' in expression
[1] 30706
bash: , g, a)
print b }: command not found
[1]+  Exit 1                  awk "BEGIN { a='"$block_new"'
b = gensub(/([\]*+/])/, "\\\\
[root@localhost mail]#
This is really crazy!
Code:
block_new="   /   +  (      ) [ ] "  
awk 'BEGIN { a="'$block_new'"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }'
+ awk 'BEGIN { a="' / + '(' ')' '[' ']' '"; b = gensub(/([])(*+[/])/, "\\\\&", "g", a); print b }'
awk: BEGIN { a="
awk:           ^ unterminated string
Why is the external variable interpreted as ' / + '(' ')' '[' ']' ' ? That breaks the awk program.

Last edited by webhope; 05-04-2010 at 06:36 AM.
 
Old 05-04-2010, 07:54 AM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
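Turns out the variable needs to be quoted both inside and outside the awk program. Left unquoted, the shell splits $block_new on whitespace into separate arguments, so awk receives only a=" as its program text: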
Code:
echo | awk 'BEGIN{a="'"$block"'"}{b=gensub(/[]*+\/]/,"\\\\&","g",a);print b}'
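
An alternative that avoids splicing the value into the program text is to hand it over through the environment and read it from ENVIRON. This is a sketch of that different approach, not the method used above: the function name escape_specials and the environment variable name TEXT are made up here, and gsub() replaces gensub() so it is not tied to gawk.

```shell
#!/bin/sh
# escape_specials and TEXT are hypothetical names used only for this sketch.
escape_specials() {
    # TEXT is exported to this one awk process only; the awk program itself
    # stays a constant, so the value can never break its syntax.
    TEXT=$1 awk 'BEGIN {
        s = ENVIRON["TEXT"]
        gsub(/[][()*+\/]/, "\\\\&", s)
        print s
    }'
}

block_new="   /   +  (      ) [ ] "
block_new=$(escape_specials "$block_new")
printf '%s\n' "$block_new"
```

awk -v a="$block_new" would also work for this input, but -v interprets backslash escape sequences in the value, so text that already contains backslashes is safer to pass through ENVIRON.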
 
Old 05-04-2010, 09:00 AM   #15
webhope
Member
 
Registered: Apr 2010
Posts: 184

Original Poster
Rep: Reputation: 30

Thanks, this was helpful!!
 
  


All times are GMT -5. The time now is 03:40 PM.
