LinuxQuestions.org
Old 03-18-2010, 11:47 AM   #1
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Small awk script: critiques sought


Hello

I'm not fluent in awk and would like to improve, so I am seeking critiques of this effort: stylistic, functional, whatever ...
Code:
#!/usr/bin/awk -f

BEGIN {
    FS = "="
    tab = " "
}

/^[     ]*[aA][a-zA-Z   ]*=/ {              # Match regex first [ ] is space and tab
    keyword = $1
    gsub( "[ "tab"'\"]*", "", keyword)      # Remove any spaces, tabs and quotes
    keyword = tolower( keyword )
    if ( keyword == "archivedevice" ) {
        data = $2
        sub( "#.*", "", data )              # Strip comment (ignoring possibility of # in a quoted value)
        sub( "^[ "tab"'\"]*", "", data )    # Strip leading spaces, tabs and quotes
        sub( "[ "tab"'\"]*$", "", data )    # Strip trailing spaces, tabs and quotes
        print data
    }
}
The purpose is to extract an "Archive Device" from a Bacula Storage Daemon configuration file. Bacula configuration files include lines of the format "keyword = value". Exotically all spaces (and tabs?) and capitalisation in the keywords are ignored. In this case "Archive Device" could be validly represented as "ARCHIVE DEVICE", ArchiveDevice or even "A r c h i VE device". Values may be quoted. Spaces (and tabs?) around values are ignored except as separators. In this case there can be only one value. Comments can appear anywhere, introduced by a #.

The file is assumed syntactically correct, so there is no need to check for unbalanced quotes etc. I have chosen to ignore the possibility of a quoted value including a #.

The variable "tab" is used to improve legibility. Can it be used within the match regex, too?

Could the task have been done more elegantly using sed?

EDIT: corrected Storage Director to Storage Daemon.

Best

Charles

Last edited by catkin; 03-18-2010 at 12:02 PM.
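
On the question of reusing "tab" inside the match: a regex literal between slashes cannot interpolate a variable, but awk treats any string on the right-hand side of ~ as a dynamic regex, so the pattern can be built by concatenation. A minimal sketch, assuming "tab" holds a tab character (written "\t" here) and using a hypothetical wrapper name, match_a_keywords:

```shell
# Hypothetical helper: print lines whose keyword starts with a/A.
# The regex is built from the "tab" variable instead of a /.../ literal.
match_a_keywords() {
    awk '
    BEGIN {
        tab = "\t"
        ws  = "[ " tab "]*"                      # zero or more spaces or tabs
        pat = "^" ws "[aA][a-zA-Z " tab "]*="    # same shape as the /.../ literal
    }
    $0 ~ pat { print }
    '
}

printf '  Archive Device = /mnt/floppy\n\tA r c = x\nName = Floppy\n' | match_a_keywords
# prints the first two lines; "Name = Floppy" does not match
```

The cost of a dynamic regex is that the string is scanned twice (once as a string, once as a regex), so backslashes meant for the regex have to be doubled.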
 
Old 03-18-2010, 12:46 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Hi Charles! Sincerely, I see some redundancy in your code. For example, if you assign the first field to the variable "keyword" inside the action and check it using the "if" statement, the matching regular expression is not necessary (nor do I think it improves the speed of the script, or at least not by much).

Second, if using GNU awk you can set case-insensitivity by assigning a nonzero value to the built-in variable IGNORECASE. This spares the need for the tolower statement and the doubled character lists, such as [aA].

Finally, you can try to use a single regular expression to match the leading (and trailing) spaces, tabs, quotes and hashes.

Here is what I would try:
Code:
BEGIN {
    FS = "="
    IGNORECASE = 1
}

{
    #
    #  Remove space characters from the keyword
    #
    gsub(/[[:space:]]/,"",$1)
  
    #
    #  If the keyword matches what we are looking for, print the value after
    #  removing leading and trailing spaces, tabs, double quotes and hashes
    #
    if ( $1 ~ "archivedevice" )
       print gensub(/^[ \t"#]+|[ \t"#]+$/,"","g",$2) 
}
Indeed, this does not meet the requirement for readability, but it should give you an idea of the power of the GNU awk language. Let's wait for opinions from some real awk guru. Ghostdog, are you there?!

All the best,
Alex
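
One detail of the test $1 ~ "archivedevice" may be worth spelling out: ~ is an unanchored regex match, so it also accepts any keyword that merely contains that text. The sketch below compares it with an exact == comparison. The helper name keyword_tests and the keyword "Old Archive Device Path" are made up for illustration and are not claimed to be real Bacula directives; tolower() is used so that the sketch does not depend on gawk's IGNORECASE.

```shell
# Hypothetical helper: compare the unanchored ~ test with an exact == test.
keyword_tests() {
    awk -F= '
    {
        key = tolower($1)
        gsub(/[ \t]/, "", key)                            # squeeze out spaces and tabs
        loose = (key ~ "archivedevice") ? "yes" : "no"    # substring match
        exact = (key == "archivedevice") ? "yes" : "no"   # whole-keyword match
        print key ": ~ " loose ", == " exact
    }'
}

printf 'Archive Device = /dev/sr0\nOld Archive Device Path = /tmp\n' | keyword_tests
# archivedevice: ~ yes, == yes
# oldarchivedevicepath: ~ yes, == no
```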
 
Old 03-18-2010, 10:12 PM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
with GNU awk, you can use regex in FS
Code:
$ cat file
Device {
  Name = Floppy
  Media Type = Floppy
  Archive Device = /mnt/floppy
  Archive dEvIce           =                                    /dev/sr0
Archive dEvIce=/dev/sr10
      A r c h i VE device     =  /dev/cdrom
  RemovableMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = no;
}

$ awk 'BEGIN{FS="[ \t]*=[ \t]*";IGNORECASE=1}{gsub(" +","",$1)}$1~/archivedevice/{print $2}' file
/mnt/floppy
/dev/sr0
/dev/sr10
/dev/cdrom

Last edited by ghostdog74; 03-19-2010 at 01:08 AM.
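
A note on portability: a regex in FS is not specific to GNU awk, since POSIX awk treats any FS longer than one character as an extended regular expression. IGNORECASE is the gawk-only part. A sketch of the same idea using tolower() instead, with a hypothetical wrapper name (archive_devices) and an exact keyword comparison:

```shell
# Hypothetical helper: print the value of every "Archive Device" line.
# Uses only POSIX awk features (regex FS, tolower, gsub).
archive_devices() {
    awk '
    BEGIN { FS = "[ \t]*=[ \t]*" }          # "=" with optional blanks around it
    {
        key = tolower($1)
        gsub(/[ \t]+/, "", key)             # ignore blanks inside the keyword
    }
    key == "archivedevice" { print $2 }
    ' "$@"
}

printf '  Archive Device = /mnt/floppy\n\tA r c h i VE device\t=  /dev/cdrom\n  Name = Floppy\n' | archive_devices
# /mnt/floppy
# /dev/cdrom
```

This sketch does not strip comments or quotes; it only covers keyword matching and field splitting.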
 
Old 03-18-2010, 11:52 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
@ghostdog74: So I always like to see what solutions you come up with, but I have copied and pasted this one and it doesn't seem to work for me
Did anyone else have success with this one?
 
Old 03-19-2010, 12:27 AM   #5
ghostdog74
Quote:
Originally Posted by grail
@ghostdog74: So I always like to see what solutions you come up with, but I have copied and pasted this one and it doesn't seem to work for me
Did anyone else have success with this one?
did you use gawk? and i have revised a bit to take care of tabs as well.
 
Old 03-19-2010, 12:57 AM   #6
grail
My bad :$ I was using the last example as my test (A r c h i VE device), which of course failed as 'archive' was not all together.

So was Alex's reply of - print gensub(/^[ \t"#]+|[ \t"#]+$/,"","g",$2) - the best alternative to get rid of any extra guff?
 
Old 03-19-2010, 01:09 AM   #7
ghostdog74
Quote:
Originally Posted by grail
My bad :$ I was using the last example as my test (A r c h i VE device) which of course failed as 'archive'
was not all together.

So was Alex's reply of - print gensub(/^[ \t"#]+|[ \t"#]+$/,"","g",$2)
the best alternative to get rid of any extra guff?
i doubt you would want to name it like that. however still, you can remove all the spaces and check against one whole string. see my edit
 
Old 03-19-2010, 01:48 AM   #8
grail
Ok, so M2C:

Code:
awk 'BEGIN{FS="[ \t]*=[ \t]*";IGNORECASE=1}{gsub("[ \t]+|#.*$","",$0)}$1~/archivedevice/{print $2}' file
This also gets rid of trailing remarks and of lines that are entirely remarks.
I'd be happy to know whether using $0 is considered dangerous or wrong in this case.
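
On the $0 question: assigning to $0 (which gsub does here) is legal, and awk re-splits the fields afterwards. The risk is that the substitution applies to the whole record, so blanks inside a quoted value disappear too. A small sketch of that effect, using tolower() in place of IGNORECASE so it runs on any awk; the helper name squash_record and the path "/mnt/my disk" are made up for illustration:

```shell
# Hypothetical helper: the same pattern as the one-liner, with tolower()
# standing in for IGNORECASE.
squash_record() {
    awk '
    BEGIN { FS = "[ \t]*=[ \t]*" }
    { gsub("[ \t]+|#.*$", "", $0) }        # edits the whole record, value included
    tolower($1) ~ /archivedevice/ { print $2 }
    '
}

printf 'Archive Device = "/mnt/my disk"   # quoted path\n' | squash_record
# "/mnt/mydisk"     <- the blank inside the quoted value is gone
```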
 
Old 03-19-2010, 03:43 AM   #9
catkin
Original Poster
Thanks all for your help and interest.

I've created a test input and attached it (because it loses tabs when pasted). Here it is pasted for illustration:
Code:
# Test input for ArchiveDevice parsing programs

# Non-canonical values numbered for ease of output checking

Device {
  Name = Floppy
  Media Type = Floppy
  ArchiveDevice = /mnt/floppy   # Canonical
ArchivedEvIce=1./dev/sr0# Mixed case,no whitespace
     A   r c h i VE device       =      2./dev/cdrom    # Mixed case, aberrant whitespace
     A   r c h i VE device       =      "3./dev/cdrom"  # Mixed case, aberrant whitespace, quoted value
     A   r c h i VE device       =      "4.     /dev/cdrom   X"     # Mixed case, aberrant whitespace, quoted value-with-whitespace
     A   r c h i VE device       =      "5.     #/dev/cdrom  X"     # Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
  RemovableMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = no;
}
Case 5 is just for the challenge!

The expected output is
Code:
/mnt/floppy
1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4.     /dev/cdrom   X
5.     #/dev/cdrom  X
Tested with this input, none of the suggestions pass as many tests as the OP script (which passes all except test 5).

Regarding style, maintainability is more important than terseness (although it is technically challenging, educational and fun to aim for that minimal one-liner). Hence my use of variables "keyword" and "data"; not necessary but they help toward "self-documenting code".

Regarding performance, it is not an issue. This snippet is part of a bash script with projected run times from tens of minutes to an hour or so. Hence it was a good idea to drop my original match regex and work on every line.

EDIT:

Dropping variable "tab" in favour of using "\t" is sweet; it reduces clutter while maintaining legibility. Changing FS to "[ \t]*=[ \t]*" is also sweet, stripping trailing whitespace from the keyword and leading whitespace from the data. Perhaps it could be extended to strip any leading quote from the data: "[ \t]*=[ \t]*['\"]?"

The test file is incomplete. It does not include single quotes or the aberrant case of = in the data.
Attached Files
File Type: txt ArchiveDevice.test.in.txt (806 Bytes, 9 views)

Last edited by catkin; 03-19-2010 at 03:53 AM.
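
For the aberrant case of = in the data, one possible approach is to stop relying on $2 and instead split each line at the first = with index() and substr(). A minimal sketch under that assumption; the helper name value_after_first_equals and the path "/mnt/a=b" are hypothetical, and comments and quotes are deliberately left unhandled:

```shell
# Hypothetical helper: take everything after the FIRST "=" as the value,
# so later "=" characters stay part of the data.
value_after_first_equals() {
    awk '
    {
        pos = index($0, "=")                 # position of the first "=", 0 if none
        if (pos == 0) next
        key   = tolower(substr($0, 1, pos - 1))
        value = substr($0, pos + 1)          # the rest of the line
        gsub(/[ \t]/, "", key)
        if (key != "archivedevice") next
        sub(/^[ \t]+/, "", value)            # trim leading blanks
        sub(/[ \t]+$/, "", value)            # trim trailing blanks
        print value
    }'
}

printf 'Archive Device = /mnt/a=b\n' | value_after_first_equals
# /mnt/a=b
```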
 
Old 03-19-2010, 06:04 AM   #10
colucix
I forgot that a regexp can be used in FS... nice trick!

Charles, the configuration file you've attached for testing is really and totally "aberrant"! Who would write this weirdness to maintain his own backup facility?!?

Jokes aside, I'd like to discuss the 5th case. I would assume that any aberrant comment character inside the value (data) must at least be embedded in quotes, otherwise any interpreter would consider it a real comment and ignore the rest. Given this assumption, I cannot find any regular expression that matches only the comments outside quotes. For this reason, I'd protect the quoted hash from the regexp that removes trailing comments by substituting it with something else, then restoring it later. For example:
Code:
BEGIN {
    FS = "="
    IGNORECASE = 1
}

{
    #
    #  Remove space characters from the keyword
    #
    gsub(/[[:space:]]/,"",$1)
    
    #
    #  Protect hash inside the value
    #
    if ( $2 ~ /".*#.*"/ )
       sub(/#/,"PROTECT",$2)
       
    #
    #  If the keyword matches what we are looking for, print the value after
    #  removing:
    #  1. trailing comments
    #  2. leading and trailing spaces, tabs and double quotes.
    #  Also restore protected hash inside the value
    #
    if ( $1 ~ "archivedevice" ) {
       sub(/#.*$/,"",$2)
       sub(/PROTECT/,"#",$2)
       print gensub(/^[ \t"]+|[ \t"]+$/,"","g",$2), NR
    }
}
Please note that in my previous post, I forgot to remove entire comments from the end of the line. I've added it here.

Since I keep an eye on compatibility, another question is: what if we don't run this on GNU awk? In this case I'd "translate" the code to something more similar to your original one:
Code:
BEGIN {
    FS = "="
}

{
    #
    #  Remove space characters from the keyword
    #
    gsub(/[ \t]/,"",$1)
    
    #
    #  Protect hash inside the value
    #
    if ( $2 ~ /".*#.*"/ )
       sub(/#/,"PROTECT",$2)

    #
    #  If the keyword matches what we are looking for, print the value after
    #  removing:
    #  1. trailing comments
    #  2. leading and trailing spaces, tabs and double quotes.
    #  Also restore protected hash inside the value
    #
    if ( tolower($1) ~ "archivedevice" ) {
       sub(/#.*$/,"",$2)
       sub(/PROTECT/,"#",$2)
       sub(/^[ \t"]+/,"",$2)
       sub(/[ \t"]+$/,"",$2)
       print $2
    }
}
This works even in nawk on an old Solaris SPARC (tested). As you can see, it does not differ much from yours (except for the explicit assignment of "keyword" and "data"), hence my final comment is reduced to the matching regexp not being strictly necessary.

Cheers!

Last edited by colucix; 03-19-2010 at 06:06 AM.
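
As an alternative to the PROTECT substitution, comments outside quotes can be found without a regex by walking the line one character at a time and tracking whether a double quote is open. This is only a sketch of that idea, not a statement of how Bacula itself parses; the helper name strip_comment is hypothetical, and single quotes and backslash escapes are ignored:

```shell
# Hypothetical helper: cut each line at the first "#" that is not inside
# double quotes, then trim trailing blanks.
strip_comment() {
    awk '
    {
        out = ""
        in_quote = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"") {
                in_quote = !in_quote         # toggle on every double quote
            } else if (c == "#" && !in_quote) {
                break                        # an unquoted "#" starts the comment
            }
            out = out c
        }
        sub(/[ \t]+$/, "", out)
        print out
    }'
}

printf 'Archive Device = "5. #/dev/cdrom X"  # trailing comment\n' | strip_comment
# Archive Device = "5. #/dev/cdrom X"
```

The output still carries the keyword and the quotes, so it would be fed into the keyword matching and quote stripping already discussed.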
 
Old 03-19-2010, 08:18 AM   #11
grail
Well I haven't found a winner for 5 yet, but the addition of \" in the gsub got me the rest.
My query would be whether or not '4. /dev/cdrom X' would be correct?
I could be wrong but as we are talking devices I assume the X should be attached, ie /dev/cdromX
So that was my output:

/mnt/floppy
1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4./dev/cdromX
5.

Using:
Code:
awk 'BEGIN{FS="[ \t]*=[ \t]*";IGNORECASE=1}{gsub("[ \t\"]+|#.*$","",$0)}$1~/archivedevice/{print $2}' ArchiveDevice.test.in.txt
btw. I copied catkin's original code above and it only returned the first 2 correct entries for me (although I have discovered Ubuntu only has mawk, so I'm not sure if that causes issues, apart from not supporting IGNORECASE).
 
Old 03-19-2010, 08:35 AM   #12
colucix
Quote:
Originally Posted by grail
My query would be whether or not '4. /dev/cdrom X' would be correct?
Good catch. Indeed the attached test file is nonsense. I meant it as an exercise for testing code, but I doubt anyone would really put "#" and numbers in the device field.
 
Old 03-19-2010, 09:32 AM   #13
grail
Hey colucix, I tried to implement your PROTECT idea (I am still learning), but when the sub is finished, printing $0 shows that the fifth entry has had the "=" removed. I simplified it to ignore the final output and just print inside the if ... can someone tell me what I am doing wrong?

Code:
#!/usr/bin/awk -f

BEGIN {
	FS="[ \t]*=[ \t]*"
	IGNORECASE=1
}

{
	if($2 ~ /".*#.*"/){
		print $0
		sub(/#/,"PROTECT",$2)
		print $0
	}
}
Using the file from catkin I get the following:

Code:
	 A	 r c h i VE device	     =  	"5. 	#/dev/cdrom	 X" 	# Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
	 A	 r c h i VE device "5. 	PROTECT/dev/cdrom	 X" 	# Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
Notice the line with PROTECT (second print $0) now has no = and has lost some white space.
Any clues??
 
Old 03-19-2010, 09:56 AM   #14
colucix
Yes. The = sign is part of FS. When you alter a field using sub, the whole record is "re-built" using OFS (default is a space) as field separator. Hence all the content matching the input FS is lost.

For example:
Code:
$ echo one=two | awk -F= '{print $0}'
one=two
$ echo one=two | awk -F= '{print $1,$2}'
one two
$ echo one=two | awk -F= 'BEGIN{OFS = "="} {print $1,$2}'
one=two
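
The rebuild itself can also be shown directly, along with one way around it: run sub() on a copy of the field, which leaves $0 untouched. A small sketch; the helper name show_rebuild and the sample line are made up for illustration:

```shell
# Hypothetical helper: show that editing a copy leaves $0 alone, while
# editing the field itself rebuilds $0 with OFS (a single space by default).
show_rebuild() {
    awk '
    BEGIN { FS = "[ \t]*=[ \t]*" }
    {
        print "original: " $0
        copy = $2
        sub(/#/, "PROTECT", copy)            # changes only the copy
        print "after copy edit: " $0
        sub(/#/, "PROTECT", $2)              # changes $2, so $0 is rebuilt
        print "after field edit: " $0
    }'
}

printf 'device   =   "a#b"\n' | show_rebuild
# original: device   =   "a#b"
# after copy edit: device   =   "a#b"
# after field edit: device "aPROTECTb"
```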
 
Old 03-20-2010, 01:17 AM   #15
catkin
Original Poster
Quote:
Originally Posted by colucix
Charles, the configuration file you've attached for testing is really and totally "aberrant"! Who would write this weirdness to maintain his own backup facility?!?
Hello Alex

Extreme, I know, but if the input file is valid for Bacula, it should be valid for the script. There are also elements of "if you learn how to do it right once, it's easy to do it right afterwards" and of familiarising myself with the power of awk that you mentioned in your first post.

Actually it just got more extreme because Bacula accepts backslash escapes in quoted strings so there's a new test line:
Code:
ArchiveDevice = "6.   #/dev/c\drom  \\ \"  X"     # Quoted value with whitespace and # and backslash escapes
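
Backslash escapes can be handled by a character-by-character scan of the value. The sketch below assumes, without having checked Bacula's parser, that inside double quotes a backslash simply makes the next character literal. It expects only the value part of the line (the text after the first =), drops the quotes and any unquoted blanks, and stops at an unquoted #. The helper name unquote_value is hypothetical.

```shell
# Hypothetical helper: unquote a value, honouring backslash escapes inside
# double quotes (assumed rule: backslash makes the next character literal).
unquote_value() {
    awk '
    {
        out = ""
        in_quote = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (in_quote && c == "\\") {
                i++                              # skip the backslash ...
                out = out substr($0, i, 1)       # ... and keep the next character
            } else if (c == "\"") {
                in_quote = !in_quote             # drop the quote itself
            } else if (c == "#" && !in_quote) {
                break                            # unquoted "#" starts a comment
            } else if (in_quote || (c != " " && c != "\t")) {
                out = out c                      # keep blanks only inside quotes
            }
        }
        print out
    }'
}

printf '%s\n' '"6. #/dev/c\drom \\ \" X"   # comment' | unquote_value
# 6. #/dev/cdrom \ " X
```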
Tags
awk, bacula