[SOLVED] Small awk script: critiques sought

catkin · 03-18-2010, 11:47 AM

Hello

I'm not fluent in awk and would like to improve, so I'm seeking critiques of this effort: stylistic, functional, whatever ...

Code:

#!/usr/bin/awk -f

BEGIN {
    FS = "="
    tab = "\t"
}

/^[ \t]*[aA][a-zA-Z \t]*=/ {                # Match regex: each [ ] includes space and tab
    keyword = $1
    gsub( "[ "tab"'\"]*", "", keyword)      # Remove any spaces, tabs and quotes
    keyword = tolower( keyword )
    if ( keyword == "archivedevice" ) {
        data = $2
        sub( "#.*", "", data )              # Strip comment (ignoring possibility of # in a quoted value)
        sub( "^[ "tab"'\"]*", "", data )    # Strip leading spaces, tabs and quotes
        sub( "[ "tab"'\"]*$", "", data )    # Strip trailing spaces, tabs and quotes
        print data
    }
}

The purpose is to extract an "Archive Device" from a Bacula Storage Daemon configuration file. Bacula configuration files include lines of the format "keyword = value". Unusually, all spaces (and tabs?) and capitalisation in the keywords are ignored. In this case "Archive Device" could be validly represented as "ARCHIVE DEVICE", "ArchiveDevice" or even "A r c h i VE device". Values may be quoted. Spaces (and tabs?) around values are ignored except as separators. In this case there can be only one value. Comments can appear anywhere, introduced by a #.

The file is assumed to be syntactically correct, so there is no need to check for unbalanced quotes etc. I have chosen to ignore the possibility of a quoted value including a #.

The variable "tab" is used to improve legibility. Can it be used within the match regex, too?

Could the task have been done more elegantly using sed?

EDIT: corrected Storage Director to Storage Daemon.

Best

Charles

colucix · 03-18-2010, 12:46 PM

Hi Charles! Honestly, I see some redundancy in your code. For example, if you assign the first field to the variable "keyword" inside the action and check it with the "if" statement, the matching regular expression is not necessary (nor do I think it speeds up the script, or at least not by much).

Second, if you are using GNU awk you can enable case-insensitive matching by assigning a non-zero value to the built-in variable IGNORECASE. This spares the need for the tolower statement and for doubled character lists such as [aA].

Finally, you can try to use a single regular expression to match the leading (and trailing) spaces, tabs, quotes and hashes.

Here is what I would try:

Code:

BEGIN {
    FS = "="
    IGNORECASE = 1
}

{
    #
    #  Remove space characters from the keyword
    #
    gsub(/[[:space:]]/,"",$1)
  
    #
    #  If the keyword matches what we are looking for, print the value after
    #  removing leading and trailing spaces, tabs, double quotes and hashes
    #
    if ( $1 ~ "archivedevice" )
       print gensub(/^[ \t"#]+|[ \t"#]+$/,"","g",$2) 
}

Indeed, this does not meet the requirement for readability; it is just to give you an idea of the power of the GNU awk language. Let's wait for opinions from some real awk guru. Ghostdog, are you there?!

All the best,
Alex

ghostdog74 · 03-18-2010, 10:12 PM

you can use a regex in FS (the IGNORECASE part below needs GNU awk)

Code:

$ cat file
Device {
  Name = Floppy
  Media Type = Floppy
  Archive Device = /mnt/floppy
  Archive dEvIce           =                                    /dev/sr0
Archive dEvIce=/dev/sr10
      A r c h i VE device     =  /dev/cdrom
  RemovableMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = no;
}

$ awk 'BEGIN{FS="[ \t]*=[ \t]*";IGNORECASE=1}{gsub(" +","",$1)}$1~/archivedevice/{print $2}' file
/mnt/floppy
/dev/sr0
/dev/sr10
/dev/cdrom

grail · 03-18-2010, 11:52 PM

@ghostdog74: So I always like to see what solutions you come up with, but I have copied and pasted this one and it doesn't seem to work for me

Did anyone else have success with this one?

ghostdog74 · 03-19-2010, 12:27 AM

did you use gawk? and i have revised a bit to take care of tabs as well.

grail · 03-19-2010, 12:57 AM

My bad :$ I was using the last example as my test (A r c h i VE device), which of course failed as 'archive' was not all together.

So was Alex's reply of - print gensub(/^[ \t"#]+|[ \t"#]+$/,"","g",$2) - the best alternative to get rid of any extra guff?

ghostdog74 · 03-19-2010, 01:09 AM

i doubt you would want to name it like that. however still, you can remove all the spaces and check against one whole string. see my edit

grail · 03-19-2010, 01:48 AM

Ok, so M2C:

Code:

awk 'BEGIN{FS="[ \t]*=[ \t]*";IGNORECASE=1}{gsub("[ \t]+|#.*$","",$0)}$1~/archivedevice/{print $2}' file

This also gets rid of trailing comments and of lines that are entirely comments.
I'd be happy to know whether modifying $0 is considered dangerous or wrong in this case.

catkin · 03-19-2010, 03:43 AM

Thanks all for your help and interest

I've created a test input and attached it (because it loses tabs when pasted). Here it is pasted for illustration:

Code:

# Test input for ArchiveDevice parsing programs

# Non-canonical values numbered for ease of output checking

Device {
  Name = Floppy
  Media Type = Floppy
  ArchiveDevice = /mnt/floppy   # Canonical
ArchivedEvIce=1./dev/sr0# Mixed case,no whitespace
     A   r c h i VE device       =      2./dev/cdrom    # Mixed case, aberrant whitespace
     A   r c h i VE device       =      "3./dev/cdrom"  # Mixed case, aberrant whitespace, quoted value
     A   r c h i VE device       =      "4.     /dev/cdrom   X"     # Mixed case, aberrant whitespace, quoted value-with-whitespace
     A   r c h i VE device       =      "5.     #/dev/cdrom  X"     # Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
  RemovableMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = no;
}

Case 5 is just for the challenge!

The expected output is

Code:

/mnt/floppy
1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4.     /dev/cdrom   X
5.     #/dev/cdrom  X

Tested with this input, none of the suggestions pass as many tests as the OP script (which passes all except test 5).

Regarding style, maintainability is more important than terseness (although it is technically challenging, educational and fun to aim for that minimal one-liner). Hence my use of the variables "keyword" and "data"; they are not necessary but they help toward "self-documenting code".

Regarding performance, it is not an issue. This snippet is part of a bash script with projected run times from tens of minutes to an hour or so. Hence it was a good idea to drop my original match regex and work on every line.

EDIT:

Dropping variable "tab" in favour of using "\t" is sweet; it reduces clutter while maintaining legibility. Changing FS to "[ \t]*=[ \t]*" is also sweet, stripping trailing whitespace from the keyword and leading whitespace from the data. Perhaps it could be extended to strip any leading quote from the data "[ \t]*=[ \t]*['\"]?"
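
A quick way to try the extended FS idea is sketched here. It is a minimal sketch, not taken from the thread: the function name split_with_quote_fs is made up for illustration, and \047 is the octal escape for a single quote, used because the awk program itself sits inside shell single quotes.

```shell
# Hypothetical demo: FS also swallows one optional quote after the "=".
split_with_quote_fs() {
    awk 'BEGIN { FS = "[ \t]*=[ \t]*[\047\"]?" }
         { print "[" $2 "]" }'
}

printf '%s\n' 'Archive Device = "/dev/sr0"   # comment' | split_with_quote_fs
# prints: [/dev/sr0"   # comment]
```

The leading quote is gone, but the closing quote and the comment remain, so the value still needs trimming afterwards.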

The test file is incomplete. It does not include single quotes or the aberrant case of = in the data.
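
One possible way to cover both gaps is to split each line at the first = only, using index() and substr() instead of FS. This is a sketch under assumptions rather than a tested solution: the function name archive_device_first_eq is hypothetical, single quotes are assumed to delimit a value just as double quotes do, and a quoted value is assumed to end at the next matching quote (no escapes).

```shell
# Hypothetical helper (not from the thread): print the Archive Device value,
# splitting each line at the first "=" only.
archive_device_first_eq() {
    awk -v sq="'" '
    {
        eq = index($0, "=")                  # first "=" in the line, 0 if none
        if (eq == 0) next

        keyword = tolower(substr($0, 1, eq - 1))
        gsub(/[ \t]/, "", keyword)           # spaces and tabs are ignored
        if (keyword != "archivedevice") next

        data = substr($0, eq + 1)            # everything after the first "="
        sub(/^[ \t]+/, "", data)

        quote = substr(data, 1, 1)
        if (quote == "\"" || quote == sq) {
            # Quoted: keep the text up to the matching closing quote
            data = substr(data, 2)
            close_at = index(data, quote)
            if (close_at > 0) data = substr(data, 1, close_at - 1)
        } else {
            # Unquoted: drop any comment, then trailing whitespace
            sub(/#.*/, "", data)
            sub(/[ \t]+$/, "", data)
        }
        print data
    }'
}

printf '%s\n' "Archive Device = 'a=b'  # single quotes, = in data" \
              'ArchiveDevice = /dev/x=y # unquoted' | archive_device_first_eq
# prints:
# a=b
# /dev/x=y
```

Because the quoted branch stops at the closing quote, a # inside quotes survives as well.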

colucix · 03-19-2010, 06:04 AM

I forgot that a regexp can be used in FS... nice trick!

Charles, the configuration file you've attached for testing is really and totally "aberrant"! Who would write this weirdness to maintain his own backup facility?!?

Joking apart, I'd like to discuss the 5th case. I would assume that any aberrant comment character inside the value (data) must at least be embedded in quotes, otherwise any interpreter would treat it as a real comment and ignore the rest. Given this assumption, I cannot find any regular expression that matches only the comments outside quotes. For this reason, I'd protect the quoted hash from the regexp that removes trailing comments by substituting it with something else, then restoring it later. For example:

Code:

BEGIN {
    FS = "="
    IGNORECASE = 1
}

{
    #
    #  Remove space characters from the keyword
    #
    gsub(/[[:space:]]/,"",$1)
    
    #
    #  Protect hash inside the value
    #
    if ( $2 ~ /".*#.*"/ )
       sub(/#/,"PROTECT",$2)
       
    #
    #  If the keyword matches what we are looking for, print the value after
    #  removing:
    #  1. trailing comments
    #  2. leading and trailing spaces, tabs and double quotes.
    #  Also restore protected hash inside the value
    #
    if ( $1 ~ "archivedevice" ) {
       sub(/#.*$/,"",$2)
       sub(/PROTECT/,"#",$2)
       print gensub(/^[ \t"]+|[ \t"]+$/,"","g",$2)
    }
}

Please note that in my previous post, I forgot to remove entire comments from the end of the line. I've added it here.

Since I keep an eye on compatibility, another question is: what if we don't run this on GNU awk? In this case I'd "translate" the code to something more similar to your original one:

Code:

BEGIN {
    FS = "="
}

{
    #
    #  Remove space characters from the keyword
    #
    gsub(/[ \t]/,"",$1)
    
    #
    #  Protect hash inside the value
    #
    if ( $2 ~ /".*#.*"/ )
       sub(/#/,"PROTECT",$2)

    #
    #  If the keyword matches what we are looking for, print the value after
    #  removing:
    #  1. trailing comments
    #  2. leading and trailing spaces, tabs and double quotes.
    #  Also restore protected hash inside the value
    #
    if ( tolower($1) ~ "archivedevice" ) {
       sub(/#.*$/,"",$2)
       sub(/PROTECT/,"#",$2)
       sub(/^[ \t"]+/,"",$2)
       sub(/[ \t"]+$/,"",$2)
       print $2
    }
}

This works even in nawk on an old Solaris SPARC (tested). As you can see, it does not differ much from yours (except for the explicit assignment of "keyword" and "data"), hence my final comment is reduced to the matching regexp not being strictly necessary.

Cheers!

grail · 03-19-2010, 08:18 AM

Well, I haven't found a winner for 5 yet, but the addition of \" in the gsub got me the rest.
My query would be whether or not '4. /dev/cdrom X' would be correct?
I could be wrong, but as we are talking devices I assume the X should be attached, i.e. /dev/cdromX.
So this was my output:

/mnt/floppy
1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4./dev/cdromX
5.

Using:

Code:

awk 'BEGIN{FS="[ \t]*=[ \t]*";IGNORECASE=1}{gsub("[ \t\"]+|#.*$","",$0)}$1~/archivedevice/{print $2}' ArchiveDevice.test.in.txt

btw. I copied catkin's original code above and it only returned the first 2 correct entries for me (although I have discovered Ubuntu only has mawk, so I'm not sure if that causes issues, apart from not supporting IGNORECASE).

colucix · 03-19-2010, 08:35 AM

Quote:

Originally Posted by grail

My query would be whether or not '4. /dev/cdrom X' would be correct?

Good catch. Indeed the attached test file is nonsense. I took it as an exercise for testing code, but I doubt anyone would really put "#" and numbers in the device field.

grail · 03-19-2010, 09:32 AM

Hey colucix, I tried to implement your PROTECT idea (I am still learning), but when the sub is finished, if I print $0 it shows me that the fifth entry has now had the "=" removed. I simplified it to not worry about the final output and just print in the if ... can someone tell me what I am doing wrong?

Code:

#!/usr/bin/awk -f

BEGIN {
	FS="[ \t]*=[ \t]*"
	IGNORECASE=1
}

{
	if($2 ~ /".*#.*"/){
		print $0
		sub(/#/,"PROTECT",$2)
		print $0
	}
}

Using the file from catkin I get the following:

Code:

	 A	 r c h i VE device	     =  	"5. 	#/dev/cdrom	 X" 	# Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
	 A	 r c h i VE device "5. 	PROTECT/dev/cdrom	 X" 	# Mixed case, aberrant whitespace, quoted value-with-whitespace-and#

Notice the line with PROTECT (second print $0) now has no = and has lost some white space.
Any clues??

colucix · 03-19-2010, 09:56 AM

Yes. The = sign is part of FS. When you alter a field using sub, the whole record is "re-built" using OFS (default is a space) as field separator. Hence all the content matching the input FS is lost.

For example:

Code:

$ echo one=two | awk -F= '{print $0}'
one=two
$ echo one=two | awk -F= '{print $1,$2}'
one two
$ echo one=two | awk -F= 'BEGIN{OFS = "="} {print $1,$2}'
one=two
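
The rebuild itself can be seen by editing a field and then printing $0. This is a minimal sketch with a made-up function name, field_edit_demo; it also shows that working on a copy of the field, as the "keyword" and "data" variables in the first post do, leaves $0 alone.

```shell
# Hypothetical demo: editing a copy keeps $0, editing the field rebuilds it.
field_edit_demo() {
    echo 'one = two' | awk 'BEGIN { FS = "[ \t]*=[ \t]*" }
    {
        value = $2                  # work on a copy: $0 stays untouched
        sub(/two/, "TWO", value)
        print $0                    # one = two

        sub(/two/, "TWO", $2)       # editing the field itself rebuilds $0
        print $0                    # one TWO (fields rejoined with OFS)
    }'
}

field_edit_demo
```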

catkin · 03-20-2010, 01:17 AM

Quote:

Originally Posted by colucix

Charles, the configuration file you've attached for testing is really and totally "aberrant"! Who would write this weirdness to maintain his own backup facility?!?

Hello Alex

Extreme, I know, but if the input file is valid for Bacula, it should be valid for the script. There are also elements of "if you learn how to do it right once, it's easy to do it right afterwards" and of familiarising myself with the power of awk you mentioned in your first post.

Actually it just got more extreme because Bacula accepts backslash escapes in quoted strings so there's a new test line:

Code:

ArchiveDevice = "6.   #/dev/c\drom  \\ \"  X"     # Quoted value with whitespace and # and backslash escapes
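
A line like this is hard to handle with sub() alone, so one option is to scan the value one character at a time. The following is a sketch under stated assumptions, not Bacula's actual parser: the function name archive_device_scan is hypothetical, only double quotes are treated as quoting, and a backslash inside quotes is assumed to make the next character literal.

```shell
# Hypothetical helper: character-by-character scan of the value.
# Assumption: inside double quotes, a backslash makes the next char literal.
archive_device_scan() {
    awk '
    {
        eq = index($0, "=")
        if (eq == 0) next
        keyword = tolower(substr($0, 1, eq - 1))
        gsub(/[ \t]/, "", keyword)
        if (keyword != "archivedevice") next

        rest = substr($0, eq + 1)
        sub(/^[ \t]+/, "", rest)

        value = ""; keep = 0; in_quotes = 0
        n = length(rest)
        for (i = 1; i <= n; i++) {
            c = substr(rest, i, 1)
            if (in_quotes && c == "\\" && i < n) {
                c = substr(rest, ++i, 1)     # escaped: next char is literal
            } else if (c == "\"") {
                in_quotes = !in_quotes       # toggle state, drop the quote
                continue
            } else if (c == "#" && !in_quotes) {
                break                        # comment outside quotes ends it
            }
            value = value c
            # Remember the last position that is quoted or not whitespace
            if (in_quotes || c !~ /[ \t]/) keep = length(value)
        }
        print substr(value, 1, keep)         # trims unquoted trailing blanks
    }'
}

printf '%s\n' 'ArchiveDevice = "6.   #/dev/c\drom  \\ \"  X"     # Quoted value' |
    archive_device_scan
# prints: 6.   #/dev/cdrom  \ "  X
```

Whether that output is what Bacula itself would store depends on its real escape rules, which are assumed here rather than checked.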