Programming
This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
I'm not fluent in awk and would like to improve, so I seek critiques of this effort: stylistic, functional, whatever ...
Code:
#!/usr/bin/awk -f
BEGIN {
    FS = "="
    tab = "\t"
}
/^[ \t]*[aA][a-zA-Z \t]*=/ {              # Match candidate keyword lines first
    keyword = $1
    gsub( "[ "tab"'\"]*", "", keyword )   # Remove spaces, tabs and quotes
    keyword = tolower( keyword )
    if ( keyword == "archivedevice" ) {
        data = $2
        sub( "#.*", "", data )            # Strip comment (ignoring possibility of # in a quoted value)
        sub( "^[ "tab"'\"]*", "", data )  # Strip leading spaces, tabs and quotes
        sub( "[ "tab"'\"]*$", "", data )  # Strip trailing spaces, tabs and quotes
        print data
    }
}
The purpose is to extract an "Archive Device" from a Bacula Storage Daemon configuration file. Bacula configuration files include lines of the format "keyword = value". Exotically, all spaces (and tabs?) and capitalisation in the keywords are ignored. In this case "Archive Device" could be validly represented as "ARCHIVE DEVICE", ArchiveDevice or even "A r c h i VE device". Values may be quoted. Spaces (and tabs?) around values are ignored except as separators. In this case there can be only one value. Comments can appear anywhere, introduced by a #.
The file is assumed syntactically correct, so there is no need to check for unbalanced quotes etc. I have chosen to ignore the possibility of a quoted value including a #.
The variable "tab" is used to improve legibility. Can it be used within the match regex, too?
Could the task have been done more elegantly using sed?
EDIT: corrected Storage Director to Storage Daemon.
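On the sed question: it can be done, though arguably less elegantly. Here is a sketch using GNU sed (the I case-insensitivity flag is a GNU extension, and the file name bacula-sd.conf is hypothetical). It keeps a copy of the line in hold space, squeezes blanks out of a working copy to test the keyword, then restores the original line before stripping. Like the awk version, it fails on a quoted value containing #.

```shell
# Sketch only: hold-space copy, blank-squeezed keyword test (GNU "I"
# flag), then restore and strip comment, quotes and trailing blanks.
sed -nE '
    h
    s/[[:blank:]]//g
    /^archivedevice=/I {
        g
        s/^[^=]*=[[:blank:]]*"?//
        s/[[:blank:]]*#.*$//
        s/["[:blank:]]*$//
        p
    }' bacula-sd.conf
```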
Hi Charles! Honestly, I see some redundancy in your code. For example, if you assign the first field to the variable "keyword" inside the action and check it with the "if" statement, the matching regular expression is not necessary (nor do I think it improves the speed of the script, or at least not by much).
Second, if you are using GNU awk you can enable case-insensitivity by assigning a non-zero value to the built-in variable IGNORECASE. This spares the need for the tolower statement and the doubled character lists, such as [aA].
Finally, you can use a single regular expression to match the leading (and trailing) spaces, tabs, quotes and hashes.
Here is what I would try:
Code:
BEGIN {
    FS = "="
    IGNORECASE = 1
}
{
    #
    # Remove space characters from the keyword
    #
    gsub(/[[:space:]]/, "", $1)
    #
    # If the keyword matches what we are looking for, print the value after
    # removing leading and trailing spaces, tabs, double quotes and hashes
    #
    if ( $1 ~ "archivedevice" )
        print gensub(/^[ \t"#]+|[ \t"#]+$/, "", "g", $2)
}
Indeed, this does not meet the requirement for readability, but it should give you an idea of the power of the GNU awk language. Anyway, let's wait for opinions from some real awk guru. Ghostdog, are you there?!
@ghostdog74: So I always like to see what solutions you come up with, but I have copied and pasted this one and it doesn't seem to work for me
Did anyone else have success with this one?
Did you use gawk? I have revised it a bit to take care of tabs as well.
I've created a test input and attached it (because it loses tabs when pasted). Here it is pasted for illustration
Code:
# Test input for ArchiveDevice parsing programs
# Non-canonical values numbered for ease of output checking
Device {
    Name = Floppy
    Media Type = Floppy
    ArchiveDevice = /mnt/floppy # Canonical
    ArchivedEvIce=1./dev/sr0# Mixed case, no whitespace
    A r c h i VE device = 2./dev/cdrom # Mixed case, aberrant whitespace
    A r c h i VE device = "3./dev/cdrom" # Mixed case, aberrant whitespace, quoted value
    A r c h i VE device = "4. /dev/cdrom X" # Mixed case, aberrant whitespace, quoted value-with-whitespace
    A r c h i VE device = "5. #/dev/cdrom X" # Mixed case, aberrant whitespace, quoted value-with-whitespace-and-#
    RemovableMedia = yes;
    Random Access = Yes;
    AutomaticMount = yes; # when device opened, read it
    AlwaysOpen = no;
}
Case 5 is just for the challenge!
The expected output is
Code:
/mnt/floppy
1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4. /dev/cdrom X
5. #/dev/cdrom X
Tested with this input, none of the suggestions pass as many tests as the OP script (which passes all except test 5).
Regarding style, maintainability is more important than terseness (although it is technically challenging, educational and fun to aim for that minimal one-liner). Hence my use of the variables "keyword" and "data": not necessary, but they help toward "self-documenting code".
Regarding performance, it is not an issue. This snippet is part of a bash script with projected run times from tens of minutes to an hour or so. Hence it was a good idea to drop my original match regex and work on every line.
EDIT:
Dropping variable "tab" in favour of using "\t" is sweet; it reduces clutter while maintaining legibility. Changing FS to "[ \t]*=[ \t]*" is also sweet, stripping trailing whitespace from the keyword and leading whitespace from the data. Perhaps it could be extended to strip any leading quote from the data "[ \t]*=[ \t]*['\"]?"
The test file is incomplete. It does not include single quotes or the aberrant case of = in the data.
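A minimal sketch of that FS-as-regex idea (the file name bacula-sd.conf is hypothetical; plain POSIX awk, so it keeps tolower rather than IGNORECASE, and it only copes with double-quoted values):

```shell
# FS swallows the "=", the whitespace around it and an optional opening
# double quote in one go, so the keyword and value need less cleanup.
awk -F '[ \t]*=[ \t]*"?' '
    {
        key = tolower($1)
        gsub(/[ \t]/, "", key)         # squeeze the aberrant keyword spacing
        if (key == "archivedevice") {
            data = $2
            sub(/#.*/, "", data)       # strip any trailing comment
            sub(/[ \t"]*$/, "", data)  # strip trailing whitespace and quote
            print data
        }
    }' bacula-sd.conf
```

As with the original, a quoted value containing # (test 5) still fails.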
I forgot about the possibility of using a regexp in FS... nice trick!
Charles, the configuration file you've attached for testing is really and totally "aberrant"! Who would write this weirdness to maintain his own backup facility?!?
Joking apart, I'd like to discuss the 5th case. I would assume that any aberrant hash inside the value (data) must at least be embedded in quotes, otherwise any interpreter would treat it as a real comment and ignore the rest of the line. Given this assumption, I cannot find a regular expression that matches only the comments outside quotes. For this reason, I'd protect the hash from the regexp that removes trailing comments by substituting it with something else, then restoring it later. For example:
Code:
BEGIN {
    FS = "="
    IGNORECASE = 1
}
{
    #
    # Remove space characters from the keyword
    #
    gsub(/[[:space:]]/, "", $1)
    #
    # Protect hash inside the value
    #
    if ( $2 ~ /".*#.*"/ )
        sub(/#/, "PROTECT", $2)
    #
    # If the keyword matches what we are looking for, print the value after
    # removing:
    # 1. trailing comments
    # 2. leading and trailing spaces, tabs and double quotes.
    # Also restore protected hash inside the value
    #
    if ( $1 ~ "archivedevice" ) {
        sub(/#.*$/, "", $2)
        sub(/PROTECT/, "#", $2)
        print gensub(/^[ \t"]+|[ \t"]+$/, "", "g", $2)
    }
}
Please note that in my previous post I forgot to remove entire comments from the end of the line. I've added that here.
Since I keep an eye on compatibility, another consideration is: what if we don't run this on GNU awk? In that case I'd "translate" the code into something closer to your original:
Code:
BEGIN {
    FS = "="
}
{
    #
    # Remove space characters from the keyword
    #
    gsub(/[ \t]/, "", $1)
    #
    # Protect hash inside the value
    #
    if ( $2 ~ /".*#.*"/ )
        sub(/#/, "PROTECT", $2)
    #
    # If the keyword matches what we are looking for, print the value after
    # removing:
    # 1. trailing comments
    # 2. leading and trailing spaces, tabs and double quotes.
    # Also restore protected hash inside the value
    #
    if ( tolower($1) ~ "archivedevice" ) {
        sub(/#.*$/, "", $2)
        sub(/PROTECT/, "#", $2)
        sub(/^[ \t"]+/, "", $2)
        sub(/[ \t"]+$/, "", $2)
        print $2
    }
}
This works even in nawk on an old Solaris SPARC (tested). As you can see, it does not differ much from yours (except for the explicit assignment of "keyword" and "data"), hence my final comment reduces to the not strictly necessary matching regexp.
Well I haven't found a winner for 5 yet, but the addition of \" in the gsub got me the rest.
My query would be whether or not '4. /dev/cdrom X' would be correct?
I could be wrong but as we are talking devices I assume the X should be attached, ie /dev/cdromX
So that was my output:
btw. I copied catkin's original code above and it only returned the first two correct entries for me (although I have discovered Ubuntu only has mawk, so I'm not sure if that causes issues, apart from not supporting IGNORECASE).
My query would be whether or not '4. /dev/cdrom X' would be correct?
Good catch. Indeed the attached test file is nonsense. I meant it as an exercise for testing code; I doubt anyone would really put "#" and numbers in the device field.
Hey colucix, I tried to implement your PROTECT idea (I am still learning), but after the sub has run, printing $0 shows that the fifth entry has had the "=" removed. I simplified it to not worry about the final output and just print inside the if ... can someone tell me what I am doing wrong?
A r c h i VE device = "5. #/dev/cdrom X" # Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
A r c h i VE device "5. PROTECT/dev/cdrom X" # Mixed case, aberrant whitespace, quoted value-with-whitespace-and#
Notice the line with PROTECT (second print $0) now has no = and has lost some white space.
Any clues??
Yes. The = sign is part of FS. When you alter a field using sub(), the whole record is rebuilt using OFS (default: a single space) as the field separator. Hence any content that matched the input FS is lost.
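A minimal demonstration of this rebuild effect (any awk):

```shell
# sub() on $2 is a field assignment, so awk reassembles $0 by joining
# the fields with OFS (a single space by default); the "=" that FS
# matched is not part of any field, so it vanishes from $0.
echo 'a = b#c' | awk -F '=' '{ sub(/#/, "PROTECT", $2); print $0 }'
# prints "a   bPROTECTc" -- no "=" left
```

The usual workaround is to copy the field first (data = $2; sub(/#/, "PROTECT", data)) so $0 is never rebuilt.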
Charles, the configuration file you've attached for testing is really and totally "aberrant"! Who would write this weirdness to maintain his own backup facility?!?
Hello Alex
Extreme, I know, but if the input file is valid for Bacula, it should be valid for the script. There are also elements of "if you learn how to do it right once, it's easy to do it right afterwards" and of familiarising myself with the power of awk that you mentioned in your first post.
Actually it just got more extreme because Bacula accepts backslash escapes in quoted strings so there's a new test line:
Code:
ArchiveDevice = "6. #/dev/c\drom \\ \" X" # Quoted value with whitespace and # and backslash escapes
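A hedged sketch for collapsing such backslash escapes, using only POSIX awk functions (match/substr, so it should also run on mawk and nawk); the quote and comment stripping is left out to keep it short:

```shell
# Each "\X" pair is collapsed to the bare "X". Scanning resumes after
# the replaced character, so a literal "\" we just produced from "\\"
# is not re-processed as the start of another escape.
printf '%s\n' '6. #/dev/c\drom \\ \" X' |
awk '{
    s = $0
    out = ""
    while (match(s, /\\./)) {
        out = out substr(s, 1, RSTART - 1) substr(s, RSTART + 1, 1)
        s = substr(s, RSTART + 2)
    }
    print out s
}'
# prints: 6. #/dev/cdrom \ " X
```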