[SOLVED] Small awk script: critiques sought

catkin · 03-20-2010, 01:21 AM

Quote:

Originally Posted by grail

btw. I copied catkin's original code above and it only returned the first 2 correct entries for me

That was probably because the quoted character in tab = " " became space when I copy-pasted it into the OP; it was a tab in the original. A better technique (more legible) would have been for me to use tab = "\t".
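
A minimal sketch of the "\t" technique follows. How tab was used in the original script is not shown in this thread, so the helper name count_tab_fields and the sample input are assumptions made purely for illustration. The point is only that "\t" survives copy-pasting where a literal tab may not.

```shell
# Hypothetical helper: count the tab-separated parts of each input line.
count_tab_fields() {
    awk 'BEGIN { tab = "\t" }                  # escape sequence, not a literal tab
         { print split( $0, parts, tab ) }'
}

printf 'Archive Device\t=\t/dev/sr0\n' | count_tab_fields    # prints 3
```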

catkin · 03-20-2010, 01:29 AM

Thanks again for all your interest and suggestions

Having moved the goal posts by requiring support for backslash escapes in quoted strings (in accordance with Bacula's usage), here's my current version. It incorporates many suggestions from this thread but can probably still be improved. For maintainability, it uses an extra { }, to allow room for the "# Backslash escape" comment.

Code:

#!/usr/bin/awk -f

BEGIN {
    FS = "[ \t]*=[ \t]*"
    IGNORECASE = 1
}

{
    gsub( /[ \t]*/, "", $1 )                # Remove any spaces and tabs from keyword
    if ( $1 == "ArchiveDevice" ) {
        if ( substr( $2, 1, 1 ) == "\"" )
        {                                   # Value is a quoted string
            value = ""
            $2 = substr ( $2, 2, match( $2, /[^\\]\"/ ) - 1 )
            for ( i = 1; i <= length( $2 ); i++ )
            {
                char = substr( $2, i, 1 )
                if ( char != "\\" ) value = value char
                else
                {                           # Backslash escape
                    if ( substr( $2, i + 1, 1 ) == "\\" )
                    {                       # Escaped \ so keep one
                        value = value "\\"
                        i++
                    }
                }
            }
        }
        else
        {                                   # Value is unquoted
            sub( /[ \t#].*$/, "", $2 )      # Strip from the first space, tab or # to end of line
            value = $2
        }
        print value
    }
}
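
The backslash-escape loop above can be tried on its own. The following is only a sketch: the wrapper name unescape is hypothetical, and it assumes the text between the quotes has already been extracted and arrives one value per line on stdin.

```shell
# Hypothetical helper: apply the same escape rules as the loop above.
#   \\         -> one literal backslash
#   \<other>   -> the backslash is dropped, <other> is kept
unescape() {
    awk '{
        value = ""
        for ( i = 1; i <= length( $0 ); i++ ) {
            char = substr( $0, i, 1 )
            if ( char != "\\" ) value = value char
            else if ( substr( $0, i + 1, 1 ) == "\\" ) {
                value = value "\\"             # Escaped \ so keep one
                i++
            }
        }
        print value
    }'
}

printf '%s\n' '#/dev/cdrom  \\ \"  X' | unescape    # prints #/dev/cdrom  \ "  X
```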

grail · 03-20-2010, 02:00 AM

Well I am still 6 for 6 with adding backslash support (assuming from 4 onwards you accept eg 4./dev/cdromX)

Code:

#!/usr/bin/awk -f

BEGIN {
    FS="[ \t]*=[ \t]*"
    OFS="="
    IGNORECASE=1
}

{
    if($2 ~ /".*#.*"/)
        sub(/#/,"",$2)

    gsub(/[ \t\"\\]+|#.*$/,"")

    if($1~/archivedevice/)
        print $2
}

catkin · 03-20-2010, 02:55 AM

Quote:

Originally Posted by grail

Well I am still 6 for 6 with adding backslash support (assuming from 4 onwards you accept eg 4./dev/cdromX)

Code:

#!/usr/bin/awk -f

BEGIN {
    FS="[ \t]*=[ \t]*"
    OFS="="
    IGNORECASE=1
}

{
    if($2 ~ /".*#.*"/)
        sub(/#/,"",$2)

    gsub(/[ \t\"\\]+|#.*$/,"")

    if($1~/archivedevice/)
        print $2
}

Thanks grail

I tried it but got

Code:

1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4./dev/cdromX
5./dev/cdromX
6./dev/cdromX

Latest test input file attached FYI.

EDIT: expected test output is

Code:

/mnt/floppy
1./dev/sr0
2./dev/cdrom
3./dev/cdrom
4.      /dev/cdrom       X
5.      #/dev/cdrom      X
6.      #/dev/cdrom  \ "         X

grail · 03-20-2010, 04:44 AM

hmmm ... I realise we should be looking for your solution but I guess I am curious as to whether or not the output
you desire is of any use? (not trying to be difficult by the way, just trying to understand)

catkin · 03-20-2010, 05:05 AM

Quote:

Originally Posted by grail

hmmm ... I realise we should be looking for your solution but I guess I am curious as to whether or not the output
you desire is of any use? (not trying to be difficult by the way, just trying to understand)

Good question. No -- it does not have any practical application beyond testing this script. Nobody in their right mind would call a device or a mount point '/dev/cdrom \ " X' but it would be a valid name so they could. Any character except / (the path component separator) and NUL can be used in the name of a Linux file, including backspace and newline. Test data that pushes to extremes highly improbable in real life tests the robustness of the code.

Some people advise that spaces should never be used in file names, and they were relatively rare until GUI file managers made it easy to choose names to suit the user. Names including quotes and other exotica are not rare, for example "Brian's CV". Even in a command-line-only environment, very bizarre file names can be created, such as a single backspace (hence the HOWTOs about deleting such files). Systems software, including backup software, must handle such file names without issue.

grail · 03-20-2010, 06:30 AM

Rightio ... I see where you are coming from and probably need the ghostdog ledge to get the escaping of \ to show up but
the below has your format you requested for all 6:

Code:

#!/usr/bin/awk -f

BEGIN {
    FS="[ \t]*=[ \t]*"
    OFS="="
    IGNORECASE=1
    f = 1
    g = 0
}

f && match($2, /".*"/){
    keep = substr($2, 2, (RLENGTH - 2))
    gsub(/\\+/, "", keep)
    print keep
    f = 0
    g = 1
}

f{
    gsub(/[ \t\"\\]+|#.*$/,"")

    if( $1 ~ /archivedevice/ )
        print $2

}

g{
    f = 1
    g = 0
}

Probably some redundancy that someone can tell me about too

catkin · 03-20-2010, 07:35 AM

Quote:

Originally Posted by grail

Rightio ... I see where you are coming from and probably need the ghostdog ledge to get the escaping of \ to show up but
the below has your format you requested for all 6:

Code:

#!/usr/bin/awk -f

BEGIN {
    FS="[ \t]*=[ \t]*"
    OFS="="
    IGNORECASE=1
    f = 1
    g = 0
}

f && match($2, /".*"/){
    keep = substr($2, 2, (RLENGTH - 2))
    gsub(/\\+/, "", keep)
    print keep
    f = 0
    g = 1
}

f{
    gsub(/[ \t\"\\]+|#.*$/,"")

    if( $1 ~ /archivedevice/ )
        print $2

}

g{
    f = 1
    g = 0
}

Probably some redundancy that someone can tell me about too

That works except for the missing \ you already know about.

There's some redundancy in match($2, /".*"/). We know the input is syntactically valid for Bacula so, if the value after the = and any spaces and tabs begins with a ", the closing " is also present. It is therefore only necessary to check whether the value begins with a quote: match($2, /^"/).

grail · 03-20-2010, 08:30 AM

Hi catkin

Your last bit of info there is not quite right with regard to my script, as it needs the full match to work.
If I only look for " at the start then RLENGTH is the length of that match, which is always
1. By using the whole regex ".*", RLENGTH instead covers every character from the first " to the last.

eg. where $2 contains "5. #/dev/cdrom X" then RLENGTH = 20
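
The difference in RLENGTH between the two patterns can be seen with a short sketch. The helper name rlength_of and the sample line are assumptions for illustration and are not part of either script in this thread.

```shell
# Hypothetical helper: print RLENGTH after matching the regex in $1
# against each line of stdin.
rlength_of() {
    awk -v re="$1" '{ match( $0, re ); print RLENGTH }'
}

printf '%s\n' '"/dev/sr0" # comment' | rlength_of '^"'      # prints 1
printf '%s\n' '"/dev/sr0" # comment' | rlength_of '".*"'    # prints 10
```

With the second pattern, substr( $0, 2, RLENGTH - 2 ) yields the text between the quotes. Because .* is greedy, the match runs to the last " on the line, so a " inside a trailing comment would lengthen it.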

catkin · 03-20-2010, 08:39 AM

Quote:

Originally Posted by grail

Hi catkin

Your last bit of info there is not quite right with regard to my script, as it needs the full match to work.
If I only look for " at the start then RLENGTH is the length of that match, which is always
1. By using the whole regex ".*", RLENGTH instead covers every character from the first " to the last.

eg. where $2 contains "5. #/dev/cdrom X" then RLENGTH = 20

Sorry, grail -- I missed that RLENGTH use.

catkin · 03-23-2010, 05:40 AM

In case anyone is interested, this thread helped in developing a bash function with embedded awk to parse lines from Bacula .conf files. Here it is. Suggestions for doing it more elegantly appreciated.

Code:

#--------------------------
# Name: parse_conf_line
# Purpose: parses a conf file line
# Usage:
#   $1: line to parse
# Global variables set: keyword, keyword_org, conf_values[]
#--------------------------
function parse_conf_line {

    fct "${FUNCNAME[0]}" "started. \$1: '$1'"

    local line

    line=$1

    #echo DEBUG: $LINENO: eval "$( echo "$line" | $awk '
    eval "$( echo "$line" | $awk '
        BEGIN {
            #FS = "[ \t]"
            squote = "\047"
            n_values=0
        }

        {
            # Strip any comment and any spaces+tabs before it
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # This is not so easy because a # within a
            # quoted string does not introduce a comment and
            # an escaped " (that is \") does not terminate a
            # quoted string.
            in_string = 0                                   # False
            for ( i = 1; i <= length( $0 ); i++ )
            {
                char = substr( $0, i, 1 )
                if ( char == "#" && in_string == 0 )
                {
                    $0 = substr( $0, 1, i - 1 )
                    sub( /[ \t]*$/, "", $0 )
                    break
                }
                else if ( char == "\"" )
                {
                    if ( in_string == 0 ) in_string = 1
                    else if ( substr( $0, i - 1, 1 ) != "\\" ) in_string = 0
                }
            }

            # Get keyword and value(s) string
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            split( $0, array, /[ \t]*=[ \t]*/ )
            keyword = array[1]
            print "keyword_org=" squote keyword squote
            keyword = tolower( keyword )
            gsub( /[ \t]*/, "", keyword )                   # Remove any spaces and tabs from keyword
            print "keyword=" keyword

            # Get individual values from value(s) string
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            values_string = array[2]
            while ( length( values_string ) > 0 )
            {
                value = ""
                if ( substr( values_string, 1, 1 ) == "\"" )
                {                                           # Value is a quoted string
                    buf = substr( values_string, 2, match( values_string, /[^\\]"/ ) - 1 )
                    # Strip quoted string just taken
                    values_string = substr( values_string, length( buf ) + 3 )
                    # Copy to value, processing any escapes
                    for ( i = 1; i <= length( buf ); i++ )
                    {
                        char = substr( buf, i, 1 )
                        if ( char != "\\" ) value = value char
                        else
                        {                                   # Backslash escape
                            if ( substr( buf, i + 1, 1 ) == "\\" )
                            {                               # Escaped \ so keep one
                                value = value "\\"
                                i++
                            }
                        }
                    }
                }
                else
                {                                           # Value is unquoted
                    value = values_string
                    sub( /[ \t].*$/, "", value )            # Strip anything after space or tab
                    # Strip value string just taken
                    values_string = substr( values_string, length( value ) + 1 )
                }

                # Write shell script variable assignment
                # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                print "conf_values[" n_values++ "]=" squote value squote
                if ( n_values > 10 ) exit

                # Clean up for the next loop pass
                # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                sub( /^[ \t]*/, "", values_string )         # Strip leading spaces and tabs
            }
        }' \
    )"

    fct "${FUNCNAME[0]}" 'returning'

}  # end of function parse_conf_line
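
One thing to watch in the function above: the awk writes each value between single quotes for eval, so a value that itself contains a single quote (such as Brian's CV) would end the quoted string early. A possible workaround, sketched here under the assumption that values arrive one per line, is to replace each ' with '\'' before wrapping. The helper name shell_quote is hypothetical and is not part of the function above.

```shell
# Hypothetical helper: wrap each input line in single quotes so that it
# is safe to eval, replacing every embedded ' with '\''.
shell_quote() {
    awk 'BEGIN { squote = "\047" }
    {
        out = ""
        for ( i = 1; i <= length( $0 ); i++ ) {
            char = substr( $0, i, 1 )
            if ( char == squote ) out = out squote "\\" squote squote
            else out = out char
        }
        print squote out squote
    }'
}

printf '%s\n' "Brian's CV" | shell_quote          # prints 'Brian'\''s CV'
eval "value=$( printf '%s\n' "Brian's CV" | shell_quote )"
echo "$value"                                     # prints Brian's CV
```

The same character loop could be applied to value just before the conf_values[...] line is printed.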

grail · 03-23-2010, 08:04 AM

Hey catkin

Good work on the finished script. Thought I would just throw my hand in one more time to include
your extra slashes and your values waiting to be eval'ed:

Code:

#!/usr/bin/awk -f

BEGIN {
	FS="[ \t]*=[ \t]*"
	OFS="="
	IGNORECASE=1
	f = 1
	g = 0
	cnt = 0
}

{ 
	key_org = "'"$1"'"
	key = tolower($1)
	gsub(/[ \t]+/,"",key)
}

f && match($2, /".*"/){

	value = substr($2, 2, (RLENGTH - 2))

	if (value ~ /\\\\/)
		gsub(/\\\\/, "SAVE", value)

	gsub(/\\+/, "", value)
	gsub("SAVE", "\\", value)

	f = 0
	g = 1
}

f{
	gsub(/[ \t\"\\]+|#.*$/,"",$2)

	if( key ~ /archivedevice/ )
		value = $2

}

g{
	f = 1
	g = 0
}

{
	print "keyword_org="key_org
	print "keyword="key
	print "conf_values["++cnt"]='"value"'"
}

Thanks for the learning

catkin · 03-23-2010, 11:10 AM

Quote:

Originally Posted by grail

Thanks for the learning

Glad you're enjoying the challenge. New test file attached FYI. It adds to the previous test file by including:

Comments not following an "=".
Multiple values after a "keyword ="
"Resource definition" stanzas starting with a "JobCycle {" line and ending with a "}" line. For these, the awk should set keywords "jobcycle{" and "}".