[SOLVED] How to parse and modify these keywords using shell script?

corone · 04-14-2011, 12:04 PM

Hello,

Code:

system information

model = xxx
specs = yyy
mode = zzz

model = iii
specs = jjj
mode = kkk

system information

model = aaa
specs = bbb
mode = ccc

model = ddd
specs = eee
mode = fff

The file lists each model's information in the format above.
I don't think it is a good format, but I cannot change it.

I need to change the model name, for example 'model = xxx' to 'model = abc'.
So I tried the following.

Code:

sed -i "/system information/,/model = /s/model = .\+/model = abc/" filename

But this script changed not only 'model = xxx' but also 'model = aaa' to 'model = abc'.
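
For reference, here is a small reproduction of that behaviour, as a sketch only. A sed address range opens again every time its first pattern matches, so /system information/,/model = / opens once per section. The sample data and the function names are made up for illustration, and .* is used instead of GNU's .\+ so that the commands also run on non-GNU sed.

```shell
# Hypothetical sample data: two sections, two devices each.
sample() {
    printf '%s\n' \
        'system information' '' \
        'model = xxx' 'specs = yyy' '' \
        'model = iii' 'specs = jjj' '' \
        'system information' '' \
        'model = aaa' 'specs = bbb' '' \
        'model = ddd' 'specs = eee'
}

# The range restarts at every "system information" line,
# so the first model line of EVERY section is rewritten.
every_section() {
    sed '/system information/,/model = /s/model = .*/model = abc/'
}

# Starting the range at line 1 instead makes it open only once,
# so only the first model line in the whole file is rewritten.
first_only() {
    sed '1,/model = /s/model = .*/model = abc/'
}

sample | every_section | grep -c 'model = abc'   # number of changed lines
sample | first_only | grep -c 'model = abc'
```

This only covers the first model line; picking the second model line of the second section needs a counter, which sed does not handle comfortably.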

And I don't know how to find and modify 'model = iii' and 'model = ddd'.

The only way to identify 'model = ddd' is that it is the second 'model = ' after the second 'system information'. But how do I match the second occurrence of a keyword?
Is it possible with 'sed'?

I have to modify this file from time to time, using a shell script if possible.
Python is OK, but a shell script is better for me.

Thank you.

Nominal Animal · 04-14-2011, 06:34 PM

Quote:

Originally Posted by corone

I don't think it is a good format, but I cannot change it.

I need to change the model name, for example 'model = xxx' to 'model = abc'.

If that's all, then

Code:

sed -e 's|^\([\t ]*model[\t ]*=[\t ]*\)xxx[\t ]*$|\1abc|' -i file

should work -- I added the patterns for optional horizontal whitespace --, but you'll have quite a hard time getting the xxx and abc strings embedded correctly in the expression. You could use awk instead:

Code:

awk -v "old=xxx" -v "new=abc" '($2 == "=" && tolower($1) == "model" && tolower($3) == tolower(old)) { $3 = new } { print $0 }' infile

With awk, you'll have to redirect the output to a new file, and replace the original file if successful, though. The tolower()s make the comparisons case insensitive. Put that into a shell script, with proper temporary file handling, usage, and so on:

Code:

#!/bin/bash
if [ $# -ne 3 ]; then
    echo "" >&2
    echo "Usage: $0 [ -h | --help ]" >&2
    echo "       $0 old-model new-model datafile" >&2
    echo "" >&2

    [ $# -eq 0 ] && exit 0
    [ "$1" == "-h" ] && exit 0
    [ "$1" == "--help" ] && exit 0
    exit 1
fi

# Create a safe autodeleted temporary directory.
WORK="`mktemp -d`" || exit $?
trap "rm -rf '$WORK'" EXIT

# Run the awk command.
awk -v "oldmodel=$1" -v "newmodel=$2" '
    BEGIN {
        # Set record separator to any newline convention.
        RS = "(\r\n|\n\r|\r|\n)"

        # Old model is case insensitive.
        oldmodel = tolower(oldmodel)
    }

    ($2=="=" && tolower($1)=="model" && tolower($3) == oldmodel) {
        $3 = newmodel
    }

    {   print $0
    }

' "$3" > "$WORK/file" || exit $?

# Success. Clone the original file mode, if possible.
chmod --reference="$3" "$WORK/file" &>/dev/null

# Replace the original file.
mv -f "$WORK/file" "$3" || exit $?

# Done.
exit 0

However, the above awk expression will not handle values with whitespace in them. It also ignores the section header you might have in your file.
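
To see the whitespace limitation concretely, here is a small sketch (the sample lines and the function name are made up). With the default field splitting, a value like 'old model' is spread over $3 and $4, so the comparison against $3 never matches. When a match does happen, assigning to $3 makes awk rebuild the record with single spaces, so the original spacing is lost.

```shell
# The awk one-liner from above, wrapped in a hypothetical helper.
swap_model() {   # usage: swap_model OLD NEW < input
    awk -v "old=$1" -v "new=$2" '
        ($2 == "=" && tolower($1) == "model" && tolower($3) == tolower(old)) { $3 = new }
        { print $0 }'
}

# Value with a space: $3 is only "old", so nothing is replaced.
printf 'model = old model\n' | swap_model 'old model' 'abc'

# Irregular spacing: the match works, but the record is rebuilt
# with single spaces between the fields.
printf '  model   =   xxx\n' | swap_model 'xxx' 'abc'
```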

If you need to limit the replacement to under a specific header, and/or your value strings may contain whitespace, I'd use a bit more complex awk part in the bash + awk script:

Code:

#!/bin/bash
if [ $# -ne 4 ]; then
    echo "" >&2
    echo "Usage: $0 header oldmodel newmodel file" >&2
    echo "" >&2
    [ $# -eq 0 ] && exit 0
    [ "$1" == "-h" ] && exit 0
    [ "$1" == "--help" ] && exit 0
    exit 1
fi
HEADER="$1"
OLD="$2"
NEW="$3"
FILE="$4"

# Create an automatically removed temporary directory.
WORK="`mktemp -d`" || exit $?
trap "rm -rf '$WORK'" EXIT

# Run awk with the necessary parameters, saving the output to a temporary file.
awk -v "header=$HEADER" -v "old=$OLD" -v "new=$NEW" '
    BEGIN {
        # Record separator is a newline, in any convention.
        RS="(\r\n|\n\r|\r|\n)"

        # Field separator is =, including any whitespace around it.
        FS="[\t\v\f ]*=[\t\v\f ]*"

        # header and old are case insensitive; convert to lower case.
        header = tolower(header)
        old = tolower(old)

        # Not within the correct header section yet.
        active=0
    }

    (NF == 1) {
        # Trim out whitespace and comments from the header string.
        value = tolower($0)
        gsub(/[\t\n\v\f\r ]+/, " ", value)
        sub(/^ +/, "", value)
        sub(/[#;].*$/, "", value)
        sub(/ +$/, "", value)

        # Set active nonzero if this is the correct header.
        active = (value == header)
    }

    (NF >= 2 && active && $1 ~ /^[\t\n\v\f\r ]*[Mm][Oo][Dd][Ee][Ll]$/) {

        # Trim out whitespace from the model string.
        value = tolower($2)
        gsub(/[\t\n\v\f\r ]+/, " ", value)
        sub(/ +$/, "", value)
        sub(/^ +/, "", value)

        # If matches, replace the old value but retain whitespace.
        if (value == old)
            $0 = gensub(/(=[\t\v\f ]*).*/, "\\1" new, 1, $0)
    }

    {   print $0 }

    ' "$FILE" > "$WORK/file" || exit $?

# Copy the access mode from the original file.
chmod --reference="$FILE" "$WORK/file" &>/dev/null

# Replace the original with the temporary file.
mv -f "$WORK/file" "$FILE" || exit $?

# All done.
exit 0

The latter one also tries very hard to keep whitespace intact in the file. Neither of these is the best solution by any means, but they should help you in writing your own.

Note that the way I redefine RS in the awk scripts means that they accept input files using any newline convention, converting them to standard Unix newlines (\n).
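
A quick way to check the newline conversion, as a sketch assuming an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves it unspecified). The helper name is made up.

```shell
# Hypothetical helper: normalize any newline convention to \n.
to_unix() {
    awk 'BEGIN { RS = "(\r\n|\n\r|\r|\n)" } { print $0 }'
}

# Mixed DOS (\r\n), old Mac (\r) and Unix (\n) line endings in one input;
# od -c makes the remaining line terminators visible.
printf 'a\r\nb\rc\n' | to_unix | od -c
```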

Hope this helps.

grail · 04-14-2011, 07:50 PM

I must have missed something here?? Is there a reason something simple like:

Code:

sed -i '/model/s/xxx/abc/' file

won't work?

Based on the description, I presume that the model name (i.e. xxx) is unique, i.e. no two models will have xxx as a name?

Also, @ Nominal, just a query, as I have seen you use it before, what is the difference between:

Code:

RS="(\r\n|\n\r|\r|\n)"

# and

RS="[\n\r]*" # or maybe a '+' if we assume we want all not including any after the last line

Sorry for small hijack ... just curious

kurumi · 04-14-2011, 07:55 PM

Code:

$ ruby -ane '$F[-1]="abc" if /^model.*=.*xxx$/;puts $F.join("\s")' file

kurumi · 04-14-2011, 08:02 PM

Quote:

Originally Posted by grail

Also, @ Nominal, just a query, as I have seen you use it before, what is the difference between:

Code:

RS="(\r\n|\n\r|\r|\n)"

# and

RS="[\n\r]*" # or maybe a '+' if we assume we want all not including any after the last line

Sorry for small hijack ... just curious

Your version, RS="[\n\r]*" (or +), swallows whole runs of newlines, so on this test file it gives the same records as RS="", while Nominal's version turns every empty line into an empty record. A simple test shows this:

Code:

$ cat file
1

2



3

4

$ awk 'BEGIN{ RS=""}{print "->"$0}' file
->1
->2
->3
->4
$ awk 'BEGIN{ RS="(\r\n|\n\r|\r|\n)"}{print "->"$0}' file
->1
->
->2
->
->
->
->3
->
->4

$ awk 'BEGIN{ RS="[\n\r]+"}{print "->"$0}' file
->1
->2
->3
->4

I believe awk is able to take care of universal record separator so there's actually no need to set RS to RS="(\r\n|\n\r|\r|\n)".
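
One caveat, shown as a small sketch with made-up input: RS="" (paragraph mode) and RS="[\n\r]+" only agree when every line is followed by a blank line, as in the test file above. When two lines are adjacent, paragraph mode keeps them in one record, while the regular expression splits them. The helper name is hypothetical, and a regex RS again assumes gawk or mawk.

```shell
# Hypothetical helper: count the records awk sees for a given RS value.
count_records() {   # usage: count_records RS < input
    awk -v "RS=$1" 'END { print NR }'
}

# "a" and "b" are adjacent lines; "c" follows a blank line.
printf 'a\nb\n\nc\n' | count_records ''          # paragraph mode: "a\nb" and "c"
printf 'a\nb\n\nc\n' | count_records '[\n\r]+'   # regex: "a", "b" and "c"
```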

grail · 04-14-2011, 11:01 PM

@kurumi - thanks for the explanation. It makes sense, but I believe the idea was to handle the case where the file was written on DOS:

Quote:

Originally Posted by Nominal Animal

Note that the way I redefine RS in the awk scripts means that they accept input files using any newline convention, converting them to standard Unix newlines (\n).

corone · 04-14-2011, 11:57 PM

I am so happy that you gave me many answers.
Thank you.
But I should have told you more.

The above information is just an example.

Code:

system information

model = 
specs = 
mode = 

model = 
specs = 
mode = 

system information

model = 
specs = 
mode = 

model = 
specs = 
mode =

I don't know the previous data at all.
I just know the format.

So the script cannot rely on matching the old values, such as 'xxx' or 'yyy', in the file.
That's why I mentioned 'the second model' and 'the second system information'.

Please, give me more help.
Thank you.

Nominal Animal · 04-15-2011, 01:15 AM

Quote:

Originally Posted by grail

Also, @ Nominal, just a query, as I have seen you use it before, what is the difference between:
RS="(\r\n|\n\r|\r|\n)" and RS="[\n\r]*"

RS="(\r\n|\n\r|\r|\n)" matches all newline conventions, and counts empty records, whereas RS="[\n\r]*" does not count empty records.

Quote:

Originally Posted by kurumi

I believe awk is able to take care of universal record separator so there's actually no need to set RS to RS="(\r\n|\n\r|\r|\n)".

Unfortunately, GNU awk 3.1.6 at least does not consider \r by itself a newline; the default RS seems to be equivalent to RS="\r?\n". Mac OS prior to X used \r as a newline, and I still sometimes run into such files.
_ _ _

As you might guess, I thought a bit further about processing configuration files, and reread the relevant sections in the GNU Awk User Manual. I realized there is a very simple way to set up RS and FS, and use RT and OFS to handle all this stuff.

Here is my skeleton awk script for parsing name=value -type configuration files. It automatically retains all whitespace, including newline conventions, even when the value is replaced. Furthermore, it fully supports shell-type comments, as long as they either start at the beginning of the line, or are preceded by whitespace. The comments are not shown in $0, and your rules need not worry about them at all. Note that it relies on GNU awk extensions (RT and the three-argument form of match()).

Code:

BEGIN {
    RS = "[\t\n\v\f\r ]*[\r\n]+[\t\n\v\f\r ]*"
    FS = "[\t\v\f ]*=[\t\v\f ]*"
}

# Retain newline convention and whitespace.
{
    ORS = RT
    if (match($0, /[\t\v\f ]*=[\t\v\f ]*/, recs))
        OFS = recs[0]
    else
        OFS = " = "
}

# Handle comment lines.
($1 ~ /^[\t\v\f ]*#/) {
    print $0
    next
}
($0 ~ /[\t\v\f ]#/) {
    rec = $0
    sub(/[\t\v\f ]#.*$/, "", $0)
    ORS = substr(rec, 1 + length($0)) RT
}

# Do your normal processing here.
# If (NF==2), name is in $1 and value in $2.
# If (NF==1) you have a header record.
# You can replace $1 or $2, and whitespace (and comment)
# will still be retained intact.
# If you want to delete the record, use
#     print "" ; next

# This here line prints the current record to output.
{ print $0 }

grail · 04-15-2011, 01:25 AM

Well, whether you use Nominal's code (looks cool btw), sed or ruby, you are going to need to know something about what you need changed; otherwise the value will be changed everywhere that 'model' appears. I am not saying you need to know the exact name, but you would have to know at least which section and which entry, i.e. first system information section and the second model entry.

So I think you will need to know your data a little better before we can truly help you further.

Nominal Animal · 04-15-2011, 01:44 AM

Quote:

Originally Posted by corone

So the script cannot rely on matching the old values, such as 'xxx' or 'yyy', in the file.
That's why I mentioned 'the second model' and 'the second system information'.

Here's a bash + awk script which takes four parameters: the section number, the model number, the new model value, and the file name. The script retains whitespace and shell-type comments (beginning with #). Any line which does not have a = and is not empty or a comment is a header line, and starts a new section. The part before the first header line in the file is section 0. Run the script without parameters to see the usage.

Code:

#!/bin/bash

usage () {
    exec >&2
    echo ""
    echo "Usage: $0 [ -h | --help]"
    echo "       $0 section model value file"
    echo ""
    echo "This script replaces the model'th model line"
    echo "under the section'th header in file 'file' with 'value'."
    echo ""
    exit $1
}

[ $# -eq 0 ] && usage 0
[ "$1" == "-h" ] && usage 0
[ "$1" == "--help" ] && usage 0
[ $# -ne 4 ] && usage 1

SECTION="$1"
INDEX="$2"
NEWMODEL="$3"
FILE="$4"

WORK="`mktemp -d`" || exit $?
trap "rm -rf '$WORK'" EXIT

awk -v "model=$NEWMODEL" -v "section=$SECTION" -v "occurrence=$INDEX" '

    BEGIN {
        RS = "[\t\n\v\f\r ]*[\r\n]+[\t\n\v\f\r ]*"
        FS = "[\t\v\f ]*=[\t\v\f ]*"

        section = int(section)
        currsect = 0
        currmodel = 0
    }

    {
        ORS = RT
        if (match($0, /[\t\v\f ]*=[\t\v\f ]*/, recs))
            OFS = recs[0]
        else
            OFS = " = "
    }
    ($1 ~ /^[\t\v\f ]*#/) {
        print $0
        next
    }
    ($0 ~ /[\t\v\f ]#/) {
        rec = $0
        sub(/[\t\v\f ]#.*$/, "", $0)
        ORS = substr(rec, 1 + length($0)) RT
    }

    (NF == 1) {
        currsect++
        currmodel = 0
    }

    (NF == 2 && currsect == section && tolower($1) == "model") {
        currmodel++
        if (currmodel == occurrence)
            $2 = model
    }

    {
        print $0
    }

' "$FILE" > "$WORK/file" || exit $?

if cmp -s "$FILE" "$WORK/file" ; then
    echo "$FILE: No changes." >&2
    exit 0
fi

chmod --reference="$FILE" "$WORK/file" &>/dev/null
mv -f "$WORK/file" "$FILE" || exit $?

echo "$FILE: Modified successfully." >&2
exit 0

If you save the above script as change-model.bash, then

Code:

./change-model.bash 2 3 'The New Model' models.txt

will change the third model=anything line under the second header to model=The New Model, in file models.txt. The script will tell if the file was modified or not, too.
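
For readers without gawk (the script above relies on gawk's RT and three-argument match()), the counting logic itself can be sketched in portable awk. This is a simplified illustration, not a replacement for the script: it assumes Unix newlines, treats any non-empty line without = as a header, ignores comments, and the function name is made up.

```shell
# Hypothetical helper: set the Nth model line under the Sth header.
set_model() {   # usage: set_model SECTION OCCURRENCE VALUE < input
    awk -v "section=$1" -v "occurrence=$2" -v "value=$3" '
        # A non-empty line without "=" is a header: start a new section.
        ($0 !~ /=/ && $0 ~ /[^ \t]/) { currsect++; currmodel = 0 }

        # Count model lines in the wanted section only.
        (currsect == section && $0 ~ /^[ \t]*[Mm][Oo][Dd][Ee][Ll][ \t]*=/) {
            if (++currmodel == occurrence) {
                # Keep everything up to and including "=" and the blanks
                # after it; replace only the value.
                match($0, /=[ \t]*/)
                $0 = substr($0, 1, RSTART + RLENGTH - 1) value
            }
        }
        { print $0 }'
}

# Change the second model line under the second header.
printf '%s\n' 'system information' 'model = xxx' 'model = iii' \
              'system information' 'model = aaa' 'model = ddd' |
    set_model 2 2 'The New Model'
```

The new value is appended with substr() rather than sub(), so characters such as & in the value are not treated specially. Backslashes in the value are still interpreted by awk's -v option.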

grail · 04-15-2011, 04:10 AM

Nominal's new script only goes further to prove my point: without knowing something about the data, i.e. second section and third model along, all the solutions are kind of void until you bed down some more particulars.

Nominal Animal · 04-15-2011, 04:04 PM

Quote:

Originally Posted by grail

Nominal's new script only goes further to prove my point: without knowing something about the data, i.e. second section and third model along, all the solutions are kind of void until you bed down some more particulars.

I agree.

corone, could you elaborate a bit on exactly what you are doing?

For example, if we knew that you need to modify say a hardware manifest, we could tell you that you'd save a lot of time by splitting the original manifest(s) into manageable parts (by, say, the header line, or by identifier in the section) -- for example, by splitting it into multiple files. Then it'd be much easier to modify each part separately. If you use a safe empty temporary directory to work in, you can name the parts as 001.part-identifier, 002.part-identifier, and so on. Then you can specify the part file name as *.part-identifier for each modification, to only modify specific parts, regardless of their order in the original manifest. Finally, merging the parts back to a single manifest is trivial: cat *.* (in the otherwise empty temporary directory).
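
A rough sketch of that split/modify/merge idea, assuming that every 'system information' line starts a new part. The names (split_parts, NNN.part, the sample manifest) are made up; a real manifest would need its own header test and part identifiers.

```shell
# Hypothetical helper: split FILE into DIR/001.part, DIR/002.part, ...
# starting a new part at every "system information" line.
split_parts() {   # usage: split_parts FILE DIR
    awk -v "dir=$2" '
        /^system information/ || NR == 1 {
            if (out != "") close(out)
            out = sprintf("%s/%03d.part", dir, ++n)
        }
        { print $0 > out }' "$1"
}

# Work in a safe, otherwise empty temporary directory.
work=$(mktemp -d) || exit 1
printf '%s\n' 'system information' 'model = xxx' \
              'system information' 'model = aaa' > "$work/manifest"

mkdir "$work/parts"
split_parts "$work/manifest" "$work/parts"

# Modify only the second part, then merge the parts in name order.
sed 's/^model = .*/model = abc/' "$work/parts/002.part" > "$work/tmp" &&
    mv "$work/tmp" "$work/parts/002.part"
cat "$work/parts"/*.part > "$work/merged"
cat "$work/merged"
rm -rf "$work"
```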

The key is that the more complete picture of the problem we have, the better the solution.

It is always a good idea to describe what you've already tried. However, that is usually not enough; your approach may be inefficient, for example. Telling us the entire task you wish to accomplish, not just the tricky bit you're having a problem with, is therefore important for getting a good solution. It also makes it much easier for others to give you advice. Sure, you'll probably also get advice that is not suitable for you for various reasons, but it never hurts to see how those solutions tick. You may be able to use some nugget in them to improve your solution.

corone · 04-20-2011, 09:23 AM

Thank you, Nominal Animal!!

Code:

./change-model.bash 2 3 'The New Model' models.txt

It works very very well.
That is exactly what I want.

I can never thank you enough.
I really really appreciate your helping me out.

I hope you see my thanks.
And I wish I could work with you in the same office. =]

corone · 04-20-2011, 09:54 AM

Thank you for your advice, grail.
Here is a little more explanation.

There are four devices.

Code:

┌────┐
│Device1│
├────┤
│Device2│
└────┘
┌────┐
│Device3│
├────┤
│Device4│
└────┘

And this is a format for the devices.

Code:

system information

model = 
specs = 

model = 
specs = 

system information

model = 
specs = 

model = 
specs =

The first 'model =' is the model name of Device1.
The second 'model =' is the model name of Device2.
The third 'model =' is the model name of Device3.
The fourth 'model =' is the model name of Device4.

When a device is replaced, the file should be modified automatically by a script.

I don't know the previous model name of Device1.
I only know which device was changed to which model.

I think the format is very stupid.
The following would be much better.

Code:

system information

Device 1 model = 
specs = 

Device 2 model = 
specs =

Or at least,

Code:

system information

model 1 = 
specs = 

model 2 = 
specs =

Anyway, I solved the problem with Nominal's kind help.

grail · 04-20-2011, 10:06 AM

In a way, you have demonstrated that there is more information: the devices run from top to bottom in line with the model and spec information. This now means that your script simply needs to know which occurrence of model it is to change. Hence, if you pass the number 2, it will wait until it finds the second occurrence of model before initiating a change and exit immediately after.

So using Nominal's script you really only need the INDEX and NEWMODEL variables.
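
As a sketch of that simplification (not Nominal's script; a portable, simplified variant with a made-up function name, assuming Unix newlines and no comments), counting model lines across the whole file needs only the occurrence number and the new value:

```shell
# Hypothetical helper: set the Nth model line in the whole file.
set_nth_model() {   # usage: set_nth_model N VALUE < input
    awk -v "n=$1" -v "value=$2" '
        # Count every model line; act only on the Nth one.
        /^[ \t]*[Mm][Oo][Dd][Ee][Ll][ \t]*=/ && ++count == n {
            match($0, /=[ \t]*/)
            $0 = substr($0, 1, RSTART + RLENGTH - 1) value
        }
        { print $0 }'
}

# Device3 is the third model line, whatever section it is in.
printf '%s\n' 'system information' 'model = a' 'model = b' \
              'system information' 'model = c' 'model = d' |
    set_nth_model 3 'new3'
```

Unlike the description above, this sketch keeps reading after the change, because the remaining lines still have to be printed.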

Glad you got it sorted