Data manipulation with awk

chrisF682 · 09-24-2011, 09:27 AM

Hi all,

I have a log file with the following format:

text line 1
text line 2
text line 3
timestamp1 data1 data2 data3 data4
timestamp2 data1 data2 data3 data4
...

data1, 2 and 4 are numbers in decimal, data3 is a number in hex like "A00"
Now I want to copy the whole file into a new file and while copying, data3 should be converted into decimal.
How can I do this with a script? I guess it works with awk, but how??

Regards,

Chris

grail · 09-24-2011, 09:34 AM

Maybe this will give you some ideas:

Code:

awk 'BEGIN{x=0xa00;printf "%d\n", x}'

Nominal Animal · 09-24-2011, 03:37 PM

Actually, it is not that simple. 'mawk' or 'gawk --non-decimal-data' can convert 0xHH-format hexadecimal numbers in the input. Grail's example shows that you can use hexadecimal constants within awk scripts, but here the hexadecimals are in the input, not in the script code. So,

Code:

mawk '{ $3 = int($3); print }' input-file > output-file

    # or

gawk '{ $3 = strtonum($3); print }' input-file > output-file

should work if the third fields contain the 0x prefixes. If they don't, they're interpreted as decimal (decimal integers for mawk).

If your data does not contain the 0x prefixes, or only contains them occasionally, you can use regular expression matching to add the prefix if necessary:

Code:

mawk '{ if ($3 !~ /^0[Xx]/) $3 = "0x" $3
        $3 = int($3)
        print
      }' input-file > output-file

    # or

gawk '{ if ($3 !~ /^0[Xx]/) $3 = "0x" $3
        $3 = strtonum($3)
        print
      }' input-file > output-file
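If your awk has neither strtonum() nor hexadecimal input conversion, something like the following might serve as a fallback. It is only a sketch in plain POSIX awk: the names hex_field3_to_dec and hex2dec are made up for illustration, and it assumes the hexadecimal value sits in the third field, with or without a 0x prefix. Records whose third field does not look like a hex number are passed through unchanged.

```shell
# Hypothetical helper: convert a hexadecimal third field to decimal.
hex_field3_to_dec() {
    awk '
    # Convert a string of hex digits (optional 0x prefix) to a number.
    function hex2dec(s,    i, n) {
        sub(/^0[Xx]/, "", s)
        s = toupper(s)
        n = 0
        for (i = 1; i <= length(s); i++)
            n = n * 16 + index("0123456789ABCDEF", substr(s, i, 1)) - 1
        return n
    }
    # Only touch records whose third field looks like a hex number.
    NF >= 3 && $3 ~ /^(0[Xx])?[0-9A-Fa-f]+$/ { $3 = hex2dec($3) }
    { print }
    ' "$@"
}
```

It would be used as hex_field3_to_dec input-file > output-file. Keep in mind that a purely numeric word in third position on a text line (say "10") also looks like hex and would be converted.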

Note that awk has no explicit string concatenation operator: writing two expressions next to each other, as in a b, concatenates a and b into one string. It can be a bit confusing if you don't remember it. For example, c = 1 2 is the same thing as c = "12" in awk scripts.
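A quick way to see the concatenation rule in action, using nothing beyond standard awk (the variable names are just for illustration):

```shell
joined=$(awk 'BEGIN {
    c = 1 2                   # two values side by side are joined as strings
    prefix = "0x"
    digits = "A00"
    print c, prefix digits    # the comma separates print arguments with OFS
}')
echo "$joined"                # prints: 12 0xA00
```

The last print shows the difference between the two forms: a comma yields separate output fields, while plain juxtaposition glues the strings together.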

Finally, note that the command will generate a second file. You should always check that the output file contents look sane, and with filters such as this one, that all records (lines) were converted. You can use wc -l to count the lines in a file or files. (In a script I consider it enough to check that the gawk or mawk command did not fail.)
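The line-count check could be sketched like this. The helper name same_line_count is hypothetical, and it only compares the number of lines, not their contents:

```shell
# Hypothetical helper: succeeds only if both files have the same number of lines.
same_line_count() {
    # Arithmetic expansion strips any padding that wc puts around the count.
    [ "$(( $(wc -l < "$1") ))" -eq "$(( $(wc -l < "$2") ))" ]
}
```

For example: same_line_count input-file output-file || echo "line counts differ" >&2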

Hope this helps,

chrisF682 · 09-25-2011, 03:53 AM

Hi Nominal Animal,

thank you very much for your response!
The "gawk '{ $3 = strtonum($3); print }'" does exactly what I want - unbelievable that it was so simple!
One small thing is left over: gawk changes the TABs of the input file into spaces. I guess there is one more simple trick to keep the TABs?

Best regards,

Chris

grail · 09-25-2011, 04:34 AM

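You will need to set OFS to a tab. Use an explicit action rather than the assignment as a pattern, otherwise records whose third field converts to 0 are silently dropped:

Code:

gawk '{ $3 = strtonum($3); print }' OFS="\t" file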
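To see why OFS matters: whenever a field is assigned, awk rebuilds the record by joining all fields with OFS, which defaults to a single space. A small sketch with a made-up sample record, where a placeholder "X" stands in for the converted value so that it runs on any awk:

```shell
# Sample record with tab-separated fields (contents are made up).
line=$(printf 'ts1\t10\t0xA00\t7')

# Default OFS: assigning to a field rebuilds the record with single spaces.
with_spaces=$(printf '%s\n' "$line" | awk '{ $3 = "X"; print }')

# OFS set to a tab: the rebuilt record has tabs between the fields.
with_tabs=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $3 = "X"; print }')

printf '%s\n%s\n' "$with_spaces" "$with_tabs"
```

Note that this puts exactly one tab between every pair of fields, whatever the input used there.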

Nominal Animal · 09-25-2011, 02:17 PM

Grail is quite correct. Here is a more complex solution, though.

When you have both spaces and tabs as field separators, and you'd prefer to keep them exactly as-is, you can use a regular expression to replace just the bit you want (using gensub()). Here is an example for ChrisF682's case:

Code:

gawk '
 BEGIN { RS="(\n|\r|\n\r|\r\n)"
         FS="[\t\v\f ]+"
       }

       { line = $0
         sub(/^[\t\v\f ]+/, "")    # drop indentation before the field is read
         if ($3 ~ /^0[Xx]/)
             value = strtonum($3)
         else
             value = strtonum("0x" $3)
         line = gensub(/([\t\v\f ]*)[^\t\v\f ]+/, "\\1" value, 3, line)
         printf("%s%s", line, RT)
       }

     ' input-file > output-file

I prefer to put setting the record (line) and field separators in a BEGIN rule; I think it is clearer this way. This one supports all newline conventions (Unix, old Mac, weird, and DOS, respectively). The field separator accepts any number of consecutive tabs, vertical tabs, formfeeds, and spaces as a single field separator. Although GNU awk supports character classes, they depend on the locale; I like to spell out exactly what I want.

The record rule starts very much like before. The decimal value of the third field is stored in variable value.

The variable line is initialized to contain the entire record (without the newline at end), exactly as gawk read it.

The gensub() replaces the third match of (possible field separator and some field contents) with (the same field separator and the converted decimal value). Really, it does all the hard work here. If there is no third match, it changes nothing.
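To try the gensub() step on its own, a small wrapper like the following might help. The name replace_token is made up, gensub() is a gawk extension so gawk must be installed, and the replacement value is assumed to be a plain number (characters such as & or a backslash would be interpreted specially):

```shell
# Hypothetical helper: replace the Nth whitespace-delimited word on each line,
# keeping the whitespace in front of it exactly as it was.
replace_token() {    # usage: replace_token N VALUE < input
    gawk -v n="$1" -v v="$2" '
        { print gensub(/([\t\v\f ]*)[^\t\v\f ]+/, "\\1" v, n + 0) }'
}
```

For example, printf 'ts1\t10  0xA00 7\n' | replace_token 3 2560 leaves the tab and the double space in place and only swaps 0xA00 for 2560.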

If you want to modify multiple fields at once, I intentionally wrote the scriptlet so that you can just duplicate (and modify) the gensub() line.

The printf line prints the possibly modified record, but also the same newline gawk saw, RT. If you want the output to use Unix newlines, you can change the printf line to printf("%s\n", line) instead.

As a reminder, this gawk scriptlet works on the third field of each input record (line). The field number appears four times: three times as $3 and once as the 3 passed to gensub(). If you were to change all four occurrences to say 1, it would work on the first field instead.

Note that the scriptlet assumes that if a record (line) begins with whitespace, that whitespace is just indentation. It does NOT mean that the first field is empty. However, this is just how awk normally works, so it should not be a surprise to anyone.
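If you want to check how your awk treats leading whitespace, a sketch like this shows it (the sample line is made up). One thing to watch: the "indentation is ignored" behaviour belongs to the default FS. With an explicit regular expression as FS, leading whitespace counts as a separator and the first field comes out empty.

```shell
# Default FS: leading blanks are skipped, so the first field is "a".
default_fs=$(printf '  a b c\n' | awk '{ print NF ":" $1 }')

# Regular-expression FS: the leading blanks separate an empty first field.
regex_fs=$(printf '  a b c\n' | awk -F '[ ]+' '{ print NF ":" $1 }')

printf '%s\n%s\n' "$default_fs" "$regex_fs"    # prints 3:a and then 4:
```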

I hope you find this useful, or at least interesting.