LinuxQuestions.org
Old 09-24-2011, 09:27 AM   #1
chrisF682
LQ Newbie
 
Registered: Sep 2011
Posts: 8

Rep: Reputation: Disabled
Data manipulation with awk


Hi all,

I have a log file with the following format:

text line 1
text line 2
text line 3
timestamp1 data1 data2 data3 data4
timestamp2 data1 data2 data3 data4
...

data1, 2 and 4 are numbers in decimal, data3 is a number in hex like "A00"
Now I want to copy the whole file into a new file and while copying, data3 should be converted into decimal.
How can I do this with a script? I guess it works with awk, but how??

Regards,

Chris
 
Old 09-24-2011, 09:34 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Maybe this will give you some ideas:
Code:
awk 'BEGIN{x=0xa00;printf "%d\n", x}'
 
Old 09-24-2011, 03:37 PM   #3
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Actually, it is not that simple. Grail's example shows that you can use hexadecimal constants within an awk script, but here the hexadecimal values are in the input, not in the script code. 'mawk' or 'gawk --non-decimal-data' can convert 0xHH-format hexadecimal numbers in input, and gawk's strtonum() function does the same without that option. So,
Code:
mawk '{ $3 = int($3); print }' input-file > output-file

    # or

gawk '{ $3 = strtonum($3); print }' input-file > output-file
should work if the third fields contain the 0x prefixes. If they don't, they're interpreted as decimal (decimal integers for mawk).
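
If neither variant is available, the conversion can also be written by hand. The following is only a sketch using plain POSIX awk features; hex_to_dec_field3 and hex2dec are made-up names, not part of any awk. It accepts the third field with or without a 0x prefix and leaves records whose third field is not hexadecimal untouched. A header word that happens to consist only of hex digits (say "abc") would still be converted.

```shell
# hex_to_dec_field3: hypothetical helper, not from the thread.
# Converts field 3 from hexadecimal to decimal using only POSIX awk.
hex_to_dec_field3() {
    awk '
    # Convert a hex string (optional 0x/0X prefix) to a number.
    function hex2dec(s,    n, i, d) {
        sub(/^0[Xx]/, "", s)
        s = tolower(s)
        n = 0
        for (i = 1; i <= length(s); i++) {
            d = index("0123456789abcdef", substr(s, i, 1))
            if (d == 0) return -1      # not a hex digit
            n = n * 16 + d - 1
        }
        return n
    }
    # Only touch records whose third field looks like hex.
    $3 ~ /^(0[Xx])?[0-9A-Fa-f]+$/ { $3 = hex2dec($3) }
    { print }
    ' "$@"
}

# Usage:
#   hex_to_dec_field3 input-file > output-file
```

As with the one-liners above, assigning to $3 makes awk rebuild the record, so runs of whitespace between fields become single spaces.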

If your data does not contain the 0x prefixes, or only contains them occasionally, you can use regular expression matching to add the prefix if necessary:
Code:
mawk '{ if ($3 !~ /^0[Xx]/) $3 = "0x" $3
        $3 = int($3)
        print
      }' input-file > output-file

    # or

gawk '{ if ($3 !~ /^0[Xx]/) $3 = "0x" $3
        $3 = strtonum($3)
        print
      }' input-file > output-file
Note that awk has no explicit string concatenation operator: the expression a b concatenates the variables a and b into one string. It can be a bit confusing if you don't remember it. For example, c = 1 2 is the same thing as c = "12" in awk scripts.
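
A small demonstration of concatenation by juxtaposition, which should behave the same in any awk. The function name concat_demo and the variable names are just for illustration.

```shell
# Adjacent expressions are concatenated; no operator is written.
concat_demo() {
    awk 'BEGIN {
        c = 1 2                 # same as c = "12"
        prefix = "0x"; digits = "A00"
        print c
        print prefix digits     # the two strings joined: 0xA00
    }'
}
```

Calling concat_demo prints 12 on the first line and 0xA00 on the second.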

Finally, note that the command will generate a second file. You should always check that the output file contents look sane, and with filters such as this one, that all records (lines) were converted. You can use wc -l to count the lines in a file or files. (In a script I consider it enough to check that the gawk or mawk command did not fail.)
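
The line count check could look something like this. It is a sketch only; same_line_count is a hypothetical helper name, and it checks nothing beyond the number of lines.

```shell
# same_line_count: hypothetical helper; succeeds when both files have
# the same number of lines. The command substitutions are left unquoted
# on purpose so that any padding printed by wc is dropped.
same_line_count() {
    [ $(wc -l < "$1") -eq $(wc -l < "$2") ]
}

# Usage:
#   same_line_count input-file output-file || echo "line counts differ" >&2
```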

Hope this helps,
 
Old 09-25-2011, 03:53 AM   #4
chrisF682
LQ Newbie
 
Registered: Sep 2011
Posts: 8

Original Poster
Rep: Reputation: Disabled
Hi Nominal Animal,

thank you very much for your response!
The "gawk '{ $3 = strtonum($3); print }'" does exactly what I want - unbelievable that it was so simple!
One small thing is left over: gawk changes the TABs of the input file into spaces. I guess there is one more simple trick to keep the TABs?

Best regards,

Chris
 
Old 09-25-2011, 04:34 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
You will need to set the OFS to tab:
Code:
gawk '{ $3 = strtonum($3); print }' OFS="\t" file
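
Why this works: assigning to any field makes awk rebuild the record, joining the fields with OFS (a single space by default). A minimal sketch that shows only that effect, without any hex conversion; retab is a made-up name, and -v is used here instead of a trailing assignment so that OFS is set before any input is read.

```shell
# retab: hypothetical helper. The no-op assignment $1 = $1 forces awk
# to rebuild $0, so every run of spaces/tabs becomes one OFS (a tab).
retab() {
    awk -v OFS='\t' '{ $1 = $1; print }' "$@"
}

# Usage:
#   retab input-file > output-file
```

Note that this turns every separator into a tab, including spaces in plain text lines.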
 
Old 09-25-2011, 02:17 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Grail is quite correct. Here is a more complex solution, though.

When you have both spaces and tabs as field separators, and you'd prefer to keep them exactly as-is, you can use a regular expression to replace just the bit you want (using gensub()). Here is an example for ChrisF682's case:
Code:
gawk '
 BEGIN { RS="(\n|\r|\n\r|\r\n)"
         FS="[\t\v\f ]+"
       }

       { if ($3 ~ /^0[Xx]/)
             value = strtonum($3)
         else
             value = strtonum("0x" $3)
         line = $0
         line = gensub(/([\t\v\f ]*)[^\t\v\f ]+/, "\\1" value, 3, line)
         printf("%s%s", line, RT)
       }

     ' input-file > output-file
I prefer to put setting the record (line) and field separators in a BEGIN rule; I think it is clearer this way. This one supports all newline conventions (Unix, old Mac, weird, and DOS, respectively). The field separator accepts any number of consecutive tabs, vertical tabs, formfeeds, and spaces as a single field separator. Although GNU awk supports character classes, they depend on the locale; I like to spell out exactly what I want.

The record rule starts very much like before. The decimal value of the third field is stored in variable value.

The variable line is initialized to contain the entire record (without the newline at end), exactly as gawk read it.

The gensub() replaces the third match of (possible field separator and some field contents) with (the same field separator and the converted decimal value). Really, it does all the hard work here. If there is no third match, it changes nothing.
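
To see the gensub() step in isolation, here is a reduced sketch that replaces the third whitespace-separated word of each line with a fixed string. It requires GNU awk; replace_third_word and the variable new are hypothetical names, and the replacement text is assumed to contain no & or backslash characters.

```shell
# replace_third_word: hypothetical helper illustrating gensub() with a
# numeric third argument; requires GNU awk.
replace_third_word() {
    gawk -v new="$1" '{
        # \\1 re-inserts the captured leading separator unchanged;
        # with no fourth argument, gensub() works on $0.
        print gensub(/([\t ]*)[^\t ]+/, "\\1" new, 3)
    }'
}

# Usage:
#   printf 'a\tb  0xA00\td\n' | replace_third_word 2560
```

The tab after "a" and the two spaces before the third word survive, because only the matched word is swapped out.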

If you want to modify multiple fields at once, I intentionally wrote the scriptlet so that you can just duplicate (and modify) the gensub() line.

The printf line prints the possibly modified record, but also the same newline gawk saw, RT. If you want the output to use Unix newlines, you can change the printf line to printf("%s\n", line) instead.

This gawk scriptlet works on the third field of each input record (line): the field number appears four times, in the three $3 references and as the 3 passed to gensub(). If you were to change all four occurrences to 1, it would work on the first field instead.

Note that the gensub() pattern treats leading whitespace on a record (line) as indentation, not as an empty first field, which is how awk behaves with its default field separator. With a regular-expression FS such as the one set above, however, awk does count leading whitespace as a separator, so on an indented record $3 and the third gensub() match refer to different words.
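
A quick way to check how leading whitespace is counted under each kind of field separator is to print NF for an indented line. This is only a sketch; the two function names are made up for illustration.

```shell
# Default FS (a single space): leading whitespace is stripped first.
count_fields_default() {
    awk '{ print NF }'
}

# Regular-expression FS: leading whitespace acts as a separator,
# so an indented line gets an empty first field.
count_fields_regex() {
    awk 'BEGIN { FS = "[\t ]+" } { print NF }'
}

# Usage:
#   printf '  a b c\n' | count_fields_default
#   printf '  a b c\n' | count_fields_regex
```

For the indented line "  a b c", the first function reports 3 fields and the second reports 4.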

I hope you find this useful, or at least interesting.
 
  


All times are GMT -5. The time now is 05:32 PM.