[SOLVED] Awk with missing fields

Linux_Kidd · 04-19-2012, 10:43 AM

so, another script for me using gnu awk v3.1.5.

i have input files where NF may vary, and fields $7, $9 and $10 may be blank (so it looks like one big space between $8 and $11). how can i handle this in awk?

the reason NF may vary is that one field in the file may look like "name=xyz" or "name=joe doe"

firstfire · 04-19-2012, 11:07 AM

Hi.

If your fields are separated by a single space, you can do

Code:

$ echo '1 2  3 4' | awk -F'[ ]' '{print "["$4"]"}'
[3]

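If you instead set FS to a single space (FS=" ", which is also the default), then by convention awk treats any run of whitespace (spaces, tabs, newlines) as one field separator and ignores leading and trailing whitespace.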
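To see the two behaviours side by side, here is a small check. The sample line is the made-up one from the example above; each command prints NF followed by the fourth field in brackets.

```shell
line='1 2  3 4'   # two spaces between 2 and 3, so one field is empty

# Single-space regex: every space is a separator, so the empty field is counted.
printf '%s\n' "$line" | awk -F'[ ]' '{ print NF ": [" $4 "]" }'   # prints 5: [3]

# Default separator: runs of blanks collapse, so the empty field disappears.
printf '%s\n' "$line" | awk '{ print NF ": [" $4 "]" }'           # prints 4: [4]
```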

millgates · 04-19-2012, 11:08 AM

Can you give an example of the input you have and output you want?
Are all the fields in double quotes?
I think setting the FS variable in awk intelligently may do most of the work for you.

colucix · 04-19-2012, 11:53 AM

Most likely (and hopefully) the actual field separator in the input file is TAB. Please post it (or part of it) as requested, using CODE tags to preserve spacing. Thank you.

Linux_Kidd · 04-19-2012, 12:58 PM

ok, here's a sample from a windows txt file. i haven't looked at it in hex yet.

some knowns about the data:
what seems consistent (they always exist, referencing my output below) is $1 $2 $3 $4 $5 $6 $8 $11 $12
...and $11 $12 always start with a *

Code:

 11/28/11 06:52:15 PEEL BANNANA     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 SYS
 11/28/11 06:52:15 PEEL ORANGE     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 IC
 11/28/11 06:52:15 PEEL APPLE     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 NET
 11/28/11 08:03:46 PEEL FRUIT PRD 2 F 01 APHSIP                     *08*-09 INI   NAME=joe doe     53510 M058
 11/28/11 09:31:17 PEEL GRAPES KRD 2 F 01 APHSIP   EXECUTE  NONE     *08*-88     > DTPI                         53510 M071

the output i want is 16 pipe-delimited fields, like this:

Code:

11/28/11|06:52:15|PEEL|BANNANA|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|SYS
11/28/11|06:52:15|PEEL|ORANGE|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|IC
11/28/11|06:52:15|PEEL|APPLE|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|NET
11/28/11|08:03:46|PEEL|FRUIT|PRD|2 F|01|APHSIP|||*08|*-09|INI|NAME=joe doe|53510|M058
11/28/11|09:31:17|PEEL|GRAPES|KRD|2 F|01|APHSIP|EXECUTE|NONE|*08|*-88|>|DTPI|53510|M071

grail · 04-19-2012, 01:18 PM

firstfire did put you on the right path, but your data is not uniform. Even $5, which you said is always present and which is either PRD or KRD in your example, is not preceded by a consistent delimiter, so we cannot assume that a space (or anything else, for that matter) separates the fields. In my opinion this makes it a lot more difficult. I would suggest that you are now left with treating each field as having a specific length (i.e. field 1 is 8 characters long) and splitting the data on that principle.
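If the columns do line up in the real file, the fixed-length idea could be sketched with substr(). The layout below (8, 8 and 10 characters, each followed by one space), the sample lines and the function name fixed_to_pipes are all invented for illustration; the real widths would have to be measured from the file.

```shell
# Hypothetical layout: date in columns 1-8, time in 10-17, name in 19-28.
fixed_to_pipes() {
  awk '{
    day   = substr($0, 1, 8)
    clock = substr($0, 10, 8)
    name  = substr($0, 19, 10)
    gsub(/ +$/, "", name)            # drop the padding after a short or blank name
    print day "|" clock "|" name
  }'
}

# The second line has a blank name column, which comes out as an empty field.
printf '%s\n' '11/28/11 06:52:15 ORANGE    ' \
              '11/28/11 06:52:16           ' | fixed_to_pipes
```

gawk also has a FIELDWIDTHS variable for this kind of input; substr() is used here only because it works in any awk.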

Linux_Kidd · 04-19-2012, 01:31 PM

yep, this is a pita, certainly a good newb problem to solve, but it's just ascii (hex) and we can always manipulate that. this is the data to work with. the only fields that have a constant length (if they exist) are $1 $2 $6 $11 and $12 (referencing my output); all others can vary in length.

i'll ask if the txt files can be generated using a better delimiter like a "|" char.

i could sed the data 1st, replacing every \s+ with a single space, but then how to determine which fields are actually missing? i'm just trying to help the crew here turn another human-heavy process into an automated one. it really has nothing to do with security. if i can't solve it today i'll likely leave it for someone else, which likely means it will remain a human-heavy process.
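One way to tell which fields are missing without relying on column positions is to anchor on the one token that is always there and has a fixed shape: the $11/$12 pair, which always starts with * (for example *1C*-99). What follows is only a minimal sketch for awk 3.1.5 or any POSIX awk. It assumes that $6 is always two words (like "2 F"), that $7 is numeric when present, that $8 is never purely numeric, that $9 and $10 appear together or not at all, and that only $14 can contain a space. The function name to_pipes is made up for this example.

```shell
# Hypothetical helper: turn report lines on stdin into 16 pipe-delimited fields.
to_pipes() {
  awk '
  {
    sub(/\r$/, "")                                    # drop a Windows CR, if any
    if (!match($0, /\*..\*.../)) next                 # no anchor: skip the line
    anchor = substr($0, RSTART, RLENGTH)
    nl = split(substr($0, 1, RSTART - 1), l, " ")     # tokens left of the anchor
    nr = split(substr($0, RSTART + RLENGTH), r, " ")  # tokens right of the anchor
    if (nl < 8 || nr < 4) next                        # not the expected shape

    i = 8                                             # l[6] and l[7] form "2 F"
    f7 = ""
    if (l[i] ~ /^[0-9]+$/) f7 = l[i++]                # optional numeric $7
    f8 = l[i]; f9 = l[i + 1]; f10 = l[i + 2]          # missing tokens read as ""

    f14 = r[2]                                        # $14 may contain a space
    for (j = 3; j <= nr - 2; j++) f14 = f14 " " r[j]

    left  = l[1] "|" l[2] "|" l[3] "|" l[4] "|" l[5] "|" l[6] " " l[7]
    mid   = f7 "|" f8 "|" f9 "|" f10 "|" substr(anchor, 1, 3) "|" substr(anchor, 4)
    right = r[1] "|" f14 "|" r[nr - 1] "|" r[nr]
    print left "|" mid "|" right
  }'
}

to_pipes <<'EOF'
 11/28/11 06:52:15 PEEL ORANGE     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 IC
 11/28/11 08:03:46 PEEL FRUIT PRD 2 F 01 APHSIP                     *08*-09 INI   NAME=joe doe     53510 M058
 11/28/11 09:31:17 PEEL GRAPES KRD 2 F 01 APHSIP   EXECUTE  NONE     *08*-88     > DTPI                         53510 M071
EOF
```

On these three sample lines the result should match the corresponding rows of the wanted output. Lines without the anchor, or with too few tokens, are skipped silently; a real script would probably want to report them instead.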

David the H. · 04-19-2012, 01:42 PM

gawk, from v4 (I believe), has a new FPAT/patsplit feature, for defining fields based on regex patterns. It might help you work this one out.

http://www.gnu.org/software/gawk/man...ing-By-Content

Linux_Kidd · 04-19-2012, 01:56 PM

thnx Dave,
stuck with gnu 3.1.5 for now

well, if i sed -i 's/\s/|/g' my data file i get something that may be usable as the new output has constant NF, and data is in predictable field locations.

i should be able to get it to work, need to sed 1st, then awk it. will let you know.

as an example, i can do an if statement like: if ($10 == "") print $9, $13; else print $10, $11

tricky, but seems doable.
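Once the data is pipe-delimited, a test like that could look roughly as follows. The sample lines, the field positions and the function name pick are placeholders, not the real layout. Note that awk compares with ==, since a single = would assign to the field.

```shell
# Placeholder data: 4 pipe-delimited fields, the third of which may be empty.
pick() {
  awk -F'|' '{
    if ($3 == "") {
      print $2, $4      # third field blank: fall back to its neighbours
    } else {
      print $3
    }
  }'
}

printf '%s\n' 'a|b||d' 'a|b|c|d' | pick
```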

after sed:

Code:

|11/25/11|07:01:17|PEEL|CT|||||PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=CT||||||||||||||||||||47736|
S30
|11/25/11|07:01:17|PEEL|CT|||||PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=CT||||||||||||||||||||47736|
VMC
|11/25/11|07:46:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*0C*-0A|INI|||NAME=xxx|yyy||||||||47736|
N438
|11/25/11|07:57:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*0C*-0A|INI|||NAME=xxx|yyy||||||||47736|
N396
|11/25/11|08:00:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
N309
|11/25/11|08:00:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
N309
|11/25/11|08:06:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||47736|
M366
|11/25/11|08:09:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||47736|
N430
|11/25/11|08:11:48|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy||||||||||47736|
M118
|11/25/11|08:12:48|PEEL|REM|REM|2|F|01|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:15:48|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||||47736|
N455
|11/25/11|08:22:18|PEEL|REM|REM|2|F|02|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:28:18|PEEL|REM|REM|2|F|03|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:32:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
M398
|11/25/11|08:40:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy||||||||47736|
M236
|11/25/11|08:40:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*10*-0B|INI|||NAME=xxx|yyy||||||47736|
N075
|11/25/11|08:41:48|PEEL|REM|REM|2|F|04|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432

grail · 04-19-2012, 03:28 PM

The sed is not required as this would be the same as the solution provided by firstfire, ie by using FS = " " then this is now equivalent to what you created with sed.

Linux_Kidd · 04-19-2012, 03:36 PM

FS of space is \s+ (is this correct, one or 40 spaces is considered a single FS?)
and this would not produce the same NF as sed did.
sed at least gave me constant NF

some fields seem to be predictable in max size, 8char max. i am off for a few days, will look at it next week. thnx.

grail · 04-19-2012, 03:56 PM

No, not \s+, but the exact example I provided:

Code:

FS=" "

That is a single space between the quotes.

millgates · 04-19-2012, 04:27 PM

This is a very clumsy and ugly solution, but you may be able to create a regex that matches the whole line. For example, an expression like this one could work for your input as in post #5:

Code:

perl -pe 's/\s*([0-9\/]+)\s+([0-9:]+)\s+(\w+)\s+(\w+)\s+([PK]RD)\s+2 F\s+([0-9]+)*\s+APHSIP\s+(EXECUTE)*\s+(NONE)*\s+(\*..)\s*(\*...)\s+(\S+)\s+(NAME=[^0-9]+|\S+)\s*(\w+)\s+(\w+).*/$1|$2|$3|$4|$5|2 F|$6|APHSIP|$7|$8|$9|$10|$11|$12|$13|$14/' | sed -r 's/\s+\|/|/g'

where I assumed that:
1) $5 is either PRD or KRD
2) $7 is a number or blank
3) $8 is always APHSIP
4) $9 and $10 are EXECUTE and NONE or blank
5) $14 is either NAME=... with no digits in it, or a single word
6) $15 is a number or at least starts with a digit

You may need to make the expression more general based on what you know about the input. Unless you can make the input file more regular and predictable, it will be very difficult to find an efficient and reliable solution.

Linux_Kidd · 04-19-2012, 08:24 PM

Quote:

Originally Posted by grail

No, not \s+, but the exact example I provided:

Code:

FS=" "

That is a single space between the quotes.

hmmm, so default FS is diff from FS=" "
default includes \s+
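This is easy to check on any awk. As far as the awk documentation goes, FS=" " is the default value itself, and it is the single-space regex "[ ]" that behaves differently. A minimal check, using a made-up line with two leading spaces, a double space and a tab:

```shell
printf '  a  b\tc\n' | awk '{ print NF }'                        # default FS: prints 3
printf '  a  b\tc\n' | awk 'BEGIN { FS = " " } { print NF }'     # same as default: prints 3
printf '  a  b\tc\n' | awk 'BEGIN { FS = "[ ]" } { print NF }'   # one space per separator: prints 5
```

With "[ ]" the leading spaces and the double space each produce an empty field, and the tab is no longer a separator, which is why the count differs.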