LinuxQuestions.org
Old 04-19-2012, 10:43 AM   #1
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Rep: Reputation: 78
Awk with missing fields


so, another script for me using gnu awk v3.1.5.

i have input files where NF may vary, and, fields $7 $9 and $10 may be blank (hence it looks like one big space between $8 and $11). how to handle this in awk?

the reason why NF may vary is because one field in the file may look like this "name=xyz" or "name=joe doe"
 
Old 04-19-2012, 11:07 AM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

If your fields are separated by a single space, you can do
Code:
$ echo '1 2  3 4' | awk -F'[ ]' '{print "["$4"]"}'
[3]
If you set FS to a single space (FS=" ", which is also the default), then by convention any run of one or more whitespace characters counts as one field separator, so empty fields disappear.
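To make the difference concrete, here is a small check that should behave the same on any POSIX awk, gawk 3.1.5 included. The sample line and the variable names are made up for illustration.

```shell
line='1 2  3 4'   # two spaces between 2 and 3

# Default FS (a single space): runs of blanks collapse into one separator.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Bracket expression: every single space is a separator, so the
# double space produces an empty field 3.
bracket_nf=$(printf '%s\n' "$line" | awk -F'[ ]' '{ print NF }')
empty_field=$(printf '%s\n' "$line" | awk -F'[ ]' '{ print "[" $3 "]" }')

echo "default FS: $default_nf fields"
echo "FS='[ ]':   $bracket_nf fields, \$3 is $empty_field"
```

The default setting reports 4 fields; the bracket expression reports 5, with an empty third field.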
 
Old 04-19-2012, 11:08 AM   #3
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389
Can you give an example of the input you have and output you want?
Are all the fields in double quotes?
I think setting the FS variable in awk intelligently may do most of the work for you.
 
Old 04-19-2012, 11:53 AM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Most likely (and hopefully) the actual field separator in the input file is TAB. Please post it (or part of it) as requested, using CODE tags to preserve spacing. Thank you.
 
Old 04-19-2012, 12:58 PM   #5
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
ok, here's sample from windows txt file. i didnt look at it in hex yet.

some knowns about the data:
what seems consistent (they always exist, referencing my output below) is $1$2$3$4$5$6$8$11$12
...and $11 $12 always start with a *

Code:
 11/28/11 06:52:15 PEEL BANNANA     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 SYS
 11/28/11 06:52:15 PEEL ORANGE     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 IC
 11/28/11 06:52:15 PEEL APPLE     PRD 2 F    APHSIP                     *1C*-99 INI   NAME=CN                    53510 NET
 11/28/11 08:03:46 PEEL FRUIT PRD 2 F 01 APHSIP                     *08*-09 INI   NAME=joe doe     53510 M058
 11/28/11 09:31:17 PEEL GRAPES KRD 2 F 01 APHSIP   EXECUTE  NONE     *08*-88     > DTPI                         53510 M071
output is 16 fields pipe delimited, like this:
Code:
11/28/11|06:52:15|PEEL|BANNANA|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|SYS
11/28/11|06:52:15|PEEL|ORANGE|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|IC
11/28/11|06:52:15|PEEL|APPLE|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|NET
11/28/11|08:03:46|PEEL|FRUIT|PRD|2 F|01|APHSIP|||*08|*-09|INI|NAME=joe doe|53510|M058
11/28/11|09:31:17|PEEL|GRAPES|KRD|2 F|01|APHSIP|EXECUTE|NONE|*08|*-88|>|DTPI|53510|M071

Last edited by Linux_Kidd; 04-19-2012 at 01:08 PM.
 
Old 04-19-2012, 01:18 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
firstfire did put you on the right path, but your data is not uniform. If a single space (or anything else, for that matter) were the delimiter, then $5, which you say is always there, should be PRD or KRD on every line of the example. It is not, because the number of spaces between fields varies, so there is no consistent delimiter. In my opinion this makes it a lot more difficult. I would suggest that you are now left with treating each field as having a specific length (i.e. field 1 is 8 characters long) and trying to split the data on that basis.
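A minimal sketch of the fixed-width idea, using only substr() so it should run on gawk 3.1.5 or any POSIX awk (gawk also offers a FIELDWIDTHS variable for the same job). The column positions and the function name split_fixed are hypothetical; the real positions would have to be measured from the actual file, and it only helps if the columns really do line up.

```shell
# Assumed layout (hypothetical): cols 1-8 date, 10-17 time,
# 19-26 name, 28-29 optional code, 31 to end of line: tag.
split_fixed() {
  awk '
  # Strip leading and trailing blanks from one column.
  function trim(s) { gsub(/^ +| +$/, "", s); return s }
  {
    out = trim(substr($0, 1, 8))
    out = out "|" trim(substr($0, 10, 8))
    out = out "|" trim(substr($0, 19, 8))
    out = out "|" trim(substr($0, 28, 2))   # stays empty when the column is blank
    out = out "|" trim(substr($0, 31))
    print out
  }'
}

# Example with a blank optional column:
printf '%-8s %-8s %-8s %-2s %s\n' 11/28/11 06:52:15 ORANGE '' APHSIP | split_fixed
```

A blank column comes out as an empty field between two pipes, which is the behaviour the whitespace-splitting approaches cannot give.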
 
Old 04-19-2012, 01:31 PM   #7
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
yep, this is a pita, certainly a good newb problem to solve, but its just ascii(hex) and we can always manipulate that, etc. this is the data to work with, the only fields that have consistent constant length (if they exist) is $1 $2 $6 $11 and $12 (referencing my output), all others can vary in length, etc.

i'll ask if the txt files can be generated using a better delimiter like a "|" char.

i could sed the data 1st, replacing every \s+ with \s, but then how to determine which fields are actually missing? i'm just trying to help the crew here turn another human-heavy process into an automated one, really has nothing to do with security. if i can't solve it today i'll likely leave it for someone else, which likely means it will remain a human-heavy process.
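One possible way to tell which fields are missing without leaving awk is to anchor on the "*xx*yyy" token that is always present, then count the tokens on either side of it. The sketch below is only that, a sketch: it assumes $7 is numeric when present, that $9 and $10 appear together, that $14 is the only field that may contain a space, and that the last two tokens are always $15 and $16. The function name reformat_log is made up. It uses nothing newer than match(), split() and substr(), so gawk 3.1.5 should cope.

```shell
reformat_log() {
  awk '
  # Only handle lines that contain the fixed-length anchor, e.g. *1C*-99.
  match($0, /\*..\*.../) {
    left  = substr($0, 1, RSTART - 1)
    right = substr($0, RSTART + RLENGTH)
    f11 = substr($0, RSTART, 3)
    f12 = substr($0, RSTART + 3, 4)

    # Splitting on " " collapses runs of blanks, like the default FS.
    n = split(left, t, " ")
    m = split(right, r, " ")

    # Left side: t[6] t[7] is "2 F"; the optional fields follow.
    k = 8; f7 = f9 = f10 = ""
    if (t[k] ~ /^[0-9]+$/) f7 = t[k++]
    f8 = t[k++]
    if (k <= n) f9 = t[k++]
    if (k <= n) f10 = t[k++]

    # Right side: first token is $13, last two are $15 and $16,
    # whatever sits in between is $14 (may hold a space).
    f14 = ""
    for (i = 2; i <= m - 2; i++) f14 = f14 (i > 2 ? " " : "") r[i]

    print t[1] "|" t[2] "|" t[3] "|" t[4] "|" t[5] "|" t[6] " " t[7] "|" f7 "|" f8 "|" f9 "|" f10 "|" f11 "|" f12 "|" r[1] "|" f14 "|" r[m-1] "|" r[m]
  }'
}
```

It would be run as reformat_log < input.txt, where input.txt stands for the real log. A file that came from Windows will probably carry a carriage return on the last field, so stripping it first (for example with tr -d '\r') is likely needed.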

Last edited by Linux_Kidd; 04-19-2012 at 01:40 PM.
 
Old 04-19-2012, 01:42 PM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
gawk, from v4 (I believe), has a new FPAT/patsplit feature, for defining fields based on regex patterns. It might help you work this one out.


http://www.gnu.org/software/gawk/man...ing-By-Content
 
Old 04-19-2012, 01:56 PM   #9
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
thnx Dave,
stuck with gnu 3.1.5 for now

well, if i sed -i 's/\s/|/g' my data file i get something that may be usable as the new output has constant NF, and data is in predictable field locations.

i should be able to get it to work, need to sed 1st, then awk it. will let you know.

as example, i can do an if statement like: if ($10 == "") print $9, $13; else print $10, $11

tricky, but seems doable.
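A tiny illustration of that kind of test on pipe-delimited input. The field numbers, the sample records and the function name classify are invented for the example; the point is only that with -F'|' an empty field stays in place and compares equal to "".

```shell
classify() {
  # With "|" as FS, adjacent pipes produce an empty field instead of vanishing.
  awk -F'|' '{ if ($3 == "") print $2, "missing"; else print $2, $3 }'
}

printf '%s\n' 'x|a||z' 'x|a|b|z' | classify
```

Note the == for comparison; a single = would assign the empty string to the field instead of testing it.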

after sed:
Code:
|11/25/11|07:01:17|PEEL|CT|||||PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=CT||||||||||||||||||||47736|
S30
|11/25/11|07:01:17|PEEL|CT|||||PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=CT||||||||||||||||||||47736|
VMC
|11/25/11|07:46:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*0C*-0A|INI|||NAME=xxx|yyy||||||||47736|
N438
|11/25/11|07:57:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*0C*-0A|INI|||NAME=xxx|yyy||||||||47736|
N396
|11/25/11|08:00:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
N309
|11/25/11|08:00:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
N309
|11/25/11|08:06:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||47736|
M366
|11/25/11|08:09:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||47736|
N430
|11/25/11|08:11:48|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy||||||||||47736|
M118
|11/25/11|08:12:48|PEEL|REM|REM|2|F|01|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:15:48|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||||47736|
N455
|11/25/11|08:22:18|PEEL|REM|REM|2|F|02|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:28:18|PEEL|REM|REM|2|F|03|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:32:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
M398
|11/25/11|08:40:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy||||||||47736|
M236
|11/25/11|08:40:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*10*-0B|INI|||NAME=xxx|yyy||||||47736|
N075
|11/25/11|08:41:48|PEEL|REM|REM|2|F|04|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432

Last edited by Linux_Kidd; 04-19-2012 at 02:21 PM.
 
Old 04-19-2012, 03:28 PM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
The sed is not required, as it does the same job as the solution provided by firstfire, i.e. using FS = "[ ]" (a single space inside a bracket expression) makes awk see the same fields as what you created with sed.
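A quick way to see that equivalence, assuming the separator is the bracket expression [ ] from firstfire's post. The sample line is made up, and the sed here replaces plain spaces only (not tabs) to keep it portable.

```shell
line='A  B C'   # two spaces between A and B

# sed: turn every single space into a pipe.
via_sed=$(printf '%s\n' "$line" | sed 's/ /|/g')

# awk: split on every single space, then rebuild the record with OFS.
# The assignment to $1 forces awk to rejoin the fields using OFS.
via_awk=$(printf '%s\n' "$line" | awk -F'[ ]' -v OFS='|' '{ $1 = $1; print }')

echo "sed: $via_sed"
echo "awk: $via_awk"
```

Both should print A||B|C, so the sed pass can be dropped and the splitting done inside awk.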
 
Old 04-19-2012, 03:36 PM   #11
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
FS of space is \s+ (is this correct, one or 40 spaces is considered a single FS?)
and this would not produce the same NF as sed did.
sed at least gave me constant NF

some fields seem to be predictable in max size, 8char max. i am off for a few days, will look at it next week. thnx.

Last edited by Linux_Kidd; 04-19-2012 at 03:37 PM.
 
Old 04-19-2012, 03:56 PM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
No, not \s+, but a separator that matches exactly one space:
Code:
FS="[ ]"
That is a single space inside a bracket expression. A bare FS=" " is awk's default and splits on runs of whitespace.
 
Old 04-19-2012, 04:27 PM   #13
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389
This is a very clumsy and ugly solution, but you may be able to create a regex that matches the whole line. For example, an expression like this one could work for your input as in post #5:

Code:
perl -pe 's/\s*([0-9\/]+)\s+([0-9:]+)\s+(\w+)\s+(\w+)\s+([PK]RD)\s+2 F\s+([0-9]+)*\s+APHSIP\s+(EXECUTE)*\s+(NONE)*\s+(\*..)\s*(\*...)\s+(\S+)\s+(NAME=[^0-9]+|\w+)\s*(\w+)\s+(\w+).*/$1|$2|$3|$4|$5|2 F|$6|APHSIP|$7|$8|$9|$10|$11|$12|$13|$14/'|sed -r 's/\s+\|/|/g'
where I assumed that:
1) $5 is either PRD or KRD
2) $7 is a number or blank
3) $8 is always APHSIP
4) $9 and $10 are EXECUTE and NONE or blank
5) $14 is either NAME=... with no digits in it, or a single word
6) $15 is a number or at least starts with a digit

You may need to make the expression more general based on what you know about the input. Unless you can make the input file more regular and predictable, it will be very difficult to find an efficient and reliable solution.
 
Old 04-19-2012, 07:36 PM   #14
fantasy1215
Member
 
Registered: Oct 2011
Posts: 75

Rep: Reputation: Disabled
Sorry, I posted in the wrong thread!

Last edited by fantasy1215; 04-19-2012 at 07:37 PM.
 
Old 04-19-2012, 08:24 PM   #15
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
Quote:
Originally Posted by grail View Post
No, not \s+, but a separator that matches exactly one space:
Code:
FS="[ ]"
That is a single space inside a bracket expression.
hmmm, so default FS is diff from FS="[ ]"
default (FS=" ") acts like \s+
 
  


All times are GMT -5.
