04-19-2012, 10:43 AM
#1
Member
Registered: Jan 2006
Location: USA
Posts: 746
Awk with missing fields
so, another script for me using gnu awk v3.1.5.
i have input files where NF may vary, and, fields $7 $9 and $10 may be blank (hence it looks like one big space between $8 and $11). how to handle this in awk?
the reason why NF may vary is because one field in the file may look like this "name=xyz" or "name=joe doe"

04-19-2012, 11:07 AM
#2
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709
Hi.
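If your fields are separated by a single space, you can do
Code:
$ echo '1 2  3 4' | awk -F'[ ]' '{print "["$4"]"}'
[3]
The two spaces between 2 and 3 enclose an empty third field, so $4 is 3. If you set FS to a plain single space instead (FS=" ", which is also the default), then, by convention, the field separator is one or more whitespace characters.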
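As a quick check of that difference, here is a small sketch (not from the original post; the variable names are made up for illustration) that splits the same line three ways:

```shell
#!/bin/sh
line='1 2  3 4'    # two spaces between 2 and 3

# Default FS: a run of whitespace counts as one separator.
default_out=$(printf '%s\n' "$line" | awk '{ print NF, "[" $4 "]" }')

# FS set explicitly to one plain space: behaves like the default.
space_out=$(printf '%s\n' "$line" | awk -F' ' '{ print NF, "[" $4 "]" }')

# FS as the regex [ ]: every single space is a separator.
regex_out=$(printf '%s\n' "$line" | awk -F'[ ]' '{ print NF, "[" $4 "]" }')

echo "default: $default_out"   # default: 4 [4]
echo "FS=' ':  $space_out"     # FS=' ':  4 [4]
echo "FS=[ ]:  $regex_out"     # FS=[ ]:  5 [3]
```

Only the bracketed form keeps empty fields, which is what matters when a blank column has to stay in its position.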

04-19-2012, 11:08 AM
#3
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852
Can you give an example of the input you have and output you want?
Are all the fields in double quotes?
I think setting the FS variable in awk intelligently may do most of the work for you.

04-19-2012, 11:53 AM
#4
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
Most likely (and hopefully) the actual field separator in the input file is TAB. Please post it (or part of it) as requested, using CODE tags to preserve spacing. Thank you.

04-19-2012, 12:58 PM
#5
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
ok, here's sample from windows txt file. i didnt look at it in hex yet.
some knowns about the data:
what seems consistent (they always exist, referencing my output below) is $1 $2 $3 $4 $5 $6 $8 $11 $12
...and $11 $12 always start with a *
Code:
11/28/11 06:52:15 PEEL BANNANA PRD 2 F APHSIP *1C*-99 INI NAME=CN 53510 SYS
11/28/11 06:52:15 PEEL ORANGE PRD 2 F APHSIP *1C*-99 INI NAME=CN 53510 IC
11/28/11 06:52:15 PEEL APPLE PRD 2 F APHSIP *1C*-99 INI NAME=CN 53510 NET
11/28/11 08:03:46 PEEL FRUIT PRD 2 F 01 APHSIP *08*-09 INI NAME=joe doe 53510 M058
11/28/11 09:31:17 PEEL GRAPES KRD 2 F 01 APHSIP EXECUTE NONE *08*-88 > DTPI 53510 M071
output is 16 fields pipe delimited, like this:
Code:
11/28/11|06:52:15|PEEL|BANNANA|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|SYS
11/28/11|06:52:15|PEEL|ORANGE|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|IC
11/28/11|06:52:15|PEEL|APPLE|PRD|2 F||APHSIP|||*1C|*-99|INI|NAME=CN|53510|NET
11/28/11|08:03:46|PEEL|FRUIT|PRD|2 F|01|APHSIP|||*08|*-09|INI|NAME=joe doe|53510|M058
11/28/11|09:31:17|PEEL|GRAPES|KRD|2 F|01|APHSIP|EXECUTE|NONE|*08|*-88|>|DTPI|53510|M071
Last edited by Linux_Kidd; 04-19-2012 at 01:08 PM.

04-19-2012, 01:18 PM
#6
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
firstfire did put you on the right path, but your data is not uniform. If we could assume that a space (or anything, for that matter) were the delimiter, then $5, which you have said is always there and have shown in the example, would be either PRD or KRD; and yet there is no consistency to the delimiter. In my opinion this makes it a lot more difficult. I would suggest that you are now left with saying that each field has a specific length (i.e. field 1 is 8 characters long) and trying to split the data based on this principle.
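A minimal sketch of that fixed-width idea, assuming the columns really are padded to constant widths. The widths below and the helper name fixed_split are made up for illustration; the real widths would have to be measured from the file, since the forum formatting does not preserve the original spacing. It uses only substr(), so it should not need anything newer than gawk 3.1.5:

```shell
#!/bin/sh
# fixed_split (hypothetical helper): cut each input line into columns of
# the given widths and print the trimmed columns joined by "|".
fixed_split() {
    awk -v widths="$1" '
    BEGIN { n = split(widths, w, " ") }
    {
        pos = 1; out = ""
        for (i = 1; i <= n; i++) {
            field = substr($0, pos, w[i])
            gsub(/^ +| +$/, "", field)       # trim the padding
            sep = (i > 1) ? "|" : ""
            out = out sep field
            pos += w[i]
        }
        print out
    }'
}

# Build a padded sample line with assumed widths 9 9 5 8 4 4 3 6;
# the seventh column is left blank on purpose.
printf '%-9s%-9s%-5s%-8s%-4s%-4s%-3s%s\n' \
    11/28/11 06:52:15 PEEL BANNANA PRD '2 F' '' APHSIP |
    fixed_split '9 9 5 8 4 4 3 6'
# prints: 11/28/11|06:52:15|PEEL|BANNANA|PRD|2 F||APHSIP
```

The blank seventh column survives as an empty field because position, not whitespace, decides where each field starts.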

04-19-2012, 01:31 PM
#7
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
yep, this is a pita, certainly a good newb problem to solve, but its just ascii(hex) and we can always manipulate that, etc. this is the data to work with, the only fields that have consistent constant length (if they exist) is $1 $2 $6 $11 and $12 (referencing my output), all others can vary in length, etc.
i'll ask if the txt files can be generated using a better delimiter like a "|" char.
i could sed the data 1st, replacing every \s+ with \s, but then how to determine which fields are actually missing? i'm just trying to help the crew here turn another human-heavy process into an automated one; it really has nothing to do with security. if i cant solve it today i'll likely leave it for someone else, which likely means it will remain a human-heavy process.
Last edited by Linux_Kidd; 04-19-2012 at 01:40 PM.

04-19-2012, 01:42 PM
#8
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
gawk, from v4 (I believe), has a new FPAT/patsplit feature, for defining fields based on regex patterns. It might help you work this one out.
http://www.gnu.org/software/gawk/man...ing-By-Content

04-19-2012, 01:56 PM
#9
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
thnx Dave,
stuck with gnu 3.1.5 for now
well, if i sed -i 's/\s/|/g' my data file i get something that may be usable as the new output has constant NF, and data is in predictable field locations.
i should be able to get it to work, need to sed 1st, then awk it. will let you know.
as example, i can do an if statement like: if ($10 == "") print $9, $13; else print $10, $11
tricky, but seems doable.
after sed:
Code:
|11/25/11|07:01:17|PEEL|CT|||||PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=CT||||||||||||||||||||47736|
S30
|11/25/11|07:01:17|PEEL|CT|||||PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=CT||||||||||||||||||||47736|
VMC
|11/25/11|07:46:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*0C*-0A|INI|||NAME=xxx|yyy||||||||47736|
N438
|11/25/11|07:57:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*0C*-0A|INI|||NAME=xxx|yyy||||||||47736|
N396
|11/25/11|08:00:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
N309
|11/25/11|08:00:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
N309
|11/25/11|08:06:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||47736|
M366
|11/25/11|08:09:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||47736|
N430
|11/25/11|08:11:48|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy||||||||||47736|
M118
|11/25/11|08:12:48|PEEL|REM|REM|2|F|01|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:15:48|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy|||||||||47736|
N455
|11/25/11|08:22:18|PEEL|REM|REM|2|F|02|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:28:18|PEEL|REM|REM|2|F|03|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
|11/25/11|08:32:18|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*1C*-46|INI|||NAME=REM||||||||||||||||47736|
M398
|11/25/11|08:40:18|PEEL|REM|PRD|2|F|01|SIP|||||||||||||||||||||*08*-09|INI|||NAME=xxx|yyy||||||||47736|
M236
|11/25/11|08:40:48|PEEL|REM|PRD|2|F||||SIP|||||||||||||||||||||*10*-0B|INI|||NAME=xxx|yyy||||||47736|
N075
|11/25/11|08:41:48|PEEL|REM|REM|2|F|04|SIP|||EXECUTE||NONE|||||*08*-88|||||>|ANN|||||||||||||||||||||||||47736|
N432
Last edited by Linux_Kidd; 04-19-2012 at 02:21 PM.

04-19-2012, 03:28 PM
#10
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
The sed is not required, as it does the same thing as the solution provided by firstfire: using a single literal space as the separator (-F'[ ]', i.e. FS = "[ ]") is equivalent to what you created with sed.

04-19-2012, 03:36 PM
#11
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
FS of space is \s+ (is this correct, one or 40 spaces is considered a single FS?)
and this would not produce the same NF as sed did.
sed at least gave me constant NF
some fields seem to be predictable in max size, 8char max. i am off for a few days, will look at it next week. thnx.
Last edited by Linux_Kidd; 04-19-2012 at 03:37 PM.

04-19-2012, 03:56 PM
#12
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
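No, not \s+, but exactly the separator firstfire showed: FS = "[ ]".
That is a single literal space between the brackets. (A bare FS = " " is awk's default and does collapse runs of whitespace.)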

04-19-2012, 04:27 PM
#13
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852
This is a very clumsy and ugly solution, but you may be able to create a regex that matches the whole line. For example, an expression like this one could work for your input as in post #5:
Code:
perl -pe 's/\s*([0-9\/]+)\s+([0-9:]+)\s+(\w+)\s+(\w+)\s+([PK]RD)\s+2 F\s+([0-9]+)?\s*APHSIP\s+(EXECUTE)?\s*(NONE)?\s*(\*..)\s*(\*...)\s+(\S+)\s+(NAME=[^0-9]+|\S+)\s*(\w+)\s+(\w+).*/$1|$2|$3|$4|$5|2 F|$6|APHSIP|$7|$8|$9|$10|$11|$12|$13|$14/'|sed -r 's/\s+\|/|/g'
where I assumed that:
1) $5 is either PRD or KRD
2) $7 is a number or blank
3) $8 is always APHSIP
4) $9 and $10 are EXECUTE and NONE or blank
5) $14 either starts with NAME= and contains no digits, or is a single word
6) $15 is a number or at least starts with a digit
You may need to make the expression more general based on what you know about the input. Unless you can make the input file more regular and predictable, it will be very difficult to find an efficient and reliable solution.
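The same idea can be sketched in plain awk, which should also run on gawk 3.1.5. This is only a sketch under assumptions taken from the sample in post #5, not a tested solution for the real files: $1 to $5 are single words, $6 is always two words such as "2 F", $7 is purely numeric when present, the token holding $11 and $12 (like *1C*-99) is always there, and $13, $15 and $16 are single words that always exist. The helper name to_pipes is made up.

```shell
#!/bin/sh
# to_pipes (hypothetical helper): read log lines on stdin and write the
# 16 pipe-delimited fields described in post #5.
to_pipes() {
    awk '
    BEGIN { OFS = "|" }
    {
        sub(/\r$/, "")                       # drop a Windows carriage return

        # Locate the token holding $11 and $12, e.g. "*1C*-99".
        star = 0
        for (i = 8; i <= NF; i++)
            if ($i ~ /^\*..\*/) { star = i; break }
        if (star == 0 || NF < star + 3) {    # not the expected layout
            print "skipped: " $0 > "/dev/stderr"
            next
        }

        # Between "2 F" (tokens 6 and 7) and the star token:
        # an optional number ($7), then $8, then optional $9 and $10.
        f7 = ""; f9 = ""; f10 = ""
        k = 8
        if ($k ~ /^[0-9]+$/) { f7 = $k; k++ }
        f8 = $k; k++
        if (k < star) { f9 = $k; k++ }
        if (k < star) { f10 = $k; k++ }

        # After the star token: $13, then $14 (may contain spaces),
        # then $15 and $16 as the last two tokens.
        f14 = ""
        for (i = star + 2; i <= NF - 2; i++)
            f14 = (f14 == "") ? $i : f14 " " $i

        print $1, $2, $3, $4, $5, $6 " " $7, f7, f8, f9, f10,
              substr($star, 1, 3), substr($star, 4), $(star + 1),
              f14, $(NF - 1), $NF
    }'
}

printf '%s\n' \
    '11/28/11 08:03:46 PEEL FRUIT PRD 2 F 01 APHSIP *08*-09 INI NAME=joe doe 53510 M058' |
    to_pipes
# prints: 11/28/11|08:03:46|PEEL|FRUIT|PRD|2 F|01|APHSIP|||*08|*-09|INI|NAME=joe doe|53510|M058
```

Anchoring on the * token and on the last two words means the blank columns never have to be detected by their spacing; lines that do not fit the layout are reported on stderr instead of being guessed at.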

04-19-2012, 07:36 PM
#14
Member
Registered: Oct 2011
Posts: 75
Sorry, I posted in the wrong thread!
Last edited by fantasy1215; 04-19-2012 at 07:37 PM.

04-19-2012, 08:24 PM
#15
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Quote:
Originally Posted by grail
No, not \s+, but exactly the separator firstfire showed: FS = "[ ]".
That is a single literal space between the brackets.

hmmm, so default FS is diff from FS="[ ]"
default (same as FS=" ") includes \s+
All times are GMT -5.