|
 |
|
04-20-2012, 02:15 AM
|
#16
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
hmmm ... well, looks like I learnt something today too ... if you set FS to a single space as I have done, awk still uses the default splitting (on runs of whitespace).
However, again back to firstfire who has the correct format:
To see it in action, we can generate the same output as the sed using:
Code:
awk -F"[ ]" '$NF=$NF' OFS="|" file
Normally I would use $1=$1, but as the first field is blank in all the examples this does not work, so the only assurance we have is that there will be a final field.
I found that with the current example, though, it is still not particularly useful as the number of fields does vary:
Code:
$ awk 'BEGIN{FS="[ ]";OFS="|"}{print NF}' testfile
62
62
62
42
52
So you can try and play with that or as suggested by millgates, try and find a regex that matches what a line looks like.
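The difference between FS=" " and FS="[ ]" described in this post can be seen with a small sketch. The sample line is made up for illustration (it is not from the report), and $1=$1 is used here only because the sample is short enough to follow by eye:

```shell
# Hypothetical sample line: one leading space, and two spaces between "a" and "b".
line=' a  b c'

# FS=" " is the default: leading blanks are skipped, runs of blanks count once.
default_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = " " } { print NF }')

# FS="[ ]" is a regex for exactly one space, so every single space splits.
single_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ ]" } { print NF }')

# Assigning to a field makes awk rebuild the record with OFS between fields.
piped=$(printf '%s\n' "$line" | awk -F'[ ]' -v OFS='|' '{ $1 = $1; print }')

echo "default FS: $default_nf fields"
echo "FS=[ ]: $single_nf fields"
echo "$piped"
```

With FS="[ ]" the leading space and the doubled space each produce an empty field, which is why the field count changes from line to line in the real data.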
|
|
|
04-20-2012, 02:26 AM
|
#17
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,673
|
I would parse the original file from right to left; if APHSIP will always be there you can do it easily.
$NF and $(NF-1) can be identified easily, so cut them off first. The next step is to find *digits*something and split into 4 fields, and finally you can split the rest of the string from the beginning.
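A minimal sketch of the first part of that idea, peeling fields off both ends and leaving the unpredictable middle alone. The input line is a hypothetical one modelled on the sample data in this thread, and treating the first two and last two fields as reliable is an assumption:

```shell
# Hypothetical line modelled on the sample data posted in this thread.
line='08/18/11 07:53:57 BEAS ACTRYTXT DOCPRUSS 2 F PLATJX *0C*-0A PDD NAME=PAHANY JAWFIK 64476 N329'

split_ends=$(printf '%s\n' "$line" | awk '
  NF >= 5 {
    # join everything between the two leading and the two trailing fields
    middle = $3
    for (i = 4; i <= NF - 2; i++) middle = middle " " $i
    print $1 "|" $2 "|" middle "|" $(NF-1) "|" $NF
  }')
echo "$split_ends"
```

The middle part would still need its own rules (the *digits* pattern mentioned above), which this sketch does not attempt.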
|
|
|
04-20-2012, 02:59 AM
|
#18
|
LQ Guru
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733
|
It looks to me that you started with a tab-separated file, and the tabs were expanded to spaces, either earlier or when you posted. In fields where the entry crosses a tab stop, the next field was printed at the next tab stop, resulting in the uneven appearance. Look at a line with "od" and see if that is the case. It will be much easier to retain the tabs than to work with a file that has the extra spaces.
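A quick way to run that check, as a sketch on made-up input (the two sample lines are hypothetical; in practice you would feed a line of the real report to od instead):

```shell
# Hypothetical two-line sample: the first line uses a real tab, the second only spaces.
sample=$(printf 'BEAS\tRSNL\nBEAS    RSNL\n')

# od -c prints each byte; a tab shows up as \t, a space as a blank column.
dump=$(printf '%s\n' "$sample" | od -c)
echo "$dump"

# Count the lines that contain at least one tab character.
tab=$(printf '\t')
tab_lines=$(printf '%s\n' "$sample" | grep -c "$tab")
echo "lines with tabs: $tab_lines"
```

If the count is zero for the real file, the tabs were already gone before the file reached the script.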
|
|
|
04-25-2012, 07:08 AM
|
#19
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
it's all space delimited.
a new approach is to put a delimiter after every char, so that every char becomes its own field; then i can group my needed output fields by field #'s, like "date = $1$2$3$4$5$6$7$8" "time = $9$10$11$12$13$14$15$16" etc, then print out these variables. i found one oddity in field #'s but i can use an if statement to find that, etc. i'll post my solution.
Last edited by Linux_Kidd; 04-25-2012 at 07:30 AM.
Reason: new approach
|
|
|
04-25-2012, 07:42 AM
|
#20
|
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852
|
May I ask what program / script generated your input file and if possible, can you show us the code responsible for its format? Knowing the algorithm of how the input file was generated could help us make some assumptions about the format that will make the job easier.
|
|
|
04-25-2012, 08:35 AM
|
#21
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
the report(s) come from CA Top Secret. i don't know if CA TS reports can be output in different ways, but currently i get these reports from existing report crons, so they will not be changed. i clean up the data and normalize the output to a constant number of chars per line like this:
Code:
#!/bin/bash
# written by me
# this script is called from "/var/scripts/audit.sh"
awk '
BEGIN {
OFS="|";
FS=" ";
}
{
if ( NF == 0 || $1 ~ /^(poop|fun|[-+=]|0$|1\/|\/\/|TS|PASSWORD|DYNAMIC|1E)/ ) {}
else {
print $0;}
} ' | sed 's/^0\(.*\)/\1/' | sed 's/\s/|/g' | sed 's/^|\(.*\)/\1/'
i will then replace every non-pipe char with char| (the char followed by a pipe) and then run that back through awk, grouping my desired "fields" into variables, then print out the variables, etc.
this should work rather nicely.
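What the "char|" step does to a line can be seen on a tiny made-up string. This is only a sketch: the function name explode is hypothetical, and [[:space:]] is used instead of \s so that it does not depend on GNU sed:

```shell
# Turn every whitespace char into "|" and append "|" to every other char,
# so that field N of the result corresponds to character N of the input.
explode() {
  sed 's/[[:space:]]/|/g' | sed 's/[^|]/&|/g'
}

exploded=$(printf 'ab c\n' | explode)
echo "$exploded"

# Fields 1 and 2 are "a" and "b", field 3 is the (now empty) space, field 4 is "c".
regrouped=$(printf '%s\n' "$exploded" | awk -F'|' '{ print $1 $2 "-" $4 }')
echo "$regrouped"
```

So a space ends up as an empty field, and gluing field numbers back together is really selecting character positions.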
|
|
|
04-25-2012, 09:58 AM
|
#22
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
Well my only addition would be that the seds are not needed as awk can do all the tasks.
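As a rough sketch of what that could look like, here the three seds are folded into the awk itself. The function name clean_report and the shortened junk pattern are assumptions for illustration; the real filter list is the one in the script above:

```shell
# Sketch: filter junk lines and do the three sed substitutions inside awk.
clean_report() {
  awk '
    # skip blank lines and (hypothetical, shortened) junk lines
    NF == 0 || $1 ~ /^(\/\/|TS|PASSWORD)/ { next }
    {
      sub(/^0/, "")         # same job as: sed s/^0\(.*\)/\1/
      gsub(/[ \t]/, "|")    # same job as: sed s/\s/|/g
      sub(/^[|]/, "")       # same job as: sed s/^|\(.*\)/\1/
      print
    }'
}

cleaned=$(printf '008/18/11 06:54:26 BEAS\n 08/18/11 06:54:26 BEAS\n\nTS REPORT\n' | clean_report)
echo "$cleaned"
```

Keeping the substitutions as separate statements still lets each one be commented out on its own while debugging.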
|
|
|
04-25-2012, 10:08 AM
|
#23
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
yeah, but i am in debug mode, so having separate seds helps me find formatting issues. i'll post my solution, then you can make it elegant. thnx.
|
|
|
04-25-2012, 01:25 PM
|
#24
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
ok, so i got it down to this. works like a charm; however, the very last line of the output is 16 blank fields with pipe delimiters, and i don't know why. the file looks like its last line is empty; hexdump shows this as the last two entries:
Code:
0007070 2f31 3033 0a0a
0007076
Code:
#!/bin/bash
awk '
BEGIN {
OFS="|";
}
{
if ( NF == 0 || $0 ~ /(\/\/|match junk to be ignored)/ ) {}
else {
print $0 ;}
} ' | sed 's/^[|0]\(.*\)/\1/' | sed 's/\s/|/g' | sed 's/^|\(.*\)/\1/' | sed 's/[^|]/&|/g' |
awk '
BEGIN {
OFS="|";
FS="|";
}
{
space = " "
date = $1$2$3$4$5$6$7$8
time = $10$11$12$13$14$15$16$17
field3 = $19$20$21$22
field4 = $24$25$26$27$28$29$30$31
field5 = $33$34$35$36$37$38$39$40
field6 = $42space$44
field7 = $46$47
field8 = $49$50$51$52$53$54$55$56
field9 = $58$59$60$61$62$63$64$65
field10 = $67$68$69$70$71$72$73$74
field11 = $76$77$78
field12 = $79$80$81$82
field13 = $84$85$86$87$88
field14 = $90$91$92$93$94$95$96$97$98$99$100$101$102$103$104$105$106$107$108$109$110$111$112$113$114$115$116$117
field15 = $119$120$121$122$123
field16 = $125$126$127$128
print date,time,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16;
} '
Last edited by Linux_Kidd; 04-25-2012 at 01:30 PM.
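A note on reading that hexdump, as a sketch under assumptions: plain hexdump prints 16-bit words, so on a little-endian machine "2f31 3033 0a0a" would be the bytes 31 2f 33 30 0a 0a, that is "1/30" followed by two newlines. Printing one byte at a time avoids the swapped view; the input below is made up to end the same way:

```shell
# Hypothetical tail of a file: the text "1/30", its newline, then one extra newline.
tail_bytes=$(printf '1/30\n\n' | od -An -tx1 | tr -s ' \n' ' ')
echo "bytes:$tail_bytes"

# Count the lines on which awk (default FS) finds no fields at all.
blank_lines=$(printf 'x 1/30\n\n' | awk 'NF == 0 { n++ } END { print n + 0 }')
echo "blank lines: $blank_lines"
```

Two consecutive 0a bytes at the end mean the file finishes with one empty line.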
|
|
|
04-25-2012, 01:46 PM
|
#25
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
Are you sure the last line is empty and does not contain spaces or maybe tabs?
Basically the last awk will output 16 fields no matter what the data is as there is no test for it not to print.
So the only conclusion I can draw is that the previous awk / sed combination is producing a blank line for the last one.
Of course, a simple solution is to put NF before the opening brace in the second awk; then only lines with data will be processed.
|
|
|
04-25-2012, 01:53 PM
|
#26
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
but how would anything get past the 1st awk's "if NF == 0"? nothing should get piped if NF = 0 (doesn't a blank line mean NF = 0 ??)
and, you mean add an "if NF == 0" to the 2nd awk?
edit: so i added an "if NF == 0" to the 2nd awk, and that fixes the blank fields issue, but where does the blank line come from?
Last edited by Linux_Kidd; 04-25-2012 at 02:02 PM.
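On the NF question, a small sketch with made-up input shows how NF behaves for empty and whitespace-only lines under the two separators used in the script. The helper name count_nf is hypothetical:

```shell
# Print NF for every input line, using the separator passed as the first argument.
count_nf() {
  awk -v FS="$1" '{ print NF }'
}

# Three hypothetical lines: empty, three spaces, and "a|b".
nf_default=$(printf '\n   \na|b\n' | count_nf ' ')
nf_pipe=$(printf '\n   \na|b\n' | count_nf '|')

echo "default FS:"
echo "$nf_default"
echo "FS=|:"
echo "$nf_pipe"
```

With the default FS both the empty and the spaces-only line give NF = 0, while with FS="|" only the truly empty line does, so the same test can behave differently in the first and second awk.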
|
|
|
04-26-2012, 04:52 AM
|
#27
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
Well I do not know the data all that well but is it possible that you have a line with perhaps a 0 on it which the seds would then remove?
I would like to bring back a previous question I had about separating the field by length. Previously you said this might not be possible as fields may be of alternating lengths.
My issue now is that by using every field and sticking them together as you have done, you have effectively said that all the fields sit at fixed positions, since you use a fixed set of field numbers to create your 16 fields.
I do notice that you skip some fields (not sure how you know they can be removed?), but ultimately you use all the available fields from 1 to 128.
Maybe this is something you can reinvestigate to make it simpler?
Oh and my suggestion was to simply put NF prior to last set of braces:
Code:
NF{
space = " "
date = $1$2$3$4$5$6$7$8
...
Last edited by grail; 04-26-2012 at 04:53 AM.
|
|
|
04-26-2012, 07:22 AM
|
#28
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
ok, it seems that the report file fields are written as fixed-width fields separated by a single space. if the field data does not take up the full width it is padded with spaces (hex 20), and the padding can come before and/or after the data (for example, one field is 5 chars wide, but in some cases the field is populated with h20h20>h20h20 while in other cases it is h20INAh20). i only noticed this after substituting every space with | and every char with char|; that revealed what look like fixed-width data fields whose data may not take up the full width, etc. the initial awk did not reveal this, because one of the fields was "poop" on one line while a few lines later the same field was "bananna", etc.
the field #'s that i skip are the hex20's that separate each data field. i looked at the raw data and the data as it passes through the script, and i don't see anywhere where there is a 0 by itself which sed would turn into a blank line.
since what i have works well i'll leave it as-is, but i am sure it could be made simpler.
Last edited by Linux_Kidd; 04-26-2012 at 07:27 AM.
|
|
|
04-26-2012, 10:18 AM
|
#29
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
If you want to tell me the field lengths and also which fields always exist I can produce a much easier way (I think)
I assume the data in post #5 is still good to test on?
Also, if there is any chance, I fully recommend upgrading to version 4+.
|
|
|
04-26-2012, 11:01 AM
|
#30
|
Member
Registered: Jan 2006
Location: USA
Posts: 746
Original Poster
Rep:
|
real sanitized data. the 1st line of data starts with a zero that needs to be stripped off, and all other lines start with h20 (space). in my script i strip out the zero and space before parsing the fields, etc.
Code:
008/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 SYP
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 ID
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 NEP
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 REXREPLY
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 REXTSYSA
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 REXT
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 S31
08/18/11 06:54:26 BEAS RSNL DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=RSNL 64476 VMD
08/18/11 07:53:27 BEAS REFGNVXJ DOCPRUSS 2 F PLATJX *1D*-49 PDD NAME=REFGNVXJ 64476 M078
08/18/11 07:53:57 BEAS ACTRYTXT DOCPRUSS 2 F PLATJX *0C*-0A PDD NAME=PAHANY JAWFIK 64476 N329
08/18/11 08:05:27 BEAS SVBENLXX DOCPRUSS 2 F 01 PLATJX *08*-09 PDD NAME=LIPING XU 64476 N238
08/18/11 08:09:57 BEAS RETELALA DOCPRUSS 2 F 01 PLATJX *08*-09 PDD NAME=ABEL TNIS 64476 N013
08/18/11 08:12:57 BEAS LOANSSXS DOCPRUSS 2 F 01 PLATJX *08*-09 PDD NAME=STACEY RENCEEP 64476 N424
08/18/11 08:17:57 BEAS CCNTRRXP CCNTRRXP 2 F 01 PLATJX EXECUTE NONE *08*-88 > CAPN 64476 M113
i do not know which fields have data in them, i only know the field width that the data can occupy.
so, i can only tell you what the field lengths are after stripping 0 and space as i said above, then replacing every space with | and every char with char|; i believe this yields a constant NF=132.
so the 16 output fields i need are:
(F means field, and there is a h20 between each of the 16 except at #6 where that space is retained in my field, etc)
Code:
1 = F1-F8 8char
2 = F10-F17 8char
3 = F19-F22 4char
4 = F24-F31 8char
5 = F33-F40 8char
6 = F42 h20 F44 2char
7 = F46-F47 2char
8 = F49-F56 8char
9 = F58-F65 8char
10= F67-F74 8char
11= F76-F78 3char
12= F79-F82 4char
13= F84-F88 5char
14= F90-F117 28char
15= F119-F123 5char
16= F125-F132 8char
NOTE#1: i just spotted an error, the last field can be 8 chars wide, i originally said 4 chars.
NOTE#2: the NAME field in my script is concatenated because names that use two strings are not predictable in length; this field can contain multiple strings for the actual name, etc.
Last edited by Linux_Kidd; 04-26-2012 at 11:37 AM.
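Given that table, one possible simplification is to cut the columns directly with substr() instead of exploding every char into its own field. This is only a sketch under assumptions: the function name cut_fixed is hypothetical, the test line is built with printf to match the table's widths (it is not a real report line), and unlike the concatenation approach it keeps spaces inside a field and trims only the padding:

```shell
# Cut fixed-width columns with substr(); the start/length pairs follow the table above.
cut_fixed() {
  awk '
    BEGIN {
      OFS = "|"
      # start length start length ... for the 16 output fields
      n = split("1 8 10 8 19 4 24 8 33 8 42 3 46 2 49 8 58 8 67 8 76 3 79 4 84 5 90 28 119 5 125 8", spec, " ")
    }
    NF {
      sub(/^[0 ]/, "")              # drop the leading 0 or space first
      out = ""
      for (i = 1; i < n; i += 2) {
        f = substr($0, spec[i], spec[i + 1])
        gsub(/^ +| +$/, "", f)      # trim the space padding
        out = out (i > 1 ? OFS : "") f
      }
      print out
    }'
}

# Hypothetical 132-column line (plus the leading space) laid out per the table.
line=$(printf ' %-8s %-8s %-4s %-8s %-8s %-3s %-2s %-8s %-8s %-8s %-3s%-4s %-5s %-28s %-5s %-8s' \
  '08/18/11' '08:05:27' 'BEAS' 'SVBENLXX' 'DOCPRUSS' '2 F' '01' 'PLATJX' '' '' \
  '*08' '*-09' 'PDD' 'NAME=LIPING XU' '64476' 'N238')
parsed=$(printf '%s\n' "$line" | cut_fixed)
echo "$parsed"
```

GNU awk also has a FIELDWIDTHS variable for this kind of input; substr() is used here so the sketch does not depend on a particular awk.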
|
|
|
All times are GMT -5. The time now is 09:14 AM.