LinuxQuestions.org
Old 04-20-2012, 02:15 AM   #16
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194

hmmm ... well, looks like I learnt something today too ... if you set FS to a single space as I have done, it still uses the default whitespace splitting.
However, again back to firstfire who has the correct format:
Code:
FS="[ ]"
To see it in action, we can generate the same output as the sed using:
Code:
awk -F"[ ]" '$NF=$NF' OFS="|" file
Normally I would use $1=$1, but as the first field is blank in all the examples this does not work, so the only assurance we have is that there will be a final field.
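
A minimal sketch of a variant that does not depend on the value of any field, assuming the goal is only to rebuild each line with the new OFS. When `$NF=$NF` is used as the pattern, the line is printed only if the assigned value is true, so a line whose last field is empty (for example a trailing space with FS="[ ]") or 0 is silently dropped. Doing the assignment inside an action and printing with a separate `1` pattern avoids that. The function name `rebuild` is made up for illustration.

```shell
# Hypothetical helper: rebuild every line with "|" between fields split on single spaces.
rebuild() {
    awk -F'[ ]' -v OFS='|' '{ $1 = $1 } 1' "$@"
}

# The trailing space makes the last field empty, yet the line is still printed.
printf 'a b \n' | rebuild    # a|b|
```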

I found that with the current example it is still not particularly useful, as the number of fields does vary:
Code:
$ awk 'BEGIN{FS="[ ]";OFS="|"}{print NF}' testfile
62
62
62
42
52
So you can try and play with that or as suggested by millgates, try and find a regex that matches what a line looks like.
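
For anyone following along, here is a small sketch of the difference between the two FS settings, on a made-up line with a leading space and a double space. The helper names are only for illustration.

```shell
# Hypothetical helpers: count fields with the default splitting and with a literal single space.
count_default() { awk -F' ' '{ print NF }' "$@"; }
count_literal() { awk -F'[ ]' '{ print NF }' "$@"; }

printf ' a  b\n' | count_default   # 2 fields: "a" and "b"
printf ' a  b\n' | count_literal   # 4 fields: "", "a", "", "b"
```

With FS=" " runs of blanks collapse and leading blanks are ignored; with FS="[ ]" every single space separates two fields, so empty fields appear.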
 
Old 04-20-2012, 02:26 AM   #17
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,116

Rep: Reputation: 7368
I would parse the original file from right to left; if APHSIP will always be there you can do it easily.
$NF and $(NF-1) can be identified easily. Cut them off. The next step is to find *digits*something and split it into 4 fields, and finally you can split the rest of the string from the beginning.
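
A rough sketch of that first step only, assuming the last two columns never contain embedded blanks, so the default splitting isolates them. `split_tail` is an illustrative name, and the remaining steps are not attempted here.

```shell
# Hypothetical helper: cut the last two fields off the record and print them after the rest.
split_tail() {
    awk 'NF > 2 {
        last = $NF
        prev = $(NF - 1)
        # remove the last two fields and the blanks around them from the record
        sub(/[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]*$/, "")
        print $0 "|" prev "|" last
    }' "$@"
}

printf 'x y   64476 SYP\n' | split_tail   # x y|64476|SYP
```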
 
Old 04-20-2012, 02:59 AM   #18
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
It looks to me that you started with a tab-separated file, and the tabs were expanded to spaces either earlier or when you posted. In fields where the entry crosses a tab stop, the next field was printed on the next tab stop, resulting in the uneven appearance. Look at a line with "od" and see if that is the case. It will be much easier to retain the tabs than to work with a file that has the extra spaces.
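
One way to do that check, as a small sketch: `od -c` prints each byte as a character, so a tab shows up as \t while a space shows up as a blank column. The helper name is made up, and the input here is a made-up line rather than the real report.

```shell
# Hypothetical helper: show every byte of the first line of a file (or of stdin).
first_line_bytes() {
    head -n 1 "$@" | od -c
}

printf 'a\tb c\n' | first_line_bytes
```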
 
Old 04-25-2012, 07:08 AM   #19
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
it's all space delimited.


a new approach is to put the OFS char after every char, then i can group my needed output fields by field #'s, like "date = $1$2$3$4$5$6$7$8" "time = $9$10$11$12$13$14$15$16" etc, then print out these variables. i found one oddity in field #'s but i can use an if statement to find that, etc. i'll post my solution.

Last edited by Linux_Kidd; 04-25-2012 at 07:30 AM. Reason: new approach
 
Old 04-25-2012, 07:42 AM   #20
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389
May I ask what program / script generated your input file and if possible, can you show us the code responsible for its format? Knowing the algorithm of how the input file was generated could help us make some assumptions about the format that will make the job easier.
 
Old 04-25-2012, 08:35 AM   #21
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
the report(s) come from CA Top Secret. i don't know if CA TS reports can be output in different ways, but currently i get these reports from existing report crons so they will not be changed. i clean up the data and normalize the output to a constant number of chars per line like this:

Code:
#!/bin/bash
# written by me
# this script is called from "/var/scripts/audit.sh"
awk '
BEGIN {
OFS="|";
FS=" ";
}
{
        if ( NF == 0 || $1 ~ /^(poop|fun|[-+=]|0$|1\/|\/\/|TS|PASSWORD|DYNAMIC|1E)/ ) {}
        else {
        print $0;}
} ' | sed 's/^0\(.*\)/\1/' | sed 's/\s/|/g' | sed 's/^|\(.*\)/\1/'
i will then replace every non pipe char with char| and then run that back through awk, grouping my desired "fields" into variables, then print out the variables, etc.

this should work rather nicely.
 
Old 04-25-2012, 09:58 AM   #22
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Well my only addition would be that the seds are not needed as awk can do all the tasks.
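
As a rough sketch of what that could look like, assuming only the three seds are folded into awk and the junk-line regex is left out for brevity. The function name `clean` is made up, and only space and tab are treated as whitespace here, whereas GNU sed's \s matches other whitespace too.

```shell
# Hypothetical helper: the three sed steps from the script above, done inside awk.
clean() {
    awk '
        NF == 0 { next }        # skip blank lines, as the original awk does
        {
            sub(/^0/, "")       # sed s/^0\(.*\)/\1/   drop a leading zero
            gsub(/[ \t]/, "|")  # sed s/\s/|/g         blanks become pipes
            sub(/^[|]/, "")     # sed s/^|\(.*\)/\1/   drop a leading pipe
            print
        }' "$@"
}

printf '0ab cd\n\n ef gh\n' | clean
```

For the three made-up input lines this prints ab|cd and ef|gh, and the empty line is dropped.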
 
Old 04-25-2012, 10:08 AM   #23
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
yeah, but i am in debug mode so having separate seds helps me find formatting issues. i'll post my solution, then you can elegant it. thnx.
 
Old 04-25-2012, 01:25 PM   #24
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
ok, so i got it down to this. works like a charm, however, the very last line of the output writes out 16 blank fields with pipe delimiter, and i dont know why. the file looks like the last line is empty, hexdump shows this as last two entries:
Code:
0007070 2f31 3033 0a0a
0007076
Code:
#!/bin/bash
awk '
BEGIN {
OFS="|";

}
{
        if ( NF == 0 || $0 ~ /(\/\/|match junk to be ignored)/ ) {}
        else {
        print $0 ;}
} ' | sed 's/^[|0]\(.*\)/\1/' | sed 's/\s/|/g' | sed 's/^|\(.*\)/\1/' | sed 's/[^|]/&|/g' |
awk '
BEGIN {
OFS="|";
FS="|";
}
{
space = " "
date = $1$2$3$4$5$6$7$8
time = $10$11$12$13$14$15$16$17
field3 = $19$20$21$22
field4 = $24$25$26$27$28$29$30$31
field5 = $33$34$35$36$37$38$39$40
field6 = $42space$44
field7 = $46$47
field8 = $49$50$51$52$53$54$55$56
field9 = $58$59$60$61$62$63$64$65
field10 = $67$68$69$70$71$72$73$74
field11 = $76$77$78
field12 = $79$80$81$82
field13 = $84$85$86$87$88
field14 = $90$91$92$93$94$95$96$97$98$99$100$101$102$103$104$105$106$107$108$109$110$111$112$113$114$115$116$117
field15 = $119$120$121$122$123
field16 = $125$126$127$128
print date,time,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16;
} '

Last edited by Linux_Kidd; 04-25-2012 at 01:30 PM.
 
Old 04-25-2012, 01:46 PM   #25
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Are you sure the last line is empty and does not contain spaces or maybe tabs?

Basically the last awk will output 16 fields no matter what the data is as there is no test for it not to print.
So the only conclusion I can draw is that the previous awk / sed combination is producing a blank line for the last one.

Of course, a simple solution is to put NF before the opening brace in the second awk; then only lines with data will be processed.
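
A small sketch for checking what the file really ends with. The default hexdump output groups bytes into 16-bit words, which on a little-endian machine shows each pair swapped, so 2f31 3033 0a0a reads as the characters 1/30 followed by two newlines. `od -c` shows the bytes one at a time, which is easier to read. The helper name and the throwaway file are only for illustration.

```shell
# Hypothetical helper: dump the last N bytes of a file, one character per column.
last_bytes() {
    tail -c "$2" "$1" | od -An -c
}

# Example with a throwaway file that ends in an empty line.
tmp=$(mktemp)
printf '1/30\n\n' > "$tmp"
last_bytes "$tmp" 6
rm -f "$tmp"
```

A space or tab on the last line would show up between the two \n entries.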
 
Old 04-25-2012, 01:53 PM   #26
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
but how would anything get past the 1st awk "if NF == 0" ? nothing should get piped if NF = 0 (doesnt a blank line mean NF = 0 ??)

and, you mean add a "if NF == 0" to the 2nd awk?

edit: so i added a "if NF == 0" to 2nd awk, that fixes the blank fields issue, but where does the blank line come from?

Last edited by Linux_Kidd; 04-25-2012 at 02:02 PM.
 
Old 04-26-2012, 04:52 AM   #27
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Well I do not know the data all that well but is it possible that you have a line with perhaps a 0 on it which the seds would then remove?
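
To illustrate the idea, here is a small sketch using only the pieces of the pipeline that matter for this case. It assumes a line consisting of just a 0, which has NF == 1 and so passes the NF == 0 test, and it prints three fields instead of sixteen to keep it short. The function name is made up.

```shell
# Hypothetical reduced pipeline: NF filter, the first sed, then a pipe-delimited print.
lone_zero_demo() {
    awk 'NF == 0 { next } { print }' |
    sed 's/^[|0]\(.*\)/\1/' |
    awk 'BEGIN { FS = OFS = "|" } { print $1, $2, $3 }'
}

printf '0\n' | lone_zero_demo   # prints "||": the now-empty record still produces output
```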

I would like to bring back a previous question I had about separating the field by length. Previously you said this might not be possible as fields may be of alternating lengths.
My issue now is that by using every field and sticking them together as you have done, you have now basically said that all the fields are represented, as you use a finite number of fields to create your 16 fields.

I do notice that you skip some fields (not sure how you know they can be removed?), but ultimately you use all available fields from 1 to 128.
Maybe this is something you can reinvestigate to make it simpler?

Oh and my suggestion was to simply put NF prior to last set of braces:
Code:
NF{
space = " "
date = $1$2$3$4$5$6$7$8
...

Last edited by grail; 04-26-2012 at 04:53 AM.
 
Old 04-26-2012, 07:22 AM   #28
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
ok, it seems that the report file fields are written to fixed width fields separated by a single space. if the field data does not take up the full width it is padded using space (hex 20) and can be padded before and/or after the data (as example, one field is 5 char wide, but in some cases the field is populated with h20h20>h20h20 while in other cases it is h20INAh20). i only observed this after substituting every space with | and every char with char|, which revealed what looks like data fields of fixed width where the data itself may not take up the full width, etc. the initial awk did not reveal this, since one of the fields was "poop" while a few lines later the same field was "banana", etc.

the field #'s that i skip are the hex20's that separate each data field. i looked at the raw data and the data as it passes through the script and i dont see anywhere where there is a 0 by itself which would render a blank line by sed.
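
a small sketch of why the field #'s line up with the character columns, on a made-up 5 char line. it assumes the same two substitutions as in the script, with [[:space:]] swapped in for GNU sed's \s, and the function name is made up.

```shell
# Hypothetical helper: every blank becomes "|" and every other char gets a "|" appended.
explode() {
    sed 's/[[:space:]]/|/g' | sed 's/[^|]/&|/g'
}

printf 'ab cd\n' | explode                                  # a|b||c|d|
printf 'ab cd\n' | explode | awk -F'|' '{ print NF, $4 }'   # 6 c
```

field 3 is empty because column 3 was the space, and NF is one more than the number of chars because of the trailing pipe.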

since what i have works well i'll leave it as-is, but i am sure it could be made simpler.

Last edited by Linux_Kidd; 04-26-2012 at 07:27 AM.
 
Old 04-26-2012, 10:18 AM   #29
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
If you want to tell me the field lengths and also which fields always exist, I can produce a much easier way (I think).

I assume the data in post #5 is still good to test on?

Also, if there is any chance, I fully recommend upgrading to version 4+.
 
Old 04-26-2012, 11:01 AM   #30
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
real sanitized data. the 1st line of data starts with a zero that needs to be stripped off, and all other lines start with h20 (space). in my script i strip out the zero and space before parsing the fields, etc.

Code:
008/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 SYP
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 ID
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 NEP
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 REXREPLY
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 REXTSYSA
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 REXT
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 S31
 08/18/11 06:54:26 BEAS RSNL     DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=RSNL                    64476 VMD
 08/18/11 07:53:27 BEAS REFGNVXJ DOCPRUSS 2 F    PLATJX                     *1D*-49 PDD   NAME=REFGNVXJ                64476 M078
 08/18/11 07:53:57 BEAS ACTRYTXT DOCPRUSS 2 F    PLATJX                     *0C*-0A PDD   NAME=PAHANY JAWFIK           64476 N329
 08/18/11 08:05:27 BEAS SVBENLXX DOCPRUSS 2 F 01 PLATJX                     *08*-09 PDD   NAME=LIPING XU               64476 N238
 08/18/11 08:09:57 BEAS RETELALA DOCPRUSS 2 F 01 PLATJX                     *08*-09 PDD   NAME=ABEL TNIS               64476 N013
 08/18/11 08:12:57 BEAS LOANSSXS DOCPRUSS 2 F 01 PLATJX                     *08*-09 PDD   NAME=STACEY RENCEEP          64476 N424
 08/18/11 08:17:57 BEAS CCNTRRXP CCNTRRXP 2 F 01 PLATJX   EXECUTE  NONE     *08*-88     > CAPN                         64476 M113
i do not know which fields have data in them, i only know the field widths that the data can occupy.

so, i can only tell you what the field lengths are after stripping 0 and space as i said above, then replace every space with | and every char with char|, i believe this yields constant NF=132.

so the 16 output fields i need are:
(F means field, and there is a h20 between each of the 16, except between #11 and #12 which are adjacent, and at #6 where that space is retained in my field, etc)

Code:
1 = F1-F8 8char
2 = F10-F17 8char
3 = F19-F22 4char
4 = F24-F31 8char
5 = F33-F40 8char
6 = F42 h20 F44 2char
7 = F46-F47 2char
8 = F49-F56 8char
9 = F58-F65 8char
10= F67-F74 8char
11= F76-F78 3char
12= F79-F82 4char
13= F84-F88 5char
14= F90-F117 28char
15= F119-F123 5char
16= F125-F132 8char
NOTE#1: i just spotted an error, the last field can be 8char wide, i originally said 4char.
NOTE#2: the NAME field in my script is concatenated because names that use two strings are not predictable in length, this field can contain multiple strings for the actual name, etc.
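
Given the column list above, a minimal sketch of a more direct route is to cut each field out of the line with substr() instead of exploding every character into its own field. This is not the script from the thread. It assumes every data line starts with exactly one 0 or space that can be dropped, it takes field 6 as the three columns 42-44 with the space kept, and it trims the padding but keeps spaces inside a field (so a two-word NAME keeps its space). The function name `extract_fields` and the start:length table are made up for illustration. gawk also has a FIELDWIDTHS variable for this kind of data, which is not used here so that the sketch stays portable.

```shell
# Hypothetical helper: print the 16 fixed-width fields, pipe-delimited and trimmed.
extract_fields() {
    awk '
        BEGIN {
            OFS = "|"
            # start:length pairs taken from the column list in the post above
            n = split("1:8 10:8 19:4 24:8 33:8 42:3 46:2 49:8 58:8 67:8 " \
                      "76:3 79:4 84:5 90:28 119:5 125:8", spec, " ")
        }
        NF {                                  # skip blank lines
            sub(/^[0 ]/, "")                  # drop the leading 0 or space
            out = ""
            for (i = 1; i <= n; i++) {
                split(spec[i], p, ":")
                f = substr($0, p[1], p[2])
                sub(/^ +/, "", f)             # trim leading padding
                sub(/ +$/, "", f)             # trim trailing padding
                out = (i == 1 ? f : out OFS f)
            }
            print out
        }' "$@"
}

# Build one made-up line with the same column layout and run it through.
printf ' %-8s %-8s %-4s %-8s %-8s %-3s %-2s %-8s %-8s %-8s %-3s%-4s %-5s %-28s %-5s %-8s\n' \
    '08/18/11' '08:05:27' 'BEAS' 'SVBENLXX' 'DOCPRUSS' '2 F' '01' 'PLATJX' '' '' \
    '*08' '*-09' 'PDD' 'NAME=LIPING XU' '64476' 'N238' | extract_fields
```

For that line the output is 08/18/11|08:05:27|BEAS|SVBENLXX|DOCPRUSS|2 F|01|PLATJX|||*08|*-09|PDD|NAME=LIPING XU|64476|N238, with the two unused fields left empty.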

Last edited by Linux_Kidd; 04-26-2012 at 11:37 AM.
 
  

