LinuxQuestions.org - [SOLVED] Parse/rewrite file help

rhel 5.7 bash

need to convert a file like this. basically i just need to identify a line that starts with "ID", read that line into a variable ($X), then output it in front of every line like so, and continue to loop until eof.

$X,$X
$X,data from line 2
$X,data from line 3
$X,data from line 4

then reset the variable when "ID" is found again, then repeat, etc.

xxxxx, yyyyy, zzzzz here are just random data, but each is a whole line.

(input file)

ID = xyz name = abc
xxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band
xxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzz

(desired output file)
ID = xyz name = abc,ID = xyz name = abc
ID = xyz name = abc,xxxxxxxxxxxxxxxxxxxxxxxx
ID = xyz name = abc,yyyyyyyyyyyyyyyyyyyyyyyy
ID = xyz name = abc,zzzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band,ID = THE name = band
ID = THE name = band,xxxxxxxxxxxxxxxxxxxxxxxx
ID = THE name = band,yyyyyyyyyyyyyyyyyyyyyyyy
ID = THE name = band,zzzzzzzzzzzzzzzzzzzzzzzz

It's easier to use awk than just bash

cat in
ID = xyz name = abc
1xxxxxxxxxxxxxxxxxxxxxxx
1yyyyyyyyyyyyyyyyyyyyyyy
1zzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band
2xxxxxxxxxxxxxxxxxxxxxxx
2yyyyyyyyyyyyyyyyyyyyyyy
2zzzzzzzzzzzzzzzzzzzzzzz

awk '/^ID/ {id=$0} {printf "%s,%s\n",id,$0}' <in
ID = xyz name = abc,ID = xyz name = abc
ID = xyz name = abc,1xxxxxxxxxxxxxxxxxxxxxxx
ID = xyz name = abc,1yyyyyyyyyyyyyyyyyyyyyyy
ID = xyz name = abc,1zzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band,ID = THE name = band
ID = THE name = band,2xxxxxxxxxxxxxxxxxxxxxxx
ID = THE name = band,2yyyyyyyyyyyyyyyyyyyyyyy
ID = THE name = band,2zzzzzzzzzzzzzzzzzzzzzzz
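
For comparison, here is roughly what the same logic looks like in plain shell, as a sketch only. The function name prefix_id_bash is made up for illustration, and the loop reads from stdin.

```shell
# Hypothetical helper (the name is illustrative): plain-shell version of the awk command.
prefix_id_bash() {
    id=''
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ID*) id=$line ;;               # line starts with ID: remember it
        esac
        printf '%s,%s\n' "$id" "$line"     # prefix every line with the saved ID line
    done
}

prefix_id_bash <<'EOF'
ID = xyz name = abc
1xxxxxxxxxxxxxxxxxxxxxxx
ID = THE name = band
2xxxxxxxxxxxxxxxxxxxxxxx
EOF
```

It works, but a shell read loop is typically much slower than awk on a large file, which is one more reason to prefer the awk version.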

Could make it even simpler:

Code:

awk '/^ID/ {id=$0}$0 = id","$0' file
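
In case the short form looks like magic: in awk an assignment is also an expression, so $0 = id","$0 rewrites the line and at the same time acts as the pattern. The result is a string that always contains at least the comma, so it counts as true and the default action (print) runs. A spelled-out equivalent, as a sketch (the wrapper name prefix_id is just for illustration):

```shell
# Hypothetical wrapper name; the awk program is the long form of the one-liner above.
prefix_id() {
    awk '
        /^ID/ { id = $0 }        # remember the most recent line starting with ID
        { print id "," $0 }      # print every line with that ID line in front
    ' "$@"
}

printf '%s\n' 'ID = xyz name = abc' 'xxxx' 'ID = THE name = band' 'yyyy' | prefix_id
```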

thnx, i will give this a try.

ok, small issue.

i forgot to say that the input file has lines before "^ID" and those need to be skipped (# of lines is random). input file also has blank lines in random places, so i need to also skip any blank lines.

as example:

(input file)

Code:

junk filler random data yada yada yada

junk filler // nothing random data yada yada yada

junk filler random data \\ sky is blue yada yada yada

junk filler **(texas should have won) random data yada yada yada

ID = xyz name = abc

xxxxxxx xxxxxxx = xxxxxxxxxx

yyy = yyyyyyyyyy = yyyyyyyyyyy

zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz



ID = THE name = band

xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx

yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy

zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

using (awk '/^ID/ {id=$0}$0 = id","$0' file) this is what i get

Code:

,junk filler random data yada yada yada

,junk filler // nothing random data yada yada yada

,junk filler random data \\ sky is blue yada yada yada

,junk filler **(texas should have won) random data yada yada yada

ID = xyz name = abc,ID = xyz name = abc

ID = xyz name = abc,xxxxxxx xxxxxxx = xxxxxxxxxx

ID = xyz name = abc,yyy = yyyyyyyyyy = yyyyyyyyyyy

ID = xyz name = abc,zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz

ID = xyz name = abc,

ID = THE name = band,ID = THE name = band

ID = THE name = band,xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx

ID = THE name = band,yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy

ID = THE name = band,zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

Add condition "print only if id is set" to grail's awk command:

Code:

awk '/^ID/ {id=$0} id && $0 = id","$0' file

i still get wrong output (it doesn't skip the blank lines)

Code:

[root@host ~]$ more test3.txt

junk filler random data yada yada yada

junk filler // nothing random data yada yada yada

junk filler random data \\ sky is blue yada yada yada

junk filler **(texas should have won) random data yada yada yada

ID = xyz name = abc

xxxxxxx xxxxxxx = xxxxxxxxxx

yyy = yyyyyyyyyy = yyyyyyyyyyy

zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz



ID = THE name = band

xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx

yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy

zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz



[root@host ~]$ awk '/^ID/ {id=$0} id && $0 = id","$0' test3.txt |more

ID = xyz name = abc,ID = xyz name = abc

ID = xyz name = abc,xxxxxxx xxxxxxx = xxxxxxxxxx

ID = xyz name = abc,yyy = yyyyyyyyyy = yyyyyyyyyyy

ID = xyz name = abc,zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz

ID = xyz name = abc,

ID = THE name = band,ID = THE name = band

ID = THE name = band,xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx

ID = THE name = band,yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy

ID = THE name = band,zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

i was trying something like this, but can't get the ELSE part to work as desired (that is, to know whether "ID" was already found, by using a variable)

Code:

#!/bin/awk -f

BEGIN {

OFS=",";

}

{

        if ( $1 == "ID" ) {

        id=$0;

        print $0,$0;}

        else {

                if (id contains "ID") {

                print id,$0;}

            }

}
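
A note on the ELSE part of the script above: awk has no contains operator. As far as I can tell, awk reads contains as an unset variable and the whole condition as a string concatenation (id, nothing, "ID"), which is never empty, so the test is always true. One way to ask "has an ID been seen yet?" is to compare id against the empty string. A minimal sketch (the function name is hypothetical, and it does not yet skip blank lines):

```shell
# Hypothetical helper name; shows only the "ID already found" test.
prefix_after_first_id() {
    awk '
        BEGIN { OFS = "," }
        $1 == "ID" { id = $0 }        # remember the current ID line
        id != ""   { print id, $0 }   # prints nothing until an ID has been seen
    ' "$@"
}

printf '%s\n' 'junk filler' 'ID = xyz name = abc' 'xxxx' | prefix_after_first_id
```

index(id, "ID") or id ~ /ID/ would also work if you really want a substring test.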

i did some playing around with awk script, came up with this. it ignores all lines up until it finds $1=ID, and also ignores empty lines. seems to work ok. i will need to manually chop out of the output file a few lines at the end since my input file has no definitive marker, but no big deal. any way to simplify?

Code:

#!/bin/awk -f

BEGIN {

OFS="|";

}

{

        if ( NF == 0 ) {}

        else {

        if ( $1 == "ID" ) {

        id2=$0;

        id1="true";

        print $0,$0;}

        else {

                if ( id1 != "true" ) {}

                else {

                        print id2,$0;}

                }

}

}
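
On "any way to simplify?": the nested if/else blocks with empty bodies can be flattened into pattern-action rules, and the separate "true" flag is not needed because a non-empty id already tells you an ID was found. A sketch under those assumptions (the name tag_lines is made up):

```shell
# Hypothetical helper name; same behaviour as the script above, written as rules.
tag_lines() {
    awk -v OFS='|' '
        NF == 0    { next }            # skip empty and whitespace-only lines
        $1 == "ID" { id = $0 }         # new ID line: remember it
        id != ""   { print id, $0 }    # only print once an ID has been seen
    ' "$@"
}

printf '%s\n' 'junk filler' '' 'ID = xyz name = abc' '' 'xxxxxxx = xxxx' '   ' | tag_lines
```

On the ID line itself the last rule prints id and $0, which are the same string, so you still get the "ID...|ID..." line.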

Is there more junk mixed in through the file after ID is discovered for the first time?
Are there always blank lines between IDs?

It helps if you can describe your input data more if we are to mold the solution.

Quote:

Originally Posted by grail (Post 4515433)

it's an output file from CA Top Secret (mainframe report). i cannot find definitive patterns of blank lines or definitive markers between the junk and the 1st "ID". after the 1st ID the data flows ID-data-data-data ID-data-data ID-data-data-data-data-data, etc, with some blank lines in there. let me see if i can sanitize a portion of my real file (it's a large file) and i will post it.
thnx.

i ended up with this.

Code:

#!/bin/awk -f

BEGIN {

OFS="|";

}

{

        if ( NF == 0 || $0 ~ /(pattern1)|(pattern2)|(pattern3)|(pattern4)|(pattern5)|(pattern6)|(pattern7)|(pattern8)|(pattern9)/ ) {}

        else {

        if ( $1 == "ID" ) {

        id2=$1$2$3;

        id1="true";

        print id2,id2;}

        else {

                if ( id1 != "true" ) {}

                else {

                        print id2,$0;}

                }

}

}

This seems overly complicated. Let me see if I understand the file structure:

1. Any amount of crap, but definitely not the letters ID, prior to the first occurrence of ID

2. Once ID is found there will be lines of data to be prepended with the ID and a comma

3. Blank lines may also occur after ID is found (you may need to confirm whether blank means nothing but a newline or could possibly contain white space as well)

So based on the above, the idea is that ALL lines must be printed irrespective of their data, but every non-blank line from the first ID onward must have the ID string and a comma inserted (correct?)

Code:

awk '/ID/{id = $0}id && NF{$0=id","$0}1' file

Quote:

Originally Posted by grail (Post 4515433)

grail,
here is raw source (sanitized fubar). i don't need anything until the 1st occurrence of "ID", no blank lines, and i don't need "TSP0320I LIST FUNCTION SUCCESSFUL" near the end or anything after that, etc. notice i also skip the following (or similar crud) that is wedged between pages and/or ID's.

1COMPUTER ASSOCIATES ***** T S S C O M M A N D P R O C E S S O R ***** TSSSSSDB PAGE 2
CA-POT RET/VS 2.0 12/04/2010 11.36.04

my real source file is ~30k lines and has many many ID's with each ID having random # of lines associated with ID, etc. i dunno if the TS report can be created in different ways, but this is the raw source i have to work with.

can you get this into a one line awk? if so hats off to you. my code in post #11 does the job, but i like simpler if you can achieve that. thnx.

output is OFS="|"
and should look like this (the dots just mean continue on, etc)

output file

Code:

ID = TESTTEST|ID = TESTTEST

ID = TESTTEST|TYPE      = MASTER    SIZE      =    4352  BITS

ID = TESTTEST|FACILITY  = *ALL*

ID = TESTTEST|CREATED    = 07/25/01  LAST MOD  = 08/29/09  09:46

ID = TESTTEST|PROFILED  = PRFGTYU

.

.

.

ID = AKIM|ID = AKIM

ID = AKIM|TYPE      = CENTRAL  SIZE      =      512  BITS

ID = AKIM|FACILITY  = *ALL*

ID = AKIM|CREATED    = 08/29/09  LAST MOD  = 10/05/09  15:14

ID = AKIM|PROFILED  = PRFGTYU  PRFBATCH  PROFGEN

.

.

.

input file

Code:

1// JOB TSSLIST  ***  TSS INIT COMMANDS  ***                        DATE 12/04/2010, CLOCK 11/36/05

 // EXEC TSSSSSDB

 1S23D  PHASE TSSSSSDB IS TO BE FETCHED FROM CAISLIF.PRODUCT

1COMPUTER ASSOCIATES            ***** T S S  C O M M A N D  P R O C E S S O R *****            TSSSSSDB    PAGE    1

 CA-POT RET/VS 2.0                                                                            12/04/2010    11.36.04

  

 *------------------------------------------------------------------------------*

  

 TSS LIST(BASICS) DATA(ALL)

  

 TSS LIST(BASICS) DATA(ALL)

 ID = TESTTEST      NAME      = MASTER SECURITY

 TYPE      = MASTER    SIZE      =    4352  BITS

 FACILITY  = *ALL*

 CREATED    = 07/25/01  LAST MOD  = 08/29/09  09:46

 PROFILED  = PRFGTYU

 ATTRIBUTES = TTY1

 LAST USED  = 08/30/09 16:19 CPU(VPEA) FAC(ICDG    ) COUNT(16381)

 VSESLIB    = FJSWSRS.                    FARTSY.

 VSESLIB    = PRO1.                      PRO2.

 DATASET    = *****

 VOLUMES    = *ALL*(G)

 DCT        = *ALL*

 FCT        = *ALL*

 JCT        = *ALL*

 MODE      = WARN

 OTRAN      = DITT

 PANEL      = REXX

 PPT        = *ALL*

 TERMINAL  = *ALL*      K

 TST        = *ALL*

 XA VSELIB  = VSE.LIBRARY.BLABBER                            OWNER(IRMST  )

    ACCESS  = SOME

 XA VSELIB  = VSE.FARTSY.LIBRARY.BLABBER                    OWNER(IRMST  )

    ACCESS  = SOME

 XA VSELIB  = VSE.PRO1.LIBRARY.PRO1                        OWNER(IRMST  )

    ACCESS  = SOME

 XA VSELIB  = VSE.PRO2.LIBRARY.PRO2                        OWNER(IRMST  )

    ACCESS  = SOME

 XA VSELIB  = VSE.SYSRES.LIBRARY.IJSYSRS                    OWNER(IRMST  )

    ACCESS  = SOME

 XA DATASET = VSE                                          OWNER(IRMST  )

    ACCESS  = SOME

 XA MODE    = FAIL                                          OWNER(IRMST  )

 XA OTRAN  = *ALL*                                        OWNER(IRMST  )

    ACCESS  = SOME

 XA OTRAN  = MD                                            OWNER(IRMST  )

    ACCESS  = SOME

 XA OTRAN  = TSS                                          OWNER(IRMST  )

    ACCESS  = SOME

 -----------  SEGMENT CIPS

 OPIDENT    = AST

 BASICS      = AKIM    -SC SAM    -SC CIPS    (D) CLIMENT(V)

              CLEN    -SC COMM(V) ERTT    -SC GNIT    -SC

              IAUPTRUD-SC JOTC    -SC JTAO    -SC JVOL    -SC

              BLABBER  (Z) TESTY  (D)

  

 ID = AKIM      NAME      = VALUE ADD

1COMPUTER ASSOCIATES            ***** T S S  C O M M A N D  P R O C E S S O R *****            TSSSSSDB    PAGE    2

 CA-POT RET/VS 2.0                                                                            12/04/2010    11.36.04

  

 TYPE      = CENTRAL  SIZE      =      512  BITS

 FACILITY  = *ALL*

 CREATED    = 08/29/09  LAST MOD  = 10/05/09  15:14

 PROFILED  = PRFGTYU  PRFBATCH  PROFGEN

 ATTRIBUTES = TTY1,VSECATBT,VSERDDIR,VSESYSAD,VSEMCON

 LAST USED  = 10/05/09 15:14 CPU(VPEA) FAC(ICDG    ) COUNT(00177)

 -----------  SEGMENT CIPS

 OPIDENT    = OPD

 -----------  SEGMENT IESIS

 IESFL1    = BAS,COD,VSAT

 IESFL2    = BQS,ESC,CSU,CSD,OSPD,XSM

 IESINIT    = IPSEABH

 IESTYPE    = USERTYPE2,NEW,SELECT

 IESVCAT    = TESTCAP

 -----------  ADMINISTRATION AUTHORITIES

 RESOURCE  = XAUTH,INFO

    ACCESS  = SOME

 ECID      = *ALL*

 FACILITIES = *ALL*

 LIST DATA  = *ALL*,PROFILED,PASSFOO

 MISC1      = SUSPEND

 MISC8      = LISTSTC,LISTRDT,REMASUSP,MCS

  

 ID = SAM      NAME      = SAM WALBERG

 TYPE      = CENTRAL  SIZE      =      512  BITS

 FACILITY  = *ALL*

 CREATED    = 04/17/07  LAST MOD  = 08/01/09  11:50

 PROFILED  = PRFGTYU  PRFBATCH  PROFGEN

 ATTRIBUTES = TTY1,VSECATBT,VSERDDIR,VSESYSAD,VSEMCON

 LAST USED  = 09/28/09 01:59 CPU(VPEA) FAC(BATCH  ) COUNT(04165)

 -----------  SEGMENT CIPS

 OPIDENT    = OP5

 -----------  SEGMENT IESIS

 IESFL1    = BAT,COD,VSAM

 IESFL2    = BQA,ESC,COU,CMD,OLPD,XRM

 IESINIT    = IESEADM

 IESTYPE    = USERTYPE14,NEW,SELECT

 IESVCAT    = TESTCAP

 -----------  ADMINISTRATION AUTHORITIES

 RESOURCE  = *ALL*

    ACCESS  = SOME

 ECID      = *ALL*

 FACILITIES = *ALL*

 LIST DATA  = *ALL*,PROFILED,PASSFOO

 MISC1      = *ALL*

 MISC2      = *ALL*

 MISC3      = *ALL*

 MISC8      = LISTSTC,LISTRDT,REMASUSP,MCS,LISTSDT

 MISC9      = *ALL*



ID = RRQMTTTT  NAME      = RQAMT TEST USER

 TYPE      = USER      SIZE      =      512  BITS

 FACILITY  = TEST

 DEPT ECID  = TESTY    DEPARTMENT = TEST USERS

 CREATED    = 07/13/06  LAST MOD  = 08/24/06  15:29

 PROFILED  = PRFRRQMT  PRFBATCH

 LAST USED  = 07/14/06 13:33 CPU(VSEB) FAC(BATCH  ) COUNT(00012)

 XA VSESLIB = DB3LIPS.TESTBTCH                              OWNER(IRMST  )

    ACCESS  = READ

 XA OTRAN  = CEDF                                          OWNER(IRMST  )

    ACCESS  = EXECUTE

 -----------  SEGMENT CIPS

 OPIDENT    = BBJDS

  

 TSP0320I  LIST    FUNCTION SUCCESSFUL

  

  

 *------------------------------------------------------------------------------*

  

  

1COMPUTER ASSOCIATES            ***** T S S  C O M M A N D  P R O C E S S O R *****            TSSSSSDB    PAGE  444

 CA-POT RET/VS 2.0                                                                            12/04/2010    11.36.04

  

 TSS INPUT STATEMENTS READ              1

 TSS COMMANDS PROCESSED                1

 TSS BATCH ENVIRONMENT ERRORS          0

 TSS COMMAND ERRORS                    0

1EOJ TSSLIST                                                        DATE 12/04/2010, CLOCK 11/36/20, DURATION  00/00/15
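
For the awk-only route asked about above, something along these lines might do, as a sketch under stated assumptions: the page-header patterns are taken only from the sample posted here and would need extending for any other report furniture, the function name tss_flatten is made up, and the ID is rebuilt from the first three fields joined with spaces so it matches the "ID = TESTTEST" form in the wanted output. Testing $1 == "ID" instead of a regex avoids tripping over lines such as OPIDENT or ECID, and works whether or not the line has a leading space.

```shell
# Hypothetical helper name; header patterns are assumptions based on the posted sample.
tss_flatten() {
    awk -v OFS='|' '
        /TSP0320I/ { exit }                                  # end of the listing
        NF == 0 { next }                                     # blank or whitespace-only
        /^1COMPUTER ASSOCIATES/ || $1 == "CA-POT" { next }   # page-break headers
        $1 == "ID" && $2 == "=" {                            # start of a new record
            id = $1 " " $2 " " $3
            print id, id
            next
        }
        id != "" {                                           # data line after an ID
            sub(/^[ \t]+/, "")                               # drop leading whitespace
            print id, $0
        }
    ' "$@"
}

# usage: tss_flatten infile > outfile
```

The "-----------  SEGMENT ..." lines are kept as data here; add another skip rule if they are not wanted.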

This should do the job, and also split each fact on a separate line. It does a bit more than you asked for, but I guess this is what I'd start with.

Code:

tr -s '\r\n' '\n\n' < infile | sed -e 's|[\t\v\f ]*=[\t\v\f ]*|=|g; s|[\t\v\f ][\t\v\f ]\+\([^\t\v\f =]\+\([\t\v\f ][^\t\v\f =]\+\)*=\)|\n\1|g' | sed -e 's|^[\t\v\f ]\+||; s|[\t\v\f ]\+$||; s|[\t\v\f ]\+)|)|g; s|[\t\v\f ][\t\v\f ]\+| |g' | awk '/^ID=/ { id=$0 ; print id ; next } /=/ && length(id) { print id "|" $0 }' >outfile

The tr converts all newline conventions to standard Unix newlines.

The first sed removes whitespace around equals signs. Also, if there are multiple consecutive whitespaces, followed by some term (which may contain nonconsecutive whitespaces) and an equals sign, it splits the line at the consecutive whitespace. This makes sure each fact is on its own line.

The second sed removes leading and trailing whitespace, all whitespace before a close parenthesis, and combines multiple consecutive whitespaces into one. (Because the first sed introduces new newlines, I find it is easiest to flatten the data stream by using a separate sed command. It makes it easier to develop such long pipe stanzas.)

The awk part picks the ID values (also printing them alone), and for any line containing an equals sign, prints the id and the line. If you do not need the ID alone, just omit the first print.

If the input may contain pipes, I recommend prepending s/|/!/g; to the first sed pattern.

If you prefer the whitespace around = and |, add | sed -e 's/\([|=]\)/ \1 /g' just before the >outfile.

To see which input lines are ignored/omitted by the above command, replace the end, starting at awk, with grep -v -e '=' -e '^[\t\v\f ]*$'
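
To illustrate the optional whitespace step on its own: assuming the intended expression captures the = or | in a group and puts a space on each side of it, it behaves like this small sketch (the function name pad_separators is hypothetical):

```shell
# Hypothetical helper name; pads every "=" and "|" with one space on each side.
pad_separators() {
    sed -e 's/\([|=]\)/ \1 /g'
}

printf '%s\n' 'ID=AKIM|TYPE=CENTRAL' | pad_separators
# prints: ID = AKIM | TYPE = CENTRAL
```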

Nominal,
i will try that. thnx.