[SOLVED] Parse/rewrite file help

Linux_Kidd · 10-27-2011, 01:18 PM

rhel 5.7 bash

need to convert a file like this. basically just need to identify the start of a line with "ID", read that line into a variable, then output. continue to loop until eof.

$X,$X
$X,data from line 2
$X,data from line 3
$X,data from line 4

then reset variable when "ID" is found again, then repeat, etc.

xxxxx, yyyyy, zzzzz, here is just random data, but it is a whole line.

(input file)

ID = xyz name = abc
xxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band
xxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzz

(desired output file)
ID = xyz name = abc,ID = xyz name = abc
ID = xyz name = abc,xxxxxxxxxxxxxxxxxxxxxxxx
ID = xyz name = abc,yyyyyyyyyyyyyyyyyyyyyyyy
ID = xyz name = abc,zzzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band,ID = THE name = band
ID = THE name = band,xxxxxxxxxxxxxxxxxxxxxxxx
ID = THE name = band,yyyyyyyyyyyyyyyyyyyyyyyy
ID = THE name = band,zzzzzzzzzzzzzzzzzzzzzzzz

smallpond · 10-27-2011, 03:42 PM

It's easier to use awk than just bash

cat in
ID = xyz name = abc
1xxxxxxxxxxxxxxxxxxxxxxx
1yyyyyyyyyyyyyyyyyyyyyyy
1zzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band
2xxxxxxxxxxxxxxxxxxxxxxx
2yyyyyyyyyyyyyyyyyyyyyyy
2zzzzzzzzzzzzzzzzzzzzzzz

awk '/^ID/ {id=$0} {printf "%s,%s\n",id,$0}' <in
ID = xyz name = abc,ID = xyz name = abc
ID = xyz name = abc,1xxxxxxxxxxxxxxxxxxxxxxx
ID = xyz name = abc,1yyyyyyyyyyyyyyyyyyyyyyy
ID = xyz name = abc,1zzzzzzzzzzzzzzzzzzzzzzz
ID = THE name = band,ID = THE name = band
ID = THE name = band,2xxxxxxxxxxxxxxxxxxxxxxx
ID = THE name = band,2yyyyyyyyyyyyyyyyyyyyyyy
ID = THE name = band,2zzzzzzzzzzzzzzzzzzzzzzz

grail · 10-28-2011, 12:30 AM

Could make it even simpler:

Code:

awk '/^ID/ {id=$0}$0 = id","$0' file
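For anyone puzzled by the terse form: in awk an assignment is itself an expression, so $0 = id","$0 used as a pattern rewrites the line and then counts as true (the resulting string always contains at least the comma, so it is never empty), which triggers the default print action. A long-hand sketch of the same logic follows; the temporary file and the sample lines are only there for illustration.

```shell
# Sample input (made up) written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ID = xyz name = abc
xxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyy
ID = THE name = band
zzzzzzzzzzzzzzzzzzzzzzzz
EOF

out=$(awk '
/^ID/ { id = $0 }        # remember the most recent ID line
{ print id "," $0 }      # prefix every line, the ID line included
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$out"
```

Both forms should behave the same on this kind of input; the long-hand one is just easier to extend with extra conditions.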

Linux_Kidd · 10-28-2011, 08:39 AM

thnx, i will give this a try.

Linux_Kidd · 11-03-2011, 07:36 AM

ok, small issue.

i forgot to say that the input file has lines before "^ID" and those need to be skipped (# of lines is random). input file also has blank lines in random places, so i need to also skip any blank lines.

as example:

(input file)

Code:

junk filler random data yada yada yada
junk filler // nothing random data yada yada yada
junk filler random data \\ sky is blue yada yada yada
junk filler **(texas should have won) random data yada yada yada
ID = xyz name = abc
xxxxxxx xxxxxxx = xxxxxxxxxx
yyy = yyyyyyyyyy = yyyyyyyyyyy
zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz

ID = THE name = band
xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy
zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

using (awk '/^ID/ {id=$0}$0 = id","$0' file) this is what i get

Code:

,junk filler random data yada yada yada
,junk filler // nothing random data yada yada yada
,junk filler random data \\ sky is blue yada yada yada
,junk filler **(texas should have won) random data yada yada yada
ID = xyz name = abc,ID = xyz name = abc
ID = xyz name = abc,xxxxxxx xxxxxxx = xxxxxxxxxx
ID = xyz name = abc,yyy = yyyyyyyyyy = yyyyyyyyyyy
ID = xyz name = abc,zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz
ID = xyz name = abc,
ID = THE name = band,ID = THE name = band
ID = THE name = band,xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx
ID = THE name = band,yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy
ID = THE name = band,zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

Nominal Animal · 11-03-2011, 11:43 AM

Add condition "print only if id is set" to grail's awk command:

Code:

awk '/^ID/ {id=$0} id && $0 = id","$0' file

Linux_Kidd · 11-03-2011, 12:26 PM

Quote:

Originally Posted by Nominal Animal

Add condition "print only if id is set" to grail's awk command:

Code:

awk '/^ID/ {id=$0} id && $0 = id","$0' file

i still get wrong output (it doesnt skip the blank lines)

Code:

[root@host ~]$ more test3.txt
junk filler random data yada yada yada
junk filler // nothing random data yada yada yada
junk filler random data \\ sky is blue yada yada yada
junk filler **(texas should have won) random data yada yada yada
ID = xyz name = abc
xxxxxxx xxxxxxx = xxxxxxxxxx
yyy = yyyyyyyyyy = yyyyyyyyyyy
zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz

ID = THE name = band
xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy
zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

[root@host ~]$ awk '/^ID/ {id=$0} id && $0 = id","$0' test3.txt |more
ID = xyz name = abc,ID = xyz name = abc
ID = xyz name = abc,xxxxxxx xxxxxxx = xxxxxxxxxx
ID = xyz name = abc,yyy = yyyyyyyyyy = yyyyyyyyyyy
ID = xyz name = abc,zzzzzzzzz = zzzzzzzz(YSYR) zzzzzzz
ID = xyz name = abc,
ID = THE name = band,ID = THE name = band
ID = THE name = band,xxxxx (YSTSTS) xxxxxxxxxxxxxxxxxxx
ID = THE name = band,yyyyyyyyyyyy\=(YHSGST) yyyyyyyyyyyy
ID = THE name = band,zzzzzz = 09/22/11 zzzzzzzzzzzzzzzzzz

i was trying something like this, but cant get the ELSE part to work as desired (that is, know if "ID" was already found by using variable)

Code:

#!/bin/awk -f
BEGIN {
OFS=",";
}
{
        if ( $1 == "ID" ) {
        id=$0;
        print $0,$0;}
        else {
                if (id contains "ID") {
                print id,$0;}
             }
}
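A note on the script above: awk has no "contains" operator, so the else branch fails to parse. Testing that id is non-empty (or matching it with a regex such as id ~ /^ID/) expresses "an ID was already seen". Below is a minimal sketch with the same if/else shape, run from the shell; it assumes blank lines should be dropped too, and the sample input is made up.

```shell
# Sample input (hypothetical): junk before the first ID and one blank line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
junk filler random data
ID = xyz name = abc
xxxxxxx = xxxxxxxxxx

ID = THE name = band
zzzzzz = 09/22/11
EOF

out=$(awk '
BEGIN { OFS = "," }
{
    if ($1 == "ID") {
        id = $0                  # remember the current ID line
        print $0, $0
    } else if (id != "" && NF) {
        print id, $0             # only once an ID was seen, never for blank lines
    }
}' "$tmp")
rm -f "$tmp"

printf '%s\n' "$out"
```

Lines before the first ID are skipped because id is still empty at that point, and NF is 0 for blank or whitespace-only lines.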

Linux_Kidd · 11-03-2011, 03:49 PM

i did some playing around with awk script, came up with this. it ignores all lines up until it finds $1=ID, and also ignores empty lines. seems to work ok. i will need to manually chop out of the output file a few lines at the end since my input file has no definitive marker, but no big deal. any way to simplify?

Code:

#!/bin/awk -f
BEGIN {
OFS="|";
}
{
        if ( NF == 0 ) {}
        else {
        if ( $1 == "ID" ) {
        id2=$0;
        id1="true";
        print $0,$0;}
        else {
                if ( id1 != "true" ) {}
                else {
                        print id2,$0;}
                }
}
}

grail · 11-04-2011, 12:15 AM

Is there more junk mixed in through the file after ID is discovered for the first time?
Are there always blank lines between IDs?

It helps if you can describe your input data more if we are to mold the solution.

Linux_Kidd · 11-04-2011, 05:28 AM

Quote:

Originally Posted by grail

Is there more junk mixed in through the file after ID is discovered for the first time?
Are there always blank lines between IDs?

It helps if you can describe your input data more if we are to mold the solution.

its a output file from CA Top Secret (mainframe report). i cannot find definitive patterns of blank lines or definitive markers between the junk and the 1st "ID". after the 1st ID the data flows ID-data-data-data ID-data-data ID-data-data-data-data-data, etc, with some blank lines in there. let me see if i can sanitize a portion of my real file (its a large file) and i will post it.
thnx.

Linux_Kidd · 11-04-2011, 10:34 AM

i ended up with this.

Code:

#!/bin/awk -f
BEGIN {
OFS="|";
}
{
        if ( NF == 0 || $0 ~ /(pattern1)|(pattern2)|(pattern3)|(pattern4)|(pattern5)|(pattern6)|(pattern7)|(pattern8)|(pattern9)/ ) {}
        else {
        if ( $1 == "ID" ) {
        id2=$1$2$3;
        id1="true";
        print id2,id2;}
        else {
                if ( id1 != "true" ) {}
                else {
                        print id2,$0;}
                }
}
}

grail · 11-04-2011, 10:44 AM

This seems overly complicated. Let me see if I understand the file structure:

1. Any amount of crap but definitely not the letters ID prior to the first invocation of ID

2. Once ID is found there will be lines of data to be prepended with the ID and a comma

3. There may also occur blank (you may need to confirm if blank means nothing but a newline or possibly could be white space as well) lines after ID is found

So based on the above the idea is ALL lines must be printed irrespective of their content, but any non-blank line after an ID string has been found must have the ID string and a comma inserted (correct?)

Code:

awk '/ID/{id = $0}id && NF{$0=id","$0}1' file
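Worth noting about the one-liner above: the trailing 1 is an always-true pattern, so every input line is printed, including the junk before the first ID and the blank lines (those simply go out without a prefix). The unanchored /ID/ also matches "ID" anywhere in a line. The sketch below runs the command as posted next to a variant that prints only non-blank lines after an ID has been seen; the sample input is made up and the variant is one possible reading of the requirement, not grail's code.

```shell
# Sample input (hypothetical).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
junk filler random data
ID = xyz name = abc
xxxxxxx = xxxxxxxxxx

ID = THE name = band
zzzzzz = 09/22/11
EOF

# As posted: the trailing 1 prints every line, prefixed or not.
all=$(awk '/ID/{id = $0}id && NF{$0=id","$0}1' "$tmp")

# Variant: print only non-blank lines once an ID line has been seen.
kept=$(awk '$1 == "ID" { id = $0 } id != "" && NF { print id "," $0 }' "$tmp")
rm -f "$tmp"

printf '%s\n' "--- as posted ---" "$all" "--- filtered variant ---" "$kept"
```

Comparing $1 with "ID" instead of using /^ID/ also copes with ID lines that carry leading whitespace.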

Linux_Kidd · 11-04-2011, 12:00 PM

Quote:

Originally Posted by grail

Is there more junk mixed in through the file after ID is discovered for the first time?
Are there always blank lines between IDs?

It helps if you can describe your input data more if we are to mold the solution.

grail,
here is raw source (sanitized fubar). i dont need anything until 1st occurrence of "ID", no blank lines, and i dont need "TSP0320I LIST FUNCTION SUCCESSFUL" near the end or anything after that, etc. notice i also skip the following (or similar crud) that is wedged between pages and/or ID's.

1COMPUTER ASSOCIATES ***** T S S C O M M A N D P R O C E S S O R ***** TSSSSSDB PAGE 2
CA-POT RET/VS 2.0 12/04/2010 11.36.04

my real source file is ~30k lines and has many many ID's with each ID having random # of lines associated with ID, etc. i dunno if the TS report can be created in different ways, but this is the raw source i have to work with.

can you get this into a one line awk? if so hats off to you. my code in post #11 does the job, but i like simpler if you can achieve that. thnx.

output is OFS="|"
and should look like this (the dots just mean continue on, etc)

output file

Code:

ID = TESTTEST|ID = TESTTEST
ID = TESTTEST|TYPE       = MASTER    SIZE       =     4352  BITS
ID = TESTTEST|FACILITY   = *ALL*
ID = TESTTEST|CREATED    = 07/25/01  LAST MOD   = 08/29/09  09:46
ID = TESTTEST|PROFILED   = PRFGTYU
.
.
.
ID = AKIM|ID = AKIM
ID = AKIM|TYPE       = CENTRAL   SIZE       =      512  BITS
ID = AKIM|FACILITY   = *ALL*
ID = AKIM|CREATED    = 08/29/09  LAST MOD   = 10/05/09  15:14
ID = AKIM|PROFILED   = PRFGTYU  PRFBATCH  PROFGEN
.
.
.

input file

Code:

1// JOB TSSLIST   ***  TSS INIT COMMANDS  ***                        DATE 12/04/2010, CLOCK 11/36/05
 // EXEC TSSSSSDB
 1S23D  PHASE TSSSSSDB IS TO BE FETCHED FROM CAISLIF.PRODUCT
1COMPUTER ASSOCIATES             ***** T S S   C O M M A N D   P R O C E S S O R *****             TSSSSSDB     PAGE    1
 CA-POT RET/VS 2.0                                                                            12/04/2010    11.36.04
  
 *------------------------------------------------------------------------------*
  
 TSS LIST(BASICS) DATA(ALL)
  
 TSS LIST(BASICS) DATA(ALL)
 ID = TESTTEST      NAME       = MASTER SECURITY
 TYPE       = MASTER    SIZE       =     4352  BITS
 FACILITY   = *ALL*
 CREATED    = 07/25/01  LAST MOD   = 08/29/09  09:46
 PROFILED   = PRFGTYU
 ATTRIBUTES = TTY1
 LAST USED  = 08/30/09 16:19 CPU(VPEA) FAC(ICDG    ) COUNT(16381)
 VSESLIB    = FJSWSRS.                    FARTSY.
 VSESLIB    = PRO1.                       PRO2.
 DATASET    = *****
 VOLUMES    = *ALL*(G)
 DCT        = *ALL*
 FCT        = *ALL*
 JCT        = *ALL*
 MODE       = WARN
 OTRAN      = DITT
 PANEL      = REXX
 PPT        = *ALL*
 TERMINAL   = *ALL*       K
 TST        = *ALL*
 XA VSELIB  = VSE.LIBRARY.BLABBER                            OWNER(IRMST   )
    ACCESS  = SOME
 XA VSELIB  = VSE.FARTSY.LIBRARY.BLABBER                     OWNER(IRMST   )
    ACCESS  = SOME
 XA VSELIB  = VSE.PRO1.LIBRARY.PRO1                         OWNER(IRMST   )
    ACCESS  = SOME
 XA VSELIB  = VSE.PRO2.LIBRARY.PRO2                         OWNER(IRMST   )
    ACCESS  = SOME
 XA VSELIB  = VSE.SYSRES.LIBRARY.IJSYSRS                    OWNER(IRMST   )
    ACCESS  = SOME
 XA DATASET = VSE                                           OWNER(IRMST   )
    ACCESS  = SOME
 XA MODE    = FAIL                                          OWNER(IRMST   )
 XA OTRAN   = *ALL*                                         OWNER(IRMST   )
    ACCESS  = SOME
 XA OTRAN   = MD                                            OWNER(IRMST   )
    ACCESS  = SOME
 XA OTRAN   = TSS                                           OWNER(IRMST   )
    ACCESS  = SOME
 -----------  SEGMENT CIPS
 OPIDENT    = AST
 BASICS      = AKIM    -SC SAM    -SC CIPS    (D) CLIMENT(V)
              CLEN    -SC COMM(V) ERTT    -SC GNIT    -SC
              IAUPTRUD-SC JOTC    -SC JTAO    -SC JVOL    -SC
              BLABBER  (Z) TESTY   (D)
  
 ID = AKIM      NAME       = VALUE ADD
1COMPUTER ASSOCIATES             ***** T S S   C O M M A N D   P R O C E S S O R *****             TSSSSSDB     PAGE    2
 CA-POT RET/VS 2.0                                                                            12/04/2010    11.36.04
  
 TYPE       = CENTRAL   SIZE       =      512  BITS
 FACILITY   = *ALL*
 CREATED    = 08/29/09  LAST MOD   = 10/05/09  15:14
 PROFILED   = PRFGTYU  PRFBATCH  PROFGEN
 ATTRIBUTES = TTY1,VSECATBT,VSERDDIR,VSESYSAD,VSEMCON
 LAST USED  = 10/05/09 15:14 CPU(VPEA) FAC(ICDG    ) COUNT(00177)
 -----------  SEGMENT CIPS
 OPIDENT    = OPD
 -----------  SEGMENT IESIS
 IESFL1     = BAS,COD,VSAT
 IESFL2     = BQS,ESC,CSU,CSD,OSPD,XSM
 IESINIT    = IPSEABH
 IESTYPE    = USERTYPE2,NEW,SELECT
 IESVCAT    = TESTCAP
 -----------  ADMINISTRATION AUTHORITIES
 RESOURCE   = XAUTH,INFO
    ACCESS  = SOME
 ECID       = *ALL*
 FACILITIES = *ALL*
 LIST DATA  = *ALL*,PROFILED,PASSFOO
 MISC1      = SUSPEND
 MISC8      = LISTSTC,LISTRDT,REMASUSP,MCS
  
 ID = SAM      NAME       = SAM WALBERG
 TYPE       = CENTRAL   SIZE       =      512  BITS
 FACILITY   = *ALL*
 CREATED    = 04/17/07  LAST MOD   = 08/01/09  11:50
 PROFILED   = PRFGTYU  PRFBATCH  PROFGEN
 ATTRIBUTES = TTY1,VSECATBT,VSERDDIR,VSESYSAD,VSEMCON
 LAST USED  = 09/28/09 01:59 CPU(VPEA) FAC(BATCH   ) COUNT(04165)
 -----------  SEGMENT CIPS
 OPIDENT    = OP5
 -----------  SEGMENT IESIS
 IESFL1     = BAT,COD,VSAM
 IESFL2     = BQA,ESC,COU,CMD,OLPD,XRM
 IESINIT    = IESEADM
 IESTYPE    = USERTYPE14,NEW,SELECT
 IESVCAT    = TESTCAP
 -----------  ADMINISTRATION AUTHORITIES
 RESOURCE   = *ALL*
    ACCESS  = SOME
 ECID       = *ALL*
 FACILITIES = *ALL*
 LIST DATA  = *ALL*,PROFILED,PASSFOO
 MISC1      = *ALL*
 MISC2      = *ALL*
 MISC3      = *ALL*
 MISC8      = LISTSTC,LISTRDT,REMASUSP,MCS,LISTSDT
 MISC9      = *ALL*

ID = RRQMTTTT  NAME       = RQAMT TEST USER
 TYPE       = USER      SIZE       =      512  BITS
 FACILITY   = TEST
 DEPT ECID  = TESTY     DEPARTMENT = TEST USERS
 CREATED    = 07/13/06  LAST MOD   = 08/24/06  15:29
 PROFILED   = PRFRRQMT  PRFBATCH
 LAST USED  = 07/14/06 13:33 CPU(VSEB) FAC(BATCH   ) COUNT(00012)
 XA VSESLIB = DB3LIPS.TESTBTCH                              OWNER(IRMST   )
    ACCESS  = READ
 XA OTRAN   = CEDF                                          OWNER(IRMST   )
    ACCESS  = EXECUTE
 -----------  SEGMENT CIPS
 OPIDENT    = BBJDS
  
 TSP0320I  LIST     FUNCTION SUCCESSFUL
  
  
 *------------------------------------------------------------------------------*
  
  
1COMPUTER ASSOCIATES             ***** T S S   C O M M A N D   P R O C E S S O R *****             TSSSSSDB     PAGE  444
 CA-POT RET/VS 2.0                                                                            12/04/2010    11.36.04
  
 TSS INPUT STATEMENTS READ              1
 TSS COMMANDS PROCESSED                 1
 TSS BATCH ENVIRONMENT ERRORS           0
 TSS COMMAND ERRORS                     0
1EOJ TSSLIST                                                         DATE 12/04/2010, CLOCK 11/36/20, DURATION   00/00/15
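One way the stated requirements could be folded into a short awk program is sketched below. It rests on assumptions read off the sample only: page headers are recognised by the strings "COMPUTER ASSOCIATES" and "CA-POT RET", the listing ends at the first line containing "TSP0320I", and the key is rebuilt from the first three fields of the ID line. The input here is a trimmed, hand-made excerpt, not the real report.

```shell
# Trimmed excerpt modelled on the sample report above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
 TSS LIST(BASICS) DATA(ALL)
 ID = TESTTEST      NAME       = MASTER SECURITY
 TYPE = MASTER

 ID = AKIM      NAME       = VALUE ADD
1COMPUTER ASSOCIATES   ***** T S S *****   TSSSSSDB     PAGE    2
 CA-POT RET/VS 2.0                  12/04/2010    11.36.04
 TYPE = CENTRAL

 TSP0320I  LIST     FUNCTION SUCCESSFUL
 TSS COMMANDS PROCESSED                 1
EOF

out=$(awk '
/TSP0320I/ { exit }                                    # end-of-listing marker, stop reading
!NF || /COMPUTER ASSOCIATES/ || /CA-POT RET/ { next }  # blank lines and page headers
$1 == "ID" { id = $1 " " $2 " " $3; print id "|" id; next }
id != "" { sub(/^[ \t]+/, ""); print id "|" $0 }       # data lines, leading blanks removed
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$out"
```

The exit on TSP0320I removes the need to chop the tail of the output by hand. If other page-header variants exist in the full report, the second rule is where their patterns would go.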

Nominal Animal · 11-05-2011, 12:22 PM

This should do the job, and also split each fact on a separate line. It does a bit more than you asked for, but I guess this is what I'd start with.

Code:

tr -s '\r\n' '\n\n' < infile | sed -e 's|[\t\v\f ]*=[\t\v\f ]*|=|g; s|[\t\v\f ][\t\v\f ]\+\([^\t\v\f =]\+\([\t\v\f ][^\t\v\f =]\+\)*=\)|\n\1|g' | sed -e 's|^[\t\v\f ]\+||; s|[\t\v\f ]\+$||; s|[\t\v\f ]\+)|)|g; s|[\t\v\f ][\t\v\f ]\+| |g' | awk '/^ID=/ { id=$0 ; print id ; next } /=/ && length(id) { print id "|" $0 }' >outfile

The tr converts all newline conventions to standard Unix newlines.

The first sed removes whitespace around equals signs. Also, if there are multiple consecutive whitespaces, followed by some term (which may contain nonconsecutive whitespaces) and an equals sign, it splits the line at the consecutive whitespace. This makes sure each fact is on its own line.

The second sed removes leading and trailing whitespace, all whitespace before a close parenthesis, and combines multiple consecutive whitespaces into one. (Because the first sed introduces new newlines, I find it is easiest to flatten the data stream by using a separate sed command. It makes it easier to develop such long pipe stanzas.)

The awk part picks the ID values (also printing them alone), and for any line containing an equals sign, prints the id and the line. If you do not need the ID alone, just omit the first print.
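To see the awk stage in isolation, it can be fed lines that are already normalized. The input below is hand-written to resemble what the tr and sed stages are expected to emit (one fact per line, no blanks around the equals sign); it is an assumption, not captured output.

```shell
# Hand-normalized sample (hypothetical), one fact per line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
TSS LIST(BASICS) DATA(ALL)
ID=AKIM
NAME=VALUE ADD
TYPE=CENTRAL
SIZE=512 BITS
----------- SEGMENT CIPS
OPIDENT=OPD
EOF

# Same awk program as in the pipeline above.
out=$(awk '/^ID=/ { id=$0 ; print id ; next } /=/ && length(id) { print id "|" $0 }' "$tmp")
rm -f "$tmp"

printf '%s\n' "$out"
```

Lines without an equals sign (the command echo and the SEGMENT separator here) are dropped, and nothing is printed until the first ID= line has set id.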

If the input may contain pipes, I recommend prepending s/|/!/g; to the first sed pattern.

If you prefer the whitespace around = and |, add | sed -e 's/\([|=]\)/ \1 /g' just before the >outfile.
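A quick check of that optional spacing step, written with an explicit capture group around the bracket expression so that \1 has something to refer to. The input line is made up.

```shell
# Put a space on each side of every = and | in the line.
out=$(printf '%s\n' 'ID=AKIM|TYPE=CENTRAL' | sed -e 's/\([|=]\)/ \1 /g')
printf '%s\n' "$out"
```

Inside a bracket expression the pipe is an ordinary character, so no extra escaping is needed there.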

To see which input lines are ignored/omitted by the above command, replace the end, starting at awk, with grep -v -e '=' -e '^[\t\v\f ]*$'

Linux_Kidd · 11-05-2011, 02:33 PM

Nominal,
i will try that. thnx.