LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   split multi line record into multiple files with awk (https://www.linuxquestions.org/questions/programming-9/split-multi-line-record-into-multiple-files-with-awk-768570/)

pcock 11-11-2009 11:42 PM

split multi line record into multiple files with awk
 
Hi I have a large file 'NS0923.csv' with data like the following. There are two records in this multi-record sample.


Code:

E60898,4578910,03/06/09,BEN BOYD RD,61,82,,,127,3,,52000.3046.001,3155,4.00,,PLT,1356,1.00,05/06/09,Y,Y,0551
,,,,,,,,,,,,4057,1.00,CLEAN CAR SHARE SIGN,LAB,0551,1.00,,,,
,,,,,,,,,,,,,,,LAB,3065,1.00,,,,
,,,,,,,,,,,,,,,MAT,PSTD,4.00,,,,
E60897,4575328,03/06/09,BEN BOYD LANE,62,78,,,127,3,,52000.3046.001,3155,1.00,,PLT,1356,0.50,05/06/09,Y,Y,0551
,,,,,,,,,,,,,,,LAB,0551,0.50,,,,
,,,,,,,,,,,,,,,LAB,3065,0.50,,,,
,,,,,,,,,,,,,,,MAT,PSTD,1.00,,,,

I need to create three outputs from the above file.

1. I have come up with the following awk script.

Code:

gawk  'BEGIN {FS=OFS=","}
{
 if ( $1 ~ /E[0-9]+(?:\.[0-9]*)?/ )
  {
      print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$19,$20,$21,$22
  }

}' NS0923.csv > processed.csv

that produces:

Code:

E60898,4578910,03/06/09,BEN BOYD RD,61,82,,,127,3,,52000.3046.001,05/06/09,Y,Y,0551
E60897,4575328,03/06/09,BEN BOYD LANE,62,78,,,127,3,,52000.3046.001,05/06/09,Y,Y,0551

comments/suggestions welcome :)

2. I still have to create a file 'transaction.csv' that should retrieve data from $13 - $15 with the identifying column $1.

Required output:

Code:

E60898,4057,1.00,CLEAN CAR SHARE SIGN

3. And finally another file 'quantity.csv'. Retrieving data from $16 - $18 with identifier $1.

Required output:

Code:

E60898,PLT,1356,1.00
E60898,LAB,0551,1.00
E60898,LAB,3065,1.00
E60898,MAT,PSTD,4.00
E60897,PLT,1356,0.50
E60897,LAB,0551,0.50
E60897,LAB,3065,0.50
E60897,MAT,PSTD,1.00

Thanks in advance :)

chrism01 11-12-2009 12:00 AM

The basic principle is simple in Perl
Code:

# split, retrieving only specified fields
$var1 = "asd,,,,z,x1,,c2";
@arr = (split(/,/, $var1))[0, 5];
print "@arr\n";


asd x1

Just specify the field nums you need for each output.

ghostdog74 11-12-2009 12:10 AM

ok, so you are not asking a question right? that's how you print your columns in awk. if you want, you can also use a for loop,
Code:

...
  for(i=1;i<=12;i++){ print $i }
  for(i=19;i<=22;i++){ print $i }
...


pcock 11-12-2009 12:17 AM

I'm afraid I do. I want to find out how can I produce

Code:

E60898,4057,1.00,CLEAN CAR SHARE SIGN
and

Code:

E60898,PLT,1356,1.00
E60898,LAB,0551,1.00
E60898,LAB,3065,1.00
E60898,MAT,PSTD,4.00
E60897,PLT,1356,0.50
E60897,LAB,0551,0.50
E60897,LAB,3065,0.50
E60897,MAT,PSTD,1.00


ghostdog74 11-12-2009 12:25 AM

just use the same method of printing as you had done at 1) ? or does the data actually have a newline?

pcock 11-12-2009 02:28 AM

I can only use $1 for the first line, how do I use the $1 from the first line for lines 2,3 & 4 say. Then for the next set of records I'll have to do the same. I am hoping that there's a way to do this in awk.
Thanks.

ghostdog74 11-12-2009 02:47 AM

you save the first $1 into a variable

pcock 11-12-2009 02:52 AM

sorry for the confussion.

not all lines start with EXXXX.
yes there are multiple lines of the same record.

bigearsbilly 11-12-2009 03:17 AM

well, if you RTFM you can see that you can use
the print directly to a file.

Code:

gawk  'BEGIN {FS=OFS=","}
{
 if ( $1 ~ /E[0-9]+(?:\.[0-9]*)?/ )
    {
      print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$19,$20,$21,$22
      print $1,$13,$15 > "transvestite.csv"
      print $1,$16,$18 > "quantitty.csv"
  }

  }' NS0923.csv  > processed.csv

simples

pcock 11-12-2009 03:54 AM

Okay, let me try this again - It's a multiline file.

record1,foo,hello,world
aa,bb,cc
xx,yy,zz
record2,bar,hello,world
dd,ee,ff
uu,vv,ww


I want a way to output this.

record1,aa,bb,cc
record1,xx,yy,zz
record2,dd,ee,ff
record2,uu,vv,ww

ghostdog74 11-12-2009 04:12 AM

Code:

$ awk -F"," '/record/{s=$1;next}{print s,$0}' OFS="," file
record1,aa,bb,cc
record1,xx,yy,zz
record2,dd,ee,ff
record2,uu,vv,ww



All times are GMT -5. The time now is 10:55 PM.