So, who want's to help me format a .csv? It'll be fun, promise!

pinecone · 07-31-2010, 10:00 AM

well, maybe not. I have a 7mb .csv that is in a completely unusable format. Right now it looks like this:

Code:

AAXJ,07/21/2010,53.9900
AAXJ,07/22/2010,54.8200
AAXJ,07/23/2010,55.2900
AAXJ,07/26/2010,55.7200
AAXJ,07/27/2010,56.0000
AAXJ,07/28/2010,55.8300
ACWI,01/02/2009,32.2900
ACWI,01/05/2009,32.7100
ACWI,01/06/2009,33.0300
ACWI,01/07/2009,32.3200
ACWX,01/02/2009,30.5500
ACWX,01/05/2009,30.7800
ACWX,01/06/2009,31.1500
ACWX,01/07/2009,30.5700
ACWX,01/08/2009,30.4200

Not exactly useful. I need to know how to sort it by the leading value. So:

Code:

01/01/09, AAAA, AAAB, AAAC
01/03/09, 33.23,22.0, 18.75

I know it's probably not feasible to match up the dates, that's fine. Having both a date and a value column for each symbol would be almost as perfect.

and so forth

grail · 07-31-2010, 10:33 AM

I think you are going to have to explain further as:

Quote:

Not exactly useful. I need to know how to sort it by the leading value.

Would appear to have nothing at all to do with the output:

Quote:

01/01/09, AAAA, AAAB, AAAC
01/03/09, 33.23,22.0, 18.75

pinecone · 07-31-2010, 10:40 AM

Quote:

Originally Posted by grail

I think you are going to have to explain further as:

Would appear to have nothing at all to do with the output:

you're right, that wasn't at all clear. I'm fairly tired right now.

Date, AAXJ, ACWI, ACWX, should be what the top line says.

druuna · 07-31-2010, 12:44 PM

Hi,

Your desired end-result is still a bit vague:

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/03/09, 33.23,22.0, 18.75

I assume you mean:

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/01/09, 33.23,22.0, 18.75

As in 2 lines with the same date. First line <date> data fields and second line corresponding date with number fields.

If that is the case and your example input from post #1 is relevant:

Code:

#!/bin/bash

inFile="infile"

sort -t, -k2,2 $inFile | \
awk -F, 'BEGIN {
  prevDate = "X"
  dataPart = ""
  numbPart = ""
}
{
  if ( prevDate != $2 ) {
    if ( prevDate != "X" ) {
      printf("%s,%s\n",prevDate,dataPart)
      printf("%s,%s\n",prevDate,numbPart)
    }
    prevDate = $2
    dataPart = ""
    numbPart = ""
  }
  dataPart = dataPart $1","
  numbPart = numbPart $3","
}'

Sample run:

Code:

$ ./csv.reformat.sh 
01/02/2009,ACWI,ACWX,ACWY,
01/02/2009,32.2900,30.5500,35.2900,
01/05/2009,ACWI,ACWX,
01/05/2009,32.7100,30.7800,
01/06/2009,ACWI,ACWX,ACWY,
01/06/2009,33.0300,31.1500,21.1500,
01/07/2009,ACWI,ACWX,
01/07/2009,32.3200,30.5700,
01/08/2009,ACWX,
01/08/2009,30.4200,
07/21/2010,AAXJ,AAXY,
07/21/2010,53.9900,43.9900,
07/22/2010,AAXJ,
07/22/2010,54.8200,
07/23/2010,AAXJ,
07/23/2010,55.2900,
07/26/2010,AAXJ,
07/26/2010,55.7200,
07/27/2010,AAXJ,
07/27/2010,56.0000,

2 things about the output generated:

1) I removed the spaces after the comma. They are probably not needed (the output being a csv file too)
2) There is a trailing , after each line. Probably not a problem, but you easily remove them by changing the last line ( }' ) with this: }' | sed 's/,$//'

Hope this helps.

BTW: I enlarged the input file for better testing, that's why the results won't match the data given in post #1.

pinecone · 08-09-2010, 09:14 AM

Thanks a bunch man, but this is not QUITE what I need. This is so close, and I think I should be able to figure this out on my own, but I'm just not able to. Here is what the final output has to look like.

Code:

Date,AAXJ,ACWI,ACWX
01/02/2009,,32.2900,30.5500
01/05/09,,32.7100,30.7800
etc,etc,etc,etc

Sorry for not all the confusion, and I hope this is clear. And thanks again for all the help.

grail · 08-09-2010, 10:32 AM

You could try something like:

Code:

#!/usr/bin/awk -f

BEGIN{FS=","}

/AAXJ/{aaxj[$2]=$3}
/ACWI/{acwi[$2]=$3}
/ACWX/{acwx[$2]=$3}

{d[$2]++}

END{
	printf "%7s %9s %7s %7s\n", "Date","AAXJ","ACWI","ACWX"

	n = asorti(d,f)
	for(;++x <= n;)
		printf "%10s %07.4f %07.4f %07.4f\n",f[x],aaxj[f[x]],acwi[f[x]],acwx[f[x]]
}

Call the script with your file as argument.

druuna · 08-09-2010, 10:57 AM

Hi,

I'm still willing to help you out, but you do need to be precise.....

You say this is the desired output, which looks weird to be honest:

Quote:

Date,AAXJ,ACWI,ACWX
01/02/2009,,32.2900,30.5500
01/05/09,,32.7100,30.7800

Shouldn't that be something like this:

Quote:

Date,AAXJ,ACWI,ACWX
01/02/2009,,32.2900,30.5500
01/05/2009,,32.7100,30.7800

Also; was my previous assumption correct? This:

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/03/09, 33.23,22.0, 18.75

vs

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/01/09, 33.23,22.0, 18.75

Assuming I'm right about the above mentioned points:

I do have a workable solution, if you are able to add an extra line to the 7 Mb csv file you have. This is needed otherwise the last line of input is not taken into account. I need a unique End-Of-File line.

I don't think it is too hard to add to your csv file: echo "ZZZZ,99/99/9999,99.9999" >> file.csv (Mind the >>, 2 of them, not 1!!!).

Code:

#!/bin/bash

inFile="csv.reformat.infile"

sort -t, -k2,2 $inFile | \
awk -F, 'BEGIN {
  prevDate = "X"
  dataArray[0] = ""
  dataArray[1] = ""
  dataArray[2] = ""
  dataArray[3] = ""

  print "Date,AAXJ,ACWI,ACWX"
}
{
  if ( prevDate != $2 ) {
    if ( prevDate != "X" ) {
       printf("%s,%s,%s,%s\n",dataArray[0],dataArray[1],dataArray[2],dataArray[3])
    }
    prevDate = $2
    dataArray[0] = ""
    dataArray[1] = ""
    dataArray[2] = ""
    dataArray[3] = ""
  }

  dataArray[0] = $2
  if ( $1 == "AAXJ") dataArray[1] = $3
  if ( $1 == "ACWI") dataArray[2] = $3
  if ( $1 == "ACWX") dataArray[3] = $3

}' 

exit 0

Example run:

Code:

 $ cat csv.reformat.infile
AAXJ,07/21/2010,53.9900
AAXJ,07/22/2010,54.8200
AAXJ,07/23/2010,55.2900
AAXJ,07/26/2010,55.7200
AAXJ,07/27/2010,56.0000
AAXJ,07/28/2010,55.8300
ACWI,01/02/2009,32.2900
ACWI,01/05/2009,32.7100
ACWI,01/06/2009,33.0300
ACWI,01/07/2009,32.3200
ACWX,01/02/2009,30.5500
ACWX,01/05/2009,30.7800
ACWX,01/06/2009,31.1500
ACWX,01/07/2009,30.5700
ACWX,01/08/2009,30.4200
AAXJ,12/02/2010,10.1000
ACWI,12/03/2010,20.2000
ACWX,12/04/2010,30.3000
ZZZZ,99/99/9999,99.9999

$ ./csv.reformat.sh
Date,AAXJ,ACWI,ACWX
01/02/2009,,32.2900,30.5500
01/05/2009,,32.7100,30.7800
01/06/2009,,33.0300,31.1500
01/07/2009,,32.3200,30.5700
01/08/2009,,,30.4200
07/21/2010,53.9900,,
07/22/2010,54.8200,,
07/23/2010,55.2900,,
07/26/2010,55.7200,,
07/27/2010,56.0000,,
07/28/2010,55.8300,,
12/02/2010,10.1000,,
12/03/2010,,20.2000,
12/04/2010,,,30.3000

The blue part I added to your original dataset for testing. The bold part is the extra line I was talking about earlier.

Hope this helps.

druuna · 08-09-2010, 11:01 AM

@grail: Just noticed your post. Its a beauty!

pinecone · 08-09-2010, 05:08 PM

Thanks ya'll these are absolutely brilliant. I think I have this under control now.