LinuxQuestions.org - So, who want's to help me format a .csv? It'll be fun, promise!

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - So, who want's to help me format a .csv? It'll be fun, promise! (https://www.linuxquestions.org/questions/programming-9/so-who-wants-to-help-me-format-a-csv-itll-be-fun-promise-823241/)

So, who want's to help me format a .csv? It'll be fun, promise!

well, maybe not. I have a 7mb .csv that is in a completely unusable format. Right now it looks like this:

Code:

AAXJ,07/21/2010,53.9900

AAXJ,07/22/2010,54.8200

AAXJ,07/23/2010,55.2900

AAXJ,07/26/2010,55.7200

AAXJ,07/27/2010,56.0000

AAXJ,07/28/2010,55.8300

ACWI,01/02/2009,32.2900

ACWI,01/05/2009,32.7100

ACWI,01/06/2009,33.0300

ACWI,01/07/2009,32.3200

ACWX,01/02/2009,30.5500

ACWX,01/05/2009,30.7800

ACWX,01/06/2009,31.1500

ACWX,01/07/2009,30.5700

ACWX,01/08/2009,30.4200

Not exactly useful. I need to know how to sort it by the leading value. So:

Code:

01/01/09, AAAA, AAAB, AAAC

01/03/09, 33.23,22.0, 18.75

I know it's probably not feasible to match up the dates, that's fine. Having both a date and a value column for each symbol would be almost as perfect.

and so forth

I think you are going to have to explain further as:

Quote:

Not exactly useful. I need to know how to sort it by the leading value.

Would appear to have nothing at all to do with the output:

Quote:

01/01/09, AAAA, AAAB, AAAC
01/03/09, 33.23,22.0, 18.75

Quote:

Originally Posted by grail (Post 4051001)

I think you are going to have to explain further as:

Would appear to have nothing at all to do with the output:

you're right, that wasn't at all clear. I'm fairly tired right now.

Date, AAXJ, ACWI, ACWX, should be what the top line says.

Hi,

Your desired end-result is still a bit vague:

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/03/09, 33.23,22.0, 18.75

I assume you mean:

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/01/09, 33.23,22.0, 18.75

As in 2 lines with the same date. First line <date> data fields and second line corresponding date with number fields.

If that is the case and your example input from post #1 is relevant:

Code:

#!/bin/bash



inFile="infile"



sort -t, -k2,2 $inFile | \

awk -F, 'BEGIN {

  prevDate = "X"

  dataPart = ""

  numbPart = ""

}

{

  if ( prevDate != $2 ) {

    if ( prevDate != "X" ) {

      printf("%s,%s\n",prevDate,dataPart)

      printf("%s,%s\n",prevDate,numbPart)

    }

    prevDate = $2

    dataPart = ""

    numbPart = ""

  }

  dataPart = dataPart $1","

  numbPart = numbPart $3","

}'

Sample run:

Code:

$ ./csv.reformat.sh 

01/02/2009,ACWI,ACWX,ACWY,

01/02/2009,32.2900,30.5500,35.2900,

01/05/2009,ACWI,ACWX,

01/05/2009,32.7100,30.7800,

01/06/2009,ACWI,ACWX,ACWY,

01/06/2009,33.0300,31.1500,21.1500,

01/07/2009,ACWI,ACWX,

01/07/2009,32.3200,30.5700,

01/08/2009,ACWX,

01/08/2009,30.4200,

07/21/2010,AAXJ,AAXY,

07/21/2010,53.9900,43.9900,

07/22/2010,AAXJ,

07/22/2010,54.8200,

07/23/2010,AAXJ,

07/23/2010,55.2900,

07/26/2010,AAXJ,

07/26/2010,55.7200,

07/27/2010,AAXJ,

07/27/2010,56.0000,

2 things about the output generated:

1) I removed the spaces after the comma. They are probably not needed (the output being a csv file too)
2) There is a trailing , after each line. Probably not a problem, but you easily remove them by changing the last line ( }' ) with this: }' | sed 's/,$//'

Hope this helps.

BTW: I enlarged the input file for better testing, that's why the results won't match the data given in post #1.

Thanks a bunch man, but this is not QUITE what I need. This is so close, and I think I should be able to figure this out on my own, but I'm just not able to. Here is what the final output has to look like.

Code:

Date,AAXJ,ACWI,ACWX

01/02/2009,,32.2900,30.5500

01/05/09,,32.7100,30.7800

etc,etc,etc,etc

Sorry for not all the confusion, and I hope this is clear. And thanks again for all the help.

You could try something like:

Code:

#!/usr/bin/awk -f



BEGIN{FS=","}



/AAXJ/{aaxj[$2]=$3}

/ACWI/{acwi[$2]=$3}

/ACWX/{acwx[$2]=$3}



{d[$2]++}



END{

        printf "%7s %9s %7s %7s\n", "Date","AAXJ","ACWI","ACWX"



        n = asorti(d,f)

        for(;++x <= n;)

                printf "%10s %07.4f %07.4f %07.4f\n",f[x],aaxj[f[x]],acwi[f[x]],acwx[f[x]]

}

Call the script with your file as argument.

Hi,

I'm still willing to help you out, but you do need to be precise.....

You say this is the desired output, which looks weird to be honest:

Quote:

Date,AAXJ,ACWI,ACWX
01/02/2009,,32.2900,30.5500
01/05/09,,32.7100,30.7800

Shouldn't that be something like this:

Quote:

Date,AAXJ,ACWI,ACWX
01/02/2009,,32.2900,30.5500
01/05/2009,,32.7100,30.7800

Also; was my previous assumption correct? This:

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/03/09, 33.23,22.0, 18.75

Quote:

01/01/09, AAXJ, ACWI, ACWX
01/01/09, 33.23,22.0, 18.75

Assuming I'm right about the above mentioned points:

I do have a workable solution, if you are able to add an extra line to the 7 Mb csv file you have. This is needed otherwise the last line of input is not taken into account. I need a unique End-Of-File line.

I don't think it is too hard to add to your csv file: echo "ZZZZ,99/99/9999,99.9999" >> file.csv (Mind the >>, 2 of them, not 1!!!).

Code:

#!/bin/bash



inFile="csv.reformat.infile"



sort -t, -k2,2 $inFile | \

awk -F, 'BEGIN {

  prevDate = "X"

  dataArray[0] = ""

  dataArray[1] = ""

  dataArray[2] = ""

  dataArray[3] = ""



  print "Date,AAXJ,ACWI,ACWX"

}

{

  if ( prevDate != $2 ) {

    if ( prevDate != "X" ) {

      printf("%s,%s,%s,%s\n",dataArray[0],dataArray[1],dataArray[2],dataArray[3])

    }

    prevDate = $2

    dataArray[0] = ""

    dataArray[1] = ""

    dataArray[2] = ""

    dataArray[3] = ""

  }



  dataArray[0] = $2

  if ( $1 == "AAXJ") dataArray[1] = $3

  if ( $1 == "ACWI") dataArray[2] = $3

  if ( $1 == "ACWX") dataArray[3] = $3



}' 



exit 0

Example run:

Code:

 $ cat csv.reformat.infile

AAXJ,07/21/2010,53.9900

AAXJ,07/22/2010,54.8200

AAXJ,07/23/2010,55.2900

AAXJ,07/26/2010,55.7200

AAXJ,07/27/2010,56.0000

AAXJ,07/28/2010,55.8300

ACWI,01/02/2009,32.2900

ACWI,01/05/2009,32.7100

ACWI,01/06/2009,33.0300

ACWI,01/07/2009,32.3200

ACWX,01/02/2009,30.5500

ACWX,01/05/2009,30.7800

ACWX,01/06/2009,31.1500

ACWX,01/07/2009,30.5700

ACWX,01/08/2009,30.4200

AAXJ,12/02/2010,10.1000

ACWI,12/03/2010,20.2000

ACWX,12/04/2010,30.3000

ZZZZ,99/99/9999,99.9999



$ ./csv.reformat.sh

Date,AAXJ,ACWI,ACWX

01/02/2009,,32.2900,30.5500

01/05/2009,,32.7100,30.7800

01/06/2009,,33.0300,31.1500

01/07/2009,,32.3200,30.5700

01/08/2009,,,30.4200

07/21/2010,53.9900,,

07/22/2010,54.8200,,

07/23/2010,55.2900,,

07/26/2010,55.7200,,

07/27/2010,56.0000,,

07/28/2010,55.8300,,

12/02/2010,10.1000,,

12/03/2010,,20.2000,

12/04/2010,,,30.3000

The blue part I added to your original dataset for testing. The bold part is the extra line I was talking about earlier.

Hope this helps.

@grail: Just noticed your post. Its a beauty!

Thanks ya'll these are absolutely brilliant. I think I have this under control now.