LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   CSV file - Duplicate records need merging | BASH? (https://www.linuxquestions.org/questions/programming-9/csv-file-duplicate-records-need-merging-%7C-bash-649452/)

colucix 12-09-2010 05:46 AM

Assuming you can use the code with the asorti function, as per post #2:
Code:

BEGIN{ FS = ","; getline }
{
  balance[$1] = balance[$1] + $2
}
END{
  n = asorti(balance,indices)

  for (i = 1; i <= n; i++)
      printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}

you can change it to the following:
Code:

BEGIN{ FS = ","; getline }
{
  balance[$1] = ( balance[$1] "," $2 )
}
END{
  n = asorti(balance,indices)

  for (i = 1; i <= n; i++)
      printf "%s%s\n", indices[i], balance[indices[i]]
}

Here you don't sum $2, but concatenate values in a string, using comma as separator. Then you have to change the format in the printf statement, since you have to print out a string and not a floating point number. The same modifications can be applied to the other versions of the code. Hope this helps.

czezz 12-09-2010 06:36 AM

For the original input file it works perfectly.
However my real file is slight different... I thought it will be easy for me to modify it but it occur much more difficult.

here is a sample
Code:

# text text text
12/7/10 00:00,gg2a,15791,3372,4018,5,
12/7/10 00:00,gg2b,4961,92,31190,4,
# text2 text2 text2
12/7/10 00:00,gg2a,1.8840170106E10,3.043735864E9,1.5796434242E10,1.7081492E7,
12/7/10 00:00,gg2b,8.6964647131E10,1.1799862993E10,7.5164784138E10,7.1079514E7,

What I was going to achive:
Code:

12/7/10 00:00,gg2a,15791,3372,4018,5,1.8840170106E10,3.043735864E9,1.5796434242E10,1.7081492E7,
12/7/10 00:00,gg2b,4961,92,31190,4,8.6964647131E10,1.1799862993E10,7.5164784138E10,7.1079514E7,

My script modification is that (but doesnt work too good :/ )
Code:

BEGIN{ FS = ","; getline }
{
  balance[$1 "," ,$2, "," $3 "," ,$4 "," ,$5 "," ,$6] = ( balance[$1 "," ,$2, "," $3 "," ,$4 "," ,$5 "," ,$6] "," $3 "," $4 )
}
END{
  n = asorti(balance,indices)

  for (i = 1; i <= n; i++)
      printf "%s%s\n", indices[i], balance[indices[i]]
}


colucix 12-09-2010 07:00 AM

Yes, things are a bit more complicate here. First a question: based on what criteria do you merge lines? How many variants of the "gg" field may occur?

czezz 12-09-2010 07:36 AM

merge line criteria is:
Code:

12/7/10 00:00,gg2a
where "gg" may be: gg2a, gg2b, gg4a, gg4b, etc.

There 2 types of lines but each contain 6 columns separated by comma:
eg. type 1:
Code:

12/7/10 00:00,gg2b,4961,92,31190,4,
eg. type 2:
Code:

12/7/10 00:00,gg2a,1.8840170106E10,3.043735864E9,1.5796434242E10,1.7081492E7,
Sometimes between them, hashed lines may occur which should be ignored.

colucix 12-09-2010 08:43 AM

Here is a working example based on the input data in post #17:
Code:

BEGIN { FS = "," }

! /^#/ {
 
  balance[$1 "," $2] = ( balance[$1 "," $2] "," $3  "," $4  "," $5 "," $6 )

}

END {
    n = asorti(balance,indices)

    for (i = 1; i <= n; i++)
        printf "%s%s,\n", indices[i], balance[indices[i]]
}

First note the (negated) regular expression before the main rule: ! /^#/. This excludes every line that begins with a hash (or in other world the rule is applied to every line that does not begin with hash). The rest should be clear, as you already tried something similar.

I have some doubt about the sorting process, anyway. First, if you want to sort by date (for example from the oldest to the most recent) you should have a date format that can be naturally sorted in an alphanumeric sense. For example:
Code:

10/07/12 00:00
10/08/13 04:00

in this way the asorti function sorts strings by means of the alphanumeric order and the result is automaticaaly sorted by date. In alternative you might transform the date string in a date number (julian date), sort them numerically and finally transform them back to the original format. You can do this using awk's time functions.

czezz 12-10-2010 04:32 AM

Works perfect! Thank you.

There is one my big mistake which makes to you not clear situation.
The flag gg should always be "gg[one digit 0-9][one a-z].
I have written that:
Quote:

where "gg" may be: gg2a, gg2b, gg4a, gg4b, etc.
then by mistake I have written in one example other name, longer name which alrdy corrected.
Sorry for that.

colucix 12-10-2010 05:01 AM

Corrected my posts accordingly! :)


All times are GMT -5. The time now is 09:53 AM.