Rearrange columns in a csv and do calculations

curiouslinus · 11-25-2021, 01:14 AM

Hello All,

A newbie here.

I have a need to rearrange a few entries from a csv file and am trying to achieve it using awk or sed.

Input file:

abc,cpu,10
abc,mem,20
abc,disk,10
def,cpu,5
def,mem,10
def,disk,50
xyz,cpu,20
xyz,mem,40
xyz,disk,30

Expected output:

App,cpu,mem,disk
abc,10,20,10
def,5,10,50
xyz,20,40,30
Bal,65,30,10

The Bal is calculated by adding the column values and subtracting by 100.

Any ideas on whats the best way to achieve this?

shruggy · 11-25-2021, 01:33 AM

See this thread. It's not what you're looking for, but may give you some ideas. This transformation is usually called pivoting or reshaping data long-to-wide. Searching for these terms will give you plenty of results.

E.g. using miller that would be

Code:

mlr --csv -N reshape -s 2,3

curiouslinus · 11-25-2021, 02:07 AM

Unfortunately, its a restricted environment and I cannot install any additional command line tools like miller. But thanks a lot for the thread link. I did find a few useful pointers there and will be trying out those.

syg00 · 11-25-2021, 02:20 AM

Quote:

Originally Posted by curiouslinus

Unfortunately, its a restricted environment and I cannot install any additional command line tools like miller.

In which case let's hope you have gawk - I have found the extensions in-built by GNU make it such a pain trying to use a lesser awk I now no longer even try to be posix compliant.

shruggy · 11-25-2021, 03:04 AM

Well, it largely depends on the structure of your input data. As long as it is as simple as in your example above, you can easily process them with awk. But CSV is a tricky format. If your data may contain embedded commas and/or double quotes, using Perl or Python with a CSV parsing module would be preferable. gawk may give you a remedy in form of FPAT/patsplit, but it won't account for all possible CSV quirks.

Another option is using a spreadsheet software. Most support pivot tables and will import from / export to CSV.

curiouslinus · 11-25-2021, 03:28 AM

The input file is actually generated by another script. So I do have some control over the structure of the data.
Anyway, I have made some progress by combining a few awk examples I came across.

awk -F\, '
NR>1 {
if(!($1 in app)) { app[++num] = $1 }; app[$1]++
if(!($2 in infra)) { infra[++types] = $2 }; infra[$2]++
map[$2,$1] = $3
}
END {
printf "%s," ,"App";
for(ind=1; ind<=types; ind++) {
printf "%s%s", sep, infra[ind];
sep = ","
}
print "";
for(coun=1; coun<=num; coun++) {
printf "%s", app[coun]
for(val=1; val<=types; val++) {
printf "%s%s", sep, map[infra[val], app[coun]];
}
print ""
}
}' input_file

Output
App,cpu,mem,disk
abc,10,20,10
def,5,10,50
xyz,20,40,30

shruggy · 11-25-2021, 04:20 AM

Well, if order of cpu,mem,disk lines is always the same and you don't have to preserve the order of the results, this will do as well:

Code:

awk -F\, '
{ stats[$1] = stats[$1] FS $3 }
END {
  print"App,cpu,mem,disk"
  for (app in stats) print app stats[app]
}' input_file

That said, gawk allows you to control array scanning order.

I guess you want to keep running totals for each stat in the main awk loop as well, and print them at the end

Code:

totals[$2]+=$3

curiouslinus · 11-25-2021, 05:27 AM

Thanks a lot for that!!

Really helpful.

syg00 · 11-25-2021, 05:35 AM

As shruggy demonstrated, it pays not to constrain your [g]awk coding with old-style language paradigms. It's quite fun to play with.

If you appreciate his posts, click on the "yes" button to enhance his rep - well deserved.

allend · 12-03-2021, 07:57 AM

Just for fun. (Called with awk -F\, -f in.awk -- in.txt).

Code:

BEGIN {
# For array subscripts
  x = "A"
  y = "B"
  z = "C"
}
{
# Read the file
# Collect unique entries
  a[$1]++
  b[$2]++

# Collect data in an array
  d[NR, x] = $1
  d[NR, x, y] = $2
  d[NR, x, y, z] = $3
}
END {
# Print column headers
  ORS=""
  print "App"
  for (j in b) {
    print " " j
  }
  print "\n"
# Print data in transformed format
  for (i in a) {
    print i
    for (j in b) {
      for (n=1; n<=NR; n++) {
        if (d[n, x] == i && d[n, x, y] == j) {
          print ","  d[n, x, y, z]
# Save running totals
          t[j] += d[n, x, y, z]
        }
      }
    }
  print "\n"
  }
# Print totals
  print "Sum"
  for (j in t) {
    print "," t[j]
  }
  print "\n"
# Print 100 -Totals
  print "Bal"
  for (j in t) {
    print "," 100-t[j]
  }
  print "\n"
}

shruggy · 12-04-2021, 09:00 AM

@allend.

Code:

print " " j

probably should be

Code:

print "," j

But more importantly,

Quote:

Originally Posted by The GNU Awk User’s Guide

The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within awk and in standard awk cannot be controlled or changed.

Here are the results of running your script (with the change above applied):

gawk 5.1.1:

Code:

App,mem,disk,cpu
def,10,50,5
abc,20,10,10
xyz,40,30,20
Sum,70,90,35
Bal,30,10,65

mawk 1.3.4^20200120:

Code:

App,cpu,disk,mem
abc,10,10,20
def,5,50,10
xyz,20,30,40
Sum,35,90,70
Bal,65,10,30

busybox 1.33.0:

Code:

App,cpu,mem,disk
abc,10,20,10
def,5,10,50
xyz,20,40,30
Sum,35,70,90
Bal,65,30,10

BWK awk (aka OneTrueAwk aka original-awk) 20210215:

Code:

App,disk,cpu,mem
abc,10,10,20
def,50,5,10
xyz,30,20,40
Sum,90,35,70
Bal,10,65,30

Quote:

Originally Posted by curiouslinus

So I do have some control over the structure of the data.

It would make sense to generate input data sorted in lexicographical order then. I.e. cpu,disk,mem rather than cpu,mem,disk. This way the gawk script could be greatly simplified:

Code:

#!/usr/bin/gawk -f
# Assuming the input data are already sorted (sort -t,)
BEGIN { FS=","; PROCINFO["sorted_in"]="@ind_str_asc" }
{
  stats[$1] = stats[$1] FS $3
  totals[$2] += $3
}
END {
  print "App,cpu,disk,mem"
  for (app in stats) print app stats[app]
  printf "Bal"
  for (t in totals) printf ",%s", 100-totals[t]
  print ""
}

Speaking of fun. Python, of course, has the standard module csv, but reshaping table data with pandas or agate definitely makes more fun. Here it is with agate:

Code:

#!/usr/bin/python3
import agate

c_names=('App', 'stat', 'val')
table = agate.Table.from_csv('in.csv',
                             column_names=c_names,
                             header=None)
result = table.denormalize(*c_names)
totals = result.aggregate([
            (1, agate.Sum('cpu')),
            (2, agate.Sum('mem')),
            (3, agate.Sum('disk'))
        ]).values()

result.print_csv()
print('Bal', *[100-x for x in totals], sep=',')

allend · 12-05-2021, 09:34 AM

@shruggy - Thanks for taking the time on my experimentation with multidimensional arrays.
Nice catch on the

Code:

print "," j

The sorting with gawk can be addressed using 'asorti', but that appears to be an unsupported function in busybox awk.
I do not have mawk or BWK.

Code:

# Print column headers
  k=asorti(a, f)
  m=asorti(b, g)
  ORS=""
  print "App"
  for (j=1; j<=m; j++) {
    print "," g[j]
  }
  print "\n"
# Print data in transformed format
  for (i=1; i<=k; i++) {
    print f[i]
    for (j=1; j<=k; j++) {
     for (n=1; n<=NR; n++) {
        if (d[n, x] == f[i] && d[n, x, y] == g[j]) {
          print ","  d[n, x, y, z]

gawk 5.1.1

Code:

App,cpu,disk,mem
abc,10,10,20
def,5,50,10
xyz,20,30,40
Sum,35,90,70
Bal,65,10,30