LinuxQuestions.org
Help answer threads with 0 replies.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Server
User Name
Password
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Notices


Reply
  Search this Thread
Old 11-25-2021, 01:14 AM   #1
curiouslinus
LQ Newbie
 
Registered: Nov 2021
Posts: 4

Rep: Reputation: Disabled
Rearrange columns in a csv and do calculations


Hello All,

A newbie here.

I have a need to rearrange a few entries from a csv file and am trying to achieve it using awk or sed.

Input file:

abc,cpu,10
abc,mem,20
abc,disk,10
def,cpu,5
def,mem,10
def,disk,50
xyz,cpu,20
xyz,mem,40
xyz,disk,30


Expected output:

App,cpu,mem,disk
abc,10,20,10
def,5,10,50
xyz,20,40,30
Bal,65,30,10

The Bal is calculated by adding the column values and subtracting by 100.

Any ideas on whats the best way to achieve this?
 
Old 11-25-2021, 01:33 AM   #2
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,688

Rep: Reputation: Disabled
See this thread. It's not what you're looking for, but may give you some ideas. This transformation is usually called pivoting or reshaping data long-to-wide. Searching for these terms will give you plenty of results.

E.g. using miller that would be
Code:
mlr --csv -N reshape -s 2,3

Last edited by shruggy; 11-25-2021 at 01:46 AM.
 
Old 11-25-2021, 02:07 AM   #3
curiouslinus
LQ Newbie
 
Registered: Nov 2021
Posts: 4

Original Poster
Rep: Reputation: Disabled
Unfortunately, its a restricted environment and I cannot install any additional command line tools like miller. But thanks a lot for the thread link. I did find a few useful pointers there and will be trying out those.
 
Old 11-25-2021, 02:20 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,163

Rep: Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125
Quote:
Originally Posted by curiouslinus View Post
Unfortunately, its a restricted environment and I cannot install any additional command line tools like miller.
In which case let's hope you have gawk - I have found the extensions in-built by GNU make it such a pain trying to use a lesser awk I now no longer even try to be posix compliant.

Last edited by syg00; 11-25-2021 at 03:08 AM. Reason: cleanup
 
1 members found this post helpful.
Old 11-25-2021, 03:04 AM   #5
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,688

Rep: Reputation: Disabled
Well, it largely depends on the structure of your input data. As long as it is as simple as in your example above, you can easily process them with awk. But CSV is a tricky format. If your data may contain embedded commas and/or double quotes, using Perl or Python with a CSV parsing module would be preferable. gawk may give you a remedy in form of FPAT/patsplit, but it won't account for all possible CSV quirks.

Another option is using a spreadsheet software. Most support pivot tables and will import from / export to CSV.

Last edited by shruggy; 11-25-2021 at 04:30 AM.
 
Old 11-25-2021, 03:28 AM   #6
curiouslinus
LQ Newbie
 
Registered: Nov 2021
Posts: 4

Original Poster
Rep: Reputation: Disabled
The input file is actually generated by another script. So I do have some control over the structure of the data.
Anyway, I have made some progress by combining a few awk examples I came across.

awk -F\, '
NR>1 {
if(!($1 in app)) { app[++num] = $1 }; app[$1]++
if(!($2 in infra)) { infra[++types] = $2 }; infra[$2]++
map[$2,$1] = $3
}
END {
printf "%s," ,"App";
for(ind=1; ind<=types; ind++) {
printf "%s%s", sep, infra[ind];
sep = ","
}
print "";
for(coun=1; coun<=num; coun++) {
printf "%s", app[coun]
for(val=1; val<=types; val++) {
printf "%s%s", sep, map[infra[val], app[coun]];
}
print ""
}
}' input_file

Output
App,cpu,mem,disk
abc,10,20,10
def,5,10,50
xyz,20,40,30
 
Old 11-25-2021, 04:20 AM   #7
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,688

Rep: Reputation: Disabled
Well, if order of cpu,mem,disk lines is always the same and you don't have to preserve the order of the results, this will do as well:
Code:
awk -F\, '
{ stats[$1] = stats[$1] FS $3 }
END {
  print"App,cpu,mem,disk"
  for (app in stats) print app stats[app]
}' input_file
That said, gawk allows you to control array scanning order.

I guess you want to keep running totals for each stat in the main awk loop as well, and print them at the end
Code:
totals[$2]+=$3

Last edited by shruggy; 11-25-2021 at 05:24 AM.
 
2 members found this post helpful.
Old 11-25-2021, 05:27 AM   #8
curiouslinus
LQ Newbie
 
Registered: Nov 2021
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks a lot for that!!

Really helpful.
 
Old 11-25-2021, 05:35 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,163

Rep: Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125Reputation: 4125
As shruggy demonstrated, it pays not to constrain your [g]awk coding with old-style language paradigms. It's quite fun to play with.

If you appreciate his posts, click on the "yes" button to enhance his rep - well deserved.
 
Old 12-03-2021, 07:57 AM   #10
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,391

Rep: Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765
Just for fun. (Called with awk -F\, -f in.awk -- in.txt).
Code:
BEGIN {
# For array subscripts
  x = "A"
  y = "B"
  z = "C"
}
{
# Read the file
# Collect unique entries
  a[$1]++
  b[$2]++

# Collect data in an array
  d[NR, x] = $1
  d[NR, x, y] = $2
  d[NR, x, y, z] = $3
}
END {
# Print column headers
  ORS=""
  print "App"
  for (j in b) {
    print " " j
  }
  print "\n"
# Print data in transformed format
  for (i in a) {
    print i
    for (j in b) {
      for (n=1; n<=NR; n++) {
        if (d[n, x] == i && d[n, x, y] == j) {
          print ","  d[n, x, y, z]
# Save running totals
          t[j] += d[n, x, y, z]
        }
      }
    }
  print "\n"
  }
# Print totals
  print "Sum"
  for (j in t) {
    print "," t[j]
  }
  print "\n"
# Print 100 -Totals
  print "Bal"
  for (j in t) {
    print "," 100-t[j]
  }
  print "\n"
}

Last edited by allend; 12-03-2021 at 08:01 AM.
 
Old 12-04-2021, 09:00 AM   #11
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,688

Rep: Reputation: Disabled
@allend.
Code:
print " " j
probably should be
Code:
print "," j
But more importantly,
Quote:
Originally Posted by The GNU Awk User’s Guide
The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within awk and in standard awk cannot be controlled or changed.
Here are the results of running your script (with the change above applied):

gawk 5.1.1:
Code:
App,mem,disk,cpu
def,10,50,5
abc,20,10,10
xyz,40,30,20
Sum,70,90,35
Bal,30,10,65
mawk 1.3.4^20200120:
Code:
App,cpu,disk,mem
abc,10,10,20
def,5,50,10
xyz,20,30,40
Sum,35,90,70
Bal,65,10,30
busybox 1.33.0:
Code:
App,cpu,mem,disk
abc,10,20,10
def,5,10,50
xyz,20,40,30
Sum,35,70,90
Bal,65,30,10
BWK awk (aka OneTrueAwk aka original-awk) 20210215:
Code:
App,disk,cpu,mem
abc,10,10,20
def,50,5,10
xyz,30,20,40
Sum,90,35,70
Bal,10,65,30

Quote:
Originally Posted by curiouslinus View Post
So I do have some control over the structure of the data.
It would make sense to generate input data sorted in lexicographical order then. I.e. cpu,disk,mem rather than cpu,mem,disk. This way the gawk script could be greatly simplified:
Code:
#!/usr/bin/gawk -f
# Assuming the input data are already sorted (sort -t,)
BEGIN { FS=","; PROCINFO["sorted_in"]="@ind_str_asc" }
{
  stats[$1] = stats[$1] FS $3
  totals[$2] += $3
}
END {
  print "App,cpu,disk,mem"
  for (app in stats) print app stats[app]
  printf "Bal"
  for (t in totals) printf ",%s", 100-totals[t]
  print ""
}
Speaking of fun. Python, of course, has the standard module csv, but reshaping table data with pandas or agate definitely makes more fun. Here it is with agate:
Code:
#!/usr/bin/python3
import agate

c_names=('App', 'stat', 'val')
table = agate.Table.from_csv('in.csv',
                             column_names=c_names,
                             header=None)
result = table.denormalize(*c_names)
totals = result.aggregate([
            (1, agate.Sum('cpu')),
            (2, agate.Sum('mem')),
            (3, agate.Sum('disk'))
        ]).values()

result.print_csv()
print('Bal', *[100-x for x in totals], sep=',')
 
Old 12-05-2021, 09:34 AM   #12
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,391

Rep: Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765Reputation: 2765
@shruggy - Thanks for taking the time on my experimentation with multidimensional arrays.
Nice catch on the
Code:
print "," j
The sorting with gawk can be addressed using 'asorti', but that appears to be an unsupported function in busybox awk.
I do not have mawk or BWK.
Code:
# Print column headers
  k=asorti(a, f)
  m=asorti(b, g)
  ORS=""
  print "App"
  for (j=1; j<=m; j++) {
    print "," g[j]
  }
  print "\n"
# Print data in transformed format
  for (i=1; i<=k; i++) {
    print f[i]
    for (j=1; j<=k; j++) {
     for (n=1; n<=NR; n++) {
        if (d[n, x] == f[i] && d[n, x, y] == g[j]) {
          print ","  d[n, x, y, z]
gawk 5.1.1
Code:
App,cpu,disk,mem
abc,10,10,20
def,5,50,10
xyz,20,30,40
Sum,35,90,70
Bal,65,10,30
 
  


Reply

Tags
csv, flatten, reshape



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
For multiple csv files how do I add the value of one particular entry in any given csv to that csv's name? sean mckinney Programming 8 01-22-2021 09:46 AM
How to print lines in csv file if 1 csv column field = "text". There are 10 column (;) in csv file nexuslinux Linux - Newbie 9 04-22-2016 11:35 PM
awk - rearrange column data in csv file to match columns wolverene13 Programming 9 12-21-2011 04:55 AM
Map 1 CSV's columns to matching columns in another CSV 2legit2quit Programming 7 10-27-2011 08:53 AM
Rearrange the column in CSV file.. govi1234 Linux - Newbie 9 08-27-2010 07:15 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Server

All times are GMT -5. The time now is 04:24 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration