Old 10-21-2008, 01:59 PM   #1
sharky
Member
 
Registered: Oct 2002
Posts: 569

Rep: Reputation: 84
Script / two files and matching multiple columns


I have two files. One is a generated report that lists jobs executed on our load sharing facility. It has a column of userids and a column showing how much CPU time was spent on each job. Any user could be listed one or more times, in no particular order.

example

u4 31.3
u1 61.3
u4 381.2
u3 1.5
u1 34.8
u1 0.3
u5 9.0
u2 111.1
etc...

The second file lists each userid, their real name and the dept. they belong to.

example - delimited as shown, but I can remove the double quotes if needed.

"u1" "John Doe" "D1"
"u2" "Jane Doe" "D1"
"u3" "Bart Simpson" "D2"
"u4" "Homer Simpson" "D1"
"u5" "Julius Ceasar" "D3"

The goal is a script to modify the first file so that the dept. and real name are added. I can do this in a spreadsheet but would prefer a more automated method.

I can come up with something eventually, but what I'm wondering is if there is some trivial awk or perl that would make it easy. I'm always looking for something easy. :-)
 
Old 10-21-2008, 02:58 PM   #2
Disillusionist
Senior Member
 
Registered: Aug 2004
Location: England
Distribution: Ubuntu
Posts: 1,039

Rep: Reputation: 98
Read the files into two separate arrays
Check the first word from each line of the report array
Compare that to your reference array

Merge the relevant data from the reference and report arrays and create a new output file.

Was going to post code, but thought this smelled a little too much like homework.

If you get stuck, post the code that you have written and we will suggest where you may have gone wrong.
 
Old 10-21-2008, 03:18 PM   #3
sharky
Member
 
Registered: Oct 2002
Posts: 569

Original Poster
Rep: Reputation: 84
Quote:
Originally Posted by Disillusionist View Post
Read the files into two separate arrays
Check the first word from each line of the report array
Compare that to your reference array

Merge the relevant data from the reference and report arrays and create a new output file.

Was going to post code, but thought this smelled a little too much like homework.

If you get stuck, post the code that you have written and we will suggest where you may have gone wrong.
It ain't homework. I'm 48 years old and work for a living. :-)

I can create a perl script or something to do what you describe. What really had me curious was the possibility of something less tedious.

For example, in the raw data file I can calculate a user's total CPU usage with a single line of awk:

cat ./usage.report | grep "$U" | awk '{SUM += $6} END {print $1, SUM}'

A foreach block is wrapped around it to parse through each user.
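I suppose the per-user loop isn't strictly necessary either; the whole thing could probably be collapsed into a single awk pass that totals into an array, something like this (untested, and assuming the CPU time is still in field 6):

Code:
awk '{sum[$1] += $6} END {for (u in sum) print u, sum[u]}' ./usage.report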

However, your 'algorithm' looks fairly straightforward and probably as simple as it'll get.

thx,
 
Old 10-21-2008, 03:24 PM   #4
jcookeman
Member
 
Registered: Jul 2003
Location: London, UK
Distribution: FreeBSD, OpenSuse, Ubuntu, RHEL
Posts: 417

Rep: Reputation: 33
quick python -- not pretty :)

Code:
#!/usr/bin/env python2.5

from __future__ import with_statement

users = {}
entries = []

with open('users.map', 'r') as user_map:
    for line in user_map:
        entry = [str.strip('"') for str in line.split()]
        users[entry[0]] = {'uname':entry[1] + ' ' + entry[2],
                           'dept':entry[3]}

with open('job.log', 'r') as jobs:
    for line in jobs:
        uid = line.split()[0]
        entries.append(line.strip() + ' ' +
                       users[uid]['uname'] + ' ' +
                       users[uid]['dept'] + '\n')

print entries
 
Old 10-21-2008, 04:17 PM   #5
sharky
Member
 
Registered: Oct 2002
Posts: 569

Original Poster
Rep: Reputation: 84
Quote:
Originally Posted by jcookeman View Post
Code:
#!/usr/bin/env python2.5

from __future__ import with_statement

users = {}
entries = []

with open('users.map', 'r') as user_map:
    for line in user_map:
        entry = [str.strip('"') for str in line.split()]
        users[entry[0]] = {'uname':entry[1] + ' ' + entry[2],
                           'dept':entry[3]}

with open('job.log', 'r') as jobs:
    for line in jobs:
        uid = line.split()[0]
        entries.append(line.strip() + ' ' +
                       users[uid]['uname'] + ' ' +
                       users[uid]['dept'] + '\n')

print entries
I used Perl. This is what I came up with:
Code:
#!/usr/bin/perl

open(USERS, "user_tbl.csv") or die "Cannot open user_tbl.csv: $!";
open(CPUTIME, "sum_user_cputime.csv") or die "Cannot open sum_user_cputime.csv: $!";
  while ($users = <USERS>)
    {
      @listusers = split (/ /,$users);
      # rescan the whole CPU-time file for each user, then rewind it below
      while ($cputime = <CPUTIME>)
      {
        @listcputime = split (/ /,$cputime);
        if ( "$listusers[0]" eq "$listcputime[0]" )
        {
          # had to chop a linefeed here
          chop $listusers[2];
          print "$listusers[0] $listusers[1] $listusers[2] $listcputime[1]";

        }
      }
      seek CPUTIME, 0, 0;
    }
close USERS;
close CPUTIME;
How do you like Python? I hear a lot of good things about it - except from the Perl worshippers. :-(
 
Old 10-21-2008, 05:05 PM   #6
jcookeman
Member
 
Registered: Jul 2003
Location: London, UK
Distribution: FreeBSD, OpenSuse, Ubuntu, RHEL
Posts: 417

Rep: Reputation: 33
Perl is excellent, but I believe Python is more elegant. I, however, am not a member of any zealot movement. So, use what makes you comfortable.

...that doesn't mean from time to time I don't take a stab at others' expense.
 
Old 10-21-2008, 05:19 PM   #7
forrestt
Senior Member
 
Registered: Mar 2004
Location: Cary, NC, USA
Distribution: Fedora, Kubuntu, RedHat, CentOS, SuSe
Posts: 1,288

Rep: Reputation: 99
If you can easily get file2 to look like:

Code:
u1 "John Doe" "D1"
u2 "Jane Doe" "D1"
u3 "Bart Simpson" "D2"
u4 "Homer Simpson" "D1"
u5 "Julius Ceasar" "D3"
Then you can run:
Code:
awk 'NR==FNR{ users[$1]=$2" "$3" "$4; next } {print $1,users[$1],$2}' file2 file1
What it is doing is testing whether the total record count (NR) equals the per-file record count (FNR), which is only true while the first file is being read. If it is, it stores fields 2, 3 and 4 (joined with spaces) in the users array, keyed by field 1, and goes on to the next record (so the printing part is skipped). Once the first file has been read and awk starts on the second, NR no longer equals FNR, so next is not hit and the printing part runs: it prints the first field, the value stored in users under that first field, then the second field.
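On the sample data above that should give output along these lines:
Code:
u4 "Homer Simpson" "D1" 31.3
u1 "John Doe" "D1" 61.3
u4 "Homer Simpson" "D1" 381.2
u3 "Bart Simpson" "D2" 1.5
u1 "John Doe" "D1" 34.8
u1 "John Doe" "D1" 0.3
u5 "Julius Ceasar" "D3" 9.0
u2 "Jane Doe" "D1" 111.1
And if you would rather have one line per user with the CPU times totalled, the same NR==FNR trick works with a sum array (untested sketch):
Code:
awk 'NR==FNR{ users[$1]=$2" "$3" "$4; next } { sum[$1]+=$2 } END { for (u in sum) print u, users[u], sum[u] }' file2 file1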

Last edited by forrestt; 10-22-2008 at 09:58 AM. Reason: changed the word "record" to "field" to be accurate.
 
  


