LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   parsing a text file - to awk or not to awk ? (http://www.linuxquestions.org/questions/programming-9/parsing-a-text-file-to-awk-or-not-to-awk-897594/)

rollyah 08-16-2011 07:59 AM

parsing a text file - to awk or not to awk ?
 
Dear all,

i have a text file with a content as such:

Quote:

01 ,42052,2011-08-15 ,15:23:00,Pass
01 ,43676,2011-08-15 ,09:00:00,Pass
01 ,43676,2011-08-15 ,13:19:00,Pass
01 ,43676,2011-08-15 ,13:28:00,Pass
01 ,43676,2011-08-15 ,16:19:00,Pass
01 ,44246,2011-08-15 ,08:56:00,Pass
01 ,44246,2011-08-15 ,09:25:00,Pass
01 ,44246,2011-08-15 ,09:30:00,Pass
01 ,44246,2011-08-15 ,13:32:00,Pass
01 ,44246,2011-08-15 ,13:44:00,Pass
01 ,44246,2011-08-15 ,15:44:00,Pass
01 ,45756,2011-08-15 ,09:13:00,Pass
01 ,45756,2011-08-15 ,16:14:00,Pass
01 ,54545,2011-08-15 ,08:52:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
What i want is the following:

ID 44246 came in at 08:56:00 and left at 15:44:00 totaling 6 hours and 31 mins.

i've started with just parsing through the text file and trying to calculate the total of hours where i got stuck as u can see below:
Quote:


ID=(52019
30580
2801
54391
53283
44246
45756
43676
58989
22456
22272
54545
22671
50018
5713
15872
54853
29972
44627
2877
6771
30884
42052
29670
2530)

for (( i=0; i<${#ID[@]}; i++ ));
do
more file | grep ${ID[$i]} | awk -F, '{total = a[$4]; total = total - a[$4]++} { print "ID" ${ID[$i]} "came in at $4 and left at $4 totaling ...... }'
as you can see i have trouble with the logic.. any help would be appreciated.

NB: is it possible to have a two dimensional array in shell? if yes, that would solve this for me as i can change the loop and read line by line while adding the $4 time to it's respectives ID.

acid_kewpie 08-16-2011 08:38 AM

You don't need a 2d array here, you can but there's no need really.

so pseudo awkish code i'd probably use would be something like
Code:

{
  id = $2
  date = $3
  time = $4
  epoch = convert_to_epoch(time)
  if ( intime[id] > epoch ) {
    intime[id]  = epoch
  }
  if ( outtime[id] < epoch ) {
    outtime[id]  = epoch
  }
}
END {
  for (id in intime)
    print intime[id] " blah blah " outtime[id]
}

http://www.gnu.org/s/gawk/manual/htm...Functions.html

EDIT - the fields are in order aren't they? I thought there were only two ID's there, but there are more. that's even easier, no need to compare anything, just take the first value, and the last for each ID.

theNbomr 08-16-2011 08:43 AM

Is the data always sorted on the second field, evidently 'ID'? Does the time calculation ever have to deal with the likes of change of date, change of daylight savings, etc.? You posted data, but did not use [code] [/code] tags, so formatting may have been altered. Please post your sample data in code tags (same for your code). Is a Perl solution acceptable? (time oriented calculations may be better handled with Perl).

--- rod.

grail 08-16-2011 09:40 AM

Here is another just awk alternative:
Code:

#!/usr/bin/awk -f

BEGIN{ FS = " *, *" }

!x && $2 == 44246{
    x=1
    print "Start time: "$4
   
    split($3,d,"-")
    split($4, t, ":")
   
    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])

    st = mktime(tmp)
}

x && $2 != 44246{
    print "End time: "lastt
   
    split(lasty, d, "-")
    split(lastt, t, ":")
   
    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])
   
    ft = mktime(tmp)

    mod = ((tot = ft - st) > 86400?int(tot/86400)" day(s) ":"")"%T"

    print strftime("The time spent is "mod, tot, 1)

    exit
}

x{ lasty = $3
    lastt = $4 }

This also assumes the data is sorted by ID and then date / time.

rollyah 08-17-2011 12:43 AM

Code:

01 ,42052,2011-08-15 ,15:23:00,Pass
01 ,43676,2011-08-15 ,09:00:00,Pass
01 ,43676,2011-08-15 ,13:19:00,Pass
01 ,43676,2011-08-15 ,13:28:00,Pass
01 ,43676,2011-08-15 ,16:19:00,Pass
01 ,44246,2011-08-15 ,08:56:00,Pass
01 ,44246,2011-08-15 ,09:25:00,Pass
01 ,44246,2011-08-15 ,09:30:00,Pass
01 ,44246,2011-08-15 ,13:32:00,Pass
01 ,44246,2011-08-15 ,13:44:00,Pass
01 ,44246,2011-08-15 ,15:44:00,Pass
01 ,45756,2011-08-15 ,09:13:00,Pass
01 ,45756,2011-08-15 ,16:14:00,Pass
01 ,54545,2011-08-15 ,08:52:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass

Code:

ID=(52019
30580
2801
54391
53283
44246
45756
43676
58989
22456
22272
54545
22671
50018
5713
15872
54853
29972
44627
2877
6771
30884
42052
29670
2530)

for (( i=0; i<${#ID[@]}; i++ ));
do
more file | grep ${ID[$i]} | awk -F, '{total = a[$4]; total = total - a[$4]++} { print "ID" ${ID[$i]} "came in at $4 and left at $4 totaling ...... }'

thanks for all of your replies.
i'll be answering all questions below:



Chris,
There are more than 100 ID, with different dates.

Rod,
Yes, the output is always the same. fields are always ordered as shown above. and the date will change as the file will be exported every day at midnight, and the calculation will happen then.
Time saving shouldn't be taken into consideration as the most important part for me is the number of hours the person spent INSIDE. the rest of the info is secondary.
While searching for a way to solve this, i came across a number of threads confirming your advice (using perl ) though i'm not familiar with it yet.

grail,
May i bother you in explaining the steps above?
i'm currently reading "awk & sed" book so i get my head around awk and what you posted is way too advanced for me to get on my own


NOTE: i'll have to calculate the hours spent for every id and not just 44246.

Thank you all for your help so far.

Best,

--Roland

grail 08-17-2011 03:47 AM

This one seems to take care of the data for each ID:
Code:

#!/usr/bin/awk -f

BEGIN{ FS = " *, *" }    # sets the delimiter as a comma with zero or more spaces before or after

x != $2{    # Check for when the ID changes

    if( lastt ){    # if last time is set, ie only after getting first ID
        print "End time: "lastt

        ft = gettime(lasty, lastt)    # Get final time in a format that awk can understand

        print strftime("The time spent is %T", tot, 1)    # print the output in HH:MM:SS format
    }

    x = $2    # Set current ID
    print "ID:",x,"Start time: "$4

    st = gettime($3, $4)    # Get start time in a format that awk can understand
}

# Store date and time to be used on changing ID
    lasty = $3
    lastt = $4
}

END{    # The last ID will need to be processed here (same as if above)
    print "End time: "lastt

    ft = gettime(lasty, lastt)

    print strftime("The time spent is %T", tot, 1)
}

function gettime(date, time,    tmp, d, t)
{
    split(date, d, "-")    # Split the date into array d based on dashes
    split(time, t, ":")    # Split the time into array t based on colons

    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])    # mktime function requires the data in this order

    return mktime(tmp)    # return time based on fields of the combined arrays
}

Let me know if you need more explanation? You can get all the details for this from here

Nominal Animal 08-17-2011 11:45 AM

Here's the gawk script I'd probably use:
Code:

#!/usr/bin/gawk -f

BEGIN {
        RS="[\n\t\v\f\r ]*[\n\r][\n\t\v\f\r ]*"
        FS="[\t ]*[,][\t ]*"

        split("", idlist)
        split("", daylist)
        split("", pass)
}

# Add an int value to a sorted string of space-separated ints,
# unless the int is already in the list.
# Returns the list as a space-separated string.
function mergeintstring(string, value) {
        result = ""
        v = int(value)
        n = split(string, list, " ")
        i = 1;

        while (i <= n && int(list[i]) < v)
                result = result " " int(list[i++])

        while (i <= n && int(list[i]) == v)
                i++

        result = result " " v

        while (i <= n)
                result = result " " int(list[i++])

        return substr(result, 2)
}

($5 == "Pass") {
        id = int($2)

        # Convert date to numeric value, YYYYMMDD
        day = $3
        gsub(/[^0-9]+/, "", day)
        day = int(day)

        # List key is a combination of the date and the user ID
        key = day " " id

        # Add user ID and date to known lists
        idlist[id] = id
        daylist[day] = day

        # Convert time to seconds
        split($4, list, ":")
        sec = int(60 * 60 * list[1] + 60 * list[2] + list[3])

        # Add the pass time in seconds to the sorted list
        # for this user and this date
        pass[key] = mergeintstring(pass[key], sec)
}

END {
        # Sort the known users and dates in ascending order
        ids = asort(idlist)
        days = asort(daylist)

        for (iday = 1; iday <= days; iday++) {
                day = daylist[iday]
                daystr = sprintf("%d-%02d-%02d", int(day / 10000), int(day / 100) % 100, int(day % 100))

                for (iid = 1; iid <= ids; iid++) {
                        id = idlist[iid]
                        key = day " " id

                        if (key in pass) {

                                # Split the list of pass seconds, and calculate total
                                n = split(pass[key], list, " ")
                                enter = list[1]
                                leave = list[1]
                                total = 0
                                for (i = 1; i < n; i += 2) {
                                        leave = list[i + 1]
                                        total += leave - list[i]
                                }

                                # Convert to minutes (round, not truncate)
                                enter = int(enter / 60 + 0.5)
                                leave = int(leave / 60 + 0.5)
                                total = int(total / 60 + 0.5)
                                final = int(list[n] / 60 + 0.5)

                                if (n % 2)
                                        printf("%s ID %s: %02d:%02d - %02d:%02d for %02d:%02d ! unpaired pass at %02d:%02d\n",
                                              daystr, id,
                                              int(enter / 60), int(enter % 60),
                                              int(leave / 60), int(leave % 60),
                                              int(total / 60), int(total % 60),
                                              int(final / 60), int(final % 60) )
                                else
                                        printf("%s ID %s: %02d:%02d - %02d:%02d for %02d:%02d\n",
                                              daystr, id,
                                              int(enter / 60), int(enter % 60),
                                              int(leave / 60), int(leave % 60),
                                              int(total / 60), int(total % 60) )
                        } else
                                printf("%s ID %s: Out\n", daystr, id)
                }

                # Add an empty line after each date
                printf("\n")
        }
}

The order of the input does not matter to this script at all; it does not need to be sorted at all. This script also ignores duplicate input records. (I am assuming a successful "Pass" takes at least a second, i.e. the same user cannot pass twice in the same second.)

It first reads in the entire file, storing each pass event in a sorted list (string), keyed by date and ID. All dates and IDs seen are also kept in separate lists for display purposes. (The mergeintstring() function takes a string containing a space-separated list of ints, sorted in ascending order, and inserts the int into it unless the string already contains it. The function returns the list as a space-separated string.)

The END rule will loop thorough each date. If the user did not pass that day, then only "Out" is printed for that user. Otherwise the sorted list of pass times is processed. If there is an unpaired pass, it will be ignored in the work time calculations; it will be shown, though. If an unpaired pass occurs, it will always be of course the last event for that day. The last printf() in the script will add an empty line between dates; you can freely remove it (and of course change the output formats) as you wish.

The output for your example input is
Code:

2011-08-15 ID 42052: 15:23 - 15:23 for 00:00 ! unpaired pass at 15:23
2011-08-15 ID 43676: 09:00 - 16:19 for 07:10
2011-08-15 ID 44246: 08:56 - 15:44 for 06:31
2011-08-15 ID 45756: 09:13 - 16:14 for 07:01
2011-08-15 ID 54545: 08:52 - 16:20 for 07:28


rollyah 08-18-2011 04:24 AM

thank you for your help
both of your advices proved useful.
and i'm using it.

sundialsvcs 08-18-2011 09:00 AM

Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.

When records are sorted, you know two things for certain:
  1. All occurrences of a given key-value are adjacent.
  2. If any gaps exist, the records within that gap are nonexistent.
You can process arbitrarily large data-sets (whups, there goes my 1960's-era batch mainframe terminology again...) considering only two records at a time. And that, my good friend, is huge.

Furthermore, "external sorting" is one of the most heavily-studied algorithms out there ... hence the title of Dr. Knuth's book is Sorting and Searching. You can sort multiple millions of records today in mere seconds, using external programs that are lovingly crafted for the purpose.

Think back to the days of punched cards, long before digital computers existed. Those were the techniques that were used back then. That's what "all those spinning reels of tape" were for. It worked then, and it still works now.

Nominal Animal 08-18-2011 02:20 PM

Quote:

Originally Posted by sundialsvcs (Post 4446675)
Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.

A semi-rant follows :p

I disagree, for two reasons:
  1. There is usually no predetermined order in the data.
    Preferred data order depends on the algorithm used to process the data. Imposing a set order to the data limits algorithm choices, and therefore hinders modularity.
    In this specific case, do you sort by user ID, by date, or by string (as lines)?
  2. There usually is a variant of the algorithm that is order insensitive.
    Either there is no sorting done at all, or sorting can be done as a by-product of another operation, at smaller cost that is otherwise possible.

In my gawk script, the unique dates and unique IDs are sorted as lists (for output purposes only -- to get the output sorted by dates and IDs), but the individual events are not. For each ID on each date a separate list of pass events is kept in a sorted list. Due to gawk limitations, that list is a string, and the insert operation is less than optimal.. but there is no actual sort of all events done. As an example, you can switch the output order from dates to IDs by just swapping the outer loops in the END rule.

I'd say my algorithm works a bit like a radix sort on the data. Although quicksort is faster on small data sets, a radix sort is O(N) and will always be faster on large enough data sets. I do realize sundialsvcs was more concerned about memory usage than CPU time used; I just want to point out there are other, perhaps more important considerations here. In this case I consider even CPU time secondary; simplicity, modularity, and ease of adaptation was my principal goals here.

As to memory use, every sorting method requires access to every element. External sorting methods do this in blocks, with data dynamically grouped to "local" and "external" (changing as the algorithm progresses), with "local" data worked on. gawk arrays are limited to RAM+SWAP, but on most machines, that is enough for truly large datasets. In this specific case, a gigabyte of input data contains 25 million pass records; that is almost 300 pass events per second for a full day.

For a larger system, the algorithm I used in my gawk script could be reimplemented in C99, using the memory mapping techniques I've demonstrated earlier. On a 64-bit platform that implementation would basically be limited only by maximum file size. The example program demonstrates a terabyte data set; that maps to 792 pass events per second for a year, 24/7.

Unless I'm severely mistaken, the access patterns would be very efficient, as long as the input has order. If the input is more or less sorted either by date or by user ID, the kernel would write to each file block just once on average, if there is enough RAM. If not, the kernel will have to do extra I/O, but it'll still work fine; it will just take longer.

In general I very emphatically agree with sundialsvcs that one should not forget about all the experience and algorithms developed earlier, specifically those developed to work with severe memory and CPU restrictions. However, I personally feel we should really understand the boundary conditions that applied to those, and recognize that many of them no longer apply. The virtual memory capabilities of 64-bit platforms -- being able to map terabytes of data into memory, and letting the kernel worry about which bits to keep in RAM, and which bits on disk -- is one key example.

I work with large datasets, with simulators that are typically cyclically CPU-starved, then I/O-starved. They most definitely do not have to be, if more efficient algorithms (and workload management) were used. The basic argument against changes is typically "but this works, and is stable; we do not want to change it". I can accept that attitude for existing software, but not for new software. I think we should build and improve upon existing knowledge -- and not just rely on it.

I apologise for the semi-rant. I hope sundialsvcs is not too offended; I'm just frustrated because most programmers do not realize many of the physical limits that have restricted algorithms and their development and implementations no longer apply.


All times are GMT -5. The time now is 10:50 PM.