LinuxQuestions.org > Programming
parsing a text file - to awk or not to awk? (https://www.linuxquestions.org/questions/programming-9/parsing-a-text-file-to-awk-or-not-to-awk-897594/)

rollyah 08-16-2011 07:59 AM

parsing a text file - to awk or not to awk ?
 
Dear all,

I have a text file with content like this:

Quote:

01 ,42052,2011-08-15 ,15:23:00,Pass
01 ,43676,2011-08-15 ,09:00:00,Pass
01 ,43676,2011-08-15 ,13:19:00,Pass
01 ,43676,2011-08-15 ,13:28:00,Pass
01 ,43676,2011-08-15 ,16:19:00,Pass
01 ,44246,2011-08-15 ,08:56:00,Pass
01 ,44246,2011-08-15 ,09:25:00,Pass
01 ,44246,2011-08-15 ,09:30:00,Pass
01 ,44246,2011-08-15 ,13:32:00,Pass
01 ,44246,2011-08-15 ,13:44:00,Pass
01 ,44246,2011-08-15 ,15:44:00,Pass
01 ,45756,2011-08-15 ,09:13:00,Pass
01 ,45756,2011-08-15 ,16:14:00,Pass
01 ,54545,2011-08-15 ,08:52:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
What I want is the following:

ID 44246 came in at 08:56:00 and left at 15:44:00 totaling 6 hours and 31 mins.

I've started by just parsing through the text file and trying to calculate the total hours, which is where I got stuck, as you can see below:
Quote:


ID=(52019
30580
2801
54391
53283
44246
45756
43676
58989
22456
22272
54545
22671
50018
5713
15872
54853
29972
44627
2877
6771
30884
42052
29670
2530)

for (( i=0; i<${#ID[@]}; i++ ));
do
    grep "${ID[$i]}" file | awk -F, -v id="${ID[$i]}" '{total = a[$4]; total = total - a[$4]++} { print "ID " id " came in at " $4 " and left at " $4 " totaling ......" }'
done
As you can see, I have trouble with the logic. Any help would be appreciated.

NB: Is it possible to have a two-dimensional array in shell? If yes, that would solve this for me, as I could change the loop to read line by line while adding the $4 time to its respective ID.

acid_kewpie 08-16-2011 08:38 AM

You don't need a 2D array here; you can use one, but there's really no need.
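(For the record, awk fakes multi-dimensional arrays with composite subscripts, so if you really wanted one, something along these lines would work; this is a hypothetical sketch, not part of the solution below.)
Code:

{ seen[$2, $4] = 1 }               # same as seen[$2 SUBSEP $4]
END {
    for (k in seen) {
        split(k, part, SUBSEP)     # part[1] = ID, part[2] = time
        print part[1], part[2]
    }
}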

So the pseudo-awk-ish code I'd probably use would be something like:
Code:

{
  id    = $2
  date  = $3
  time  = $4
  epoch = convert_to_epoch(date, time)
  # track the earliest and latest pass per ID; the membership test
  # stops an unset intime[id] (i.e. 0) from always winning
  if ( !(id in intime) || intime[id] > epoch ) {
    intime[id] = epoch
  }
  if ( outtime[id] < epoch ) {
    outtime[id] = epoch
  }
}
END {
  for (id in intime)
    print intime[id] " blah blah " outtime[id]
}

http://www.gnu.org/s/gawk/manual/htm...Functions.html
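For completeness, a convert_to_epoch() built on the mktime() function from that page might look like this (mktime() takes its datespec string in "YYYY MM DD HH MM SS" order):
Code:

function convert_to_epoch(date, time,    d, t) {
    split(date, d, "-")    # "2011-08-15" -> d[1], d[2], d[3]
    split(time, t, ":")    # "15:23:00"   -> t[1], t[2], t[3]
    return mktime(d[1] " " d[2] " " d[3] " " t[1] " " t[2] " " t[3])
}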

EDIT: the fields are in order, aren't they? I thought there were only two IDs there, but there are more. That's even easier: no need to compare anything, just take the first value and the last for each ID.
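A minimal sketch of that idea (assuming each ID's lines appear in time order, as in the sample):
Code:

awk -F' *, *' '!($2 in first){ first[$2] = $4 }    # first time seen for this ID
               { last[$2] = $4 }                   # keep overwriting with the latest
               END{ for (id in first) print "ID " id ": " first[id] " - " last[id] }' file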

theNbomr 08-16-2011 08:43 AM

Is the data always sorted on the second field, evidently 'ID'? Does the time calculation ever have to deal with the likes of a change of date, a change to or from daylight saving time, etc.? You posted data, but did not use [code] [/code] tags, so the formatting may have been altered. Please post your sample data in code tags (same for your code). Is a Perl solution acceptable? (Time-oriented calculations may be better handled with Perl.)

--- rod.

grail 08-16-2011 09:40 AM

Here is another awk-only alternative:
Code:

#!/usr/bin/awk -f

BEGIN{ FS = " *, *" }    # delimiter is a comma with optional surrounding spaces

# First record for ID 44246: remember its start time
!x && $2 == 44246{
    x = 1
    print "Start time: "$4

    split($3, d, "-")
    split($4, t, ":")

    tmp = sprintf("%d %d %d %d %d %d", d[1], d[2], d[3], t[1], t[2], t[3])

    st = mktime(tmp)    # epoch seconds of the first pass
}

# First record after the ID changes: the previous line held the end time
x && $2 != 44246{
    print "End time: "lastt

    split(lasty, d, "-")
    split(lastt, t, ":")

    tmp = sprintf("%d %d %d %d %d %d", d[1], d[2], d[3], t[1], t[2], t[3])

    ft = mktime(tmp)    # epoch seconds of the last pass

    # prefix the count of whole days if the span exceeds 24 hours
    mod = ((tot = ft - st) > 86400 ? int(tot / 86400)" day(s) " : "")"%T"

    print strftime("The time spent is "mod, tot, 1)

    exit
}

# While inside the 44246 block, remember the last date and time seen
x{ lasty = $3; lastt = $4 }

This also assumes the data is sorted by ID and then by date/time.
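If the file weren't already in that order, you could impose it first, for example (with the script above saved as script.awk, a name picked just for illustration):
Code:

sort -t, -k2,2n -k3,3 -k4,4 file | awk -f script.awk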

rollyah 08-17-2011 12:43 AM

Code:

01 ,42052,2011-08-15 ,15:23:00,Pass
01 ,43676,2011-08-15 ,09:00:00,Pass
01 ,43676,2011-08-15 ,13:19:00,Pass
01 ,43676,2011-08-15 ,13:28:00,Pass
01 ,43676,2011-08-15 ,16:19:00,Pass
01 ,44246,2011-08-15 ,08:56:00,Pass
01 ,44246,2011-08-15 ,09:25:00,Pass
01 ,44246,2011-08-15 ,09:30:00,Pass
01 ,44246,2011-08-15 ,13:32:00,Pass
01 ,44246,2011-08-15 ,13:44:00,Pass
01 ,44246,2011-08-15 ,15:44:00,Pass
01 ,45756,2011-08-15 ,09:13:00,Pass
01 ,45756,2011-08-15 ,16:14:00,Pass
01 ,54545,2011-08-15 ,08:52:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass

Code:

ID=(52019
30580
2801
54391
53283
44246
45756
43676
58989
22456
22272
54545
22671
50018
5713
15872
54853
29972
44627
2877
6771
30884
42052
29670
2530)

for (( i=0; i<${#ID[@]}; i++ ));
do
    grep "${ID[$i]}" file | awk -F, -v id="${ID[$i]}" '{total = a[$4]; total = total - a[$4]++} { print "ID " id " came in at " $4 " and left at " $4 " totaling ......" }'
done

Thanks for all of your replies.
I'll answer each question below:



Chris,
There are more than 100 IDs, with different dates.

Rod,
Yes, the output is always the same; the fields are always ordered as shown above. The date will change, as the file is exported every day at midnight, and the calculation will happen then.
Daylight saving shouldn't need to be taken into consideration, as the most important part for me is the number of hours the person spent INSIDE; the rest of the info is secondary.
While searching for a way to solve this, I came across a number of threads confirming your advice (using Perl), though I'm not familiar with it yet.

grail,
May I bother you to explain the steps above?
I'm currently reading the "sed & awk" book to get my head around awk, and what you posted is way too advanced for me to work out on my own.


NOTE: I'll have to calculate the hours spent for every ID, not just 44246.

Thank you all for your help so far.

Best,

--Roland

grail 08-17-2011 03:47 AM

This one seems to take care of the data for each ID:
Code:

#!/usr/bin/awk -f

BEGIN{ FS = " *, *" }    # sets the delimiter as a comma with zero or more spaces before or after

x != $2{    # Check for when the ID changes

    if( lastt ){    # if last time is set, i.e. only after the first ID has been seen
        print "End time: "lastt

        ft = gettime(lasty, lastt)    # Get final time in a format that awk can understand

        tot = ft - st    # Total seconds between first and last pass for this ID

        print strftime("The time spent is %T", tot, 1)    # print the output in HH:MM:SS format
    }

    x = $2    # Set current ID
    print "ID:",x,"Start time: "$4

    st = gettime($3, $4)    # Get start time in a format that awk can understand
}

{    # Store date and time to be used on changing ID
    lasty = $3
    lastt = $4
}

END{    # The last ID will need to be processed here (same as if above)
    print "End time: "lastt

    ft = gettime(lasty, lastt)

    tot = ft - st    # Total seconds for the final ID

    print strftime("The time spent is %T", tot, 1)
}

function gettime(date, time,    tmp, d, t)
{
    split(date, d, "-")    # Split the date into array d based on dashes
    split(time, t, ":")    # Split the time into array t based on colons

    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])    # mktime function requires the data in this order

    return mktime(tmp)    # return time based on fields of the combined arrays
}

Let me know if you need more explanation. You can get all the details for this from the gawk manual.
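To try it, save the script as, say, bytime.awk (a name picked just for illustration) and run either of:
Code:

awk -f bytime.awk file
chmod +x bytime.awk && ./bytime.awk file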

Nominal Animal 08-17-2011 11:45 AM

Here's the gawk script I'd probably use:
Code:

#!/usr/bin/gawk -f

BEGIN {
        RS="[\n\t\v\f\r ]*[\n\r][\n\t\v\f\r ]*"
        FS="[\t ]*[,][\t ]*"

        split("", idlist)
        split("", daylist)
        split("", pass)
}

# Add an int value to a sorted string of space-separated ints,
# unless the int is already in the list.
# Returns the list as a space-separated string.
function mergeintstring(string, value,    result, v, n, list, i) {
        result = ""
        v = int(value)
        n = split(string, list, " ")
        i = 1

        while (i <= n && int(list[i]) < v)
                result = result " " int(list[i++])

        while (i <= n && int(list[i]) == v)
                i++

        result = result " " v

        while (i <= n)
                result = result " " int(list[i++])

        return substr(result, 2)
}
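# Example: mergeintstring("100 300", 200) returns "100 200 300";
# mergeintstring("100 300", 300) returns "100 300" (duplicates are dropped).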

($5 == "Pass") {
        id = int($2)

        # Convert date to numeric value, YYYYMMDD
        day = $3
        gsub(/[^0-9]+/, "", day)
        day = int(day)

        # List key is a combination of the date and the user ID
        key = day " " id

        # Add user ID and date to known lists
        idlist[id] = id
        daylist[day] = day

        # Convert time to seconds
        split($4, list, ":")
        sec = int(60 * 60 * list[1] + 60 * list[2] + list[3])

        # Add the pass time in seconds to the sorted list
        # for this user and this date
        pass[key] = mergeintstring(pass[key], sec)
}

END {
        # Sort the known users and dates in ascending order
        ids = asort(idlist)
        days = asort(daylist)

        for (iday = 1; iday <= days; iday++) {
                day = daylist[iday]
                daystr = sprintf("%d-%02d-%02d", int(day / 10000), int(day / 100) % 100, int(day % 100))

                for (iid = 1; iid <= ids; iid++) {
                        id = idlist[iid]
                        key = day " " id

                        if (key in pass) {

                                # Split the list of pass seconds, and calculate total
                                n = split(pass[key], list, " ")
                                enter = list[1]
                                leave = list[1]
                                total = 0
                                for (i = 1; i < n; i += 2) {
                                        leave = list[i + 1]
                                        total += leave - list[i]
                                }

                                # Convert to minutes (round, not truncate)
                                enter = int(enter / 60 + 0.5)
                                leave = int(leave / 60 + 0.5)
                                total = int(total / 60 + 0.5)
                                final = int(list[n] / 60 + 0.5)

                                if (n % 2)
                                        printf("%s ID %s: %02d:%02d - %02d:%02d for %02d:%02d ! unpaired pass at %02d:%02d\n",
                                              daystr, id,
                                              int(enter / 60), int(enter % 60),
                                              int(leave / 60), int(leave % 60),
                                              int(total / 60), int(total % 60),
                                              int(final / 60), int(final % 60) )
                                else
                                        printf("%s ID %s: %02d:%02d - %02d:%02d for %02d:%02d\n",
                                              daystr, id,
                                              int(enter / 60), int(enter % 60),
                                              int(leave / 60), int(leave % 60),
                                              int(total / 60), int(total % 60) )
                        } else
                                printf("%s ID %s: Out\n", daystr, id)
                }

                # Add an empty line after each date
                printf("\n")
        }
}

The order of the input does not matter to this script; the data does not need to be sorted. The script also ignores duplicate input records. (I am assuming a successful "Pass" takes at least a second, i.e. the same user cannot pass twice in the same second.)

It first reads in the entire file, storing each pass event in a sorted list (string), keyed by date and ID. All dates and IDs seen are also kept in separate lists for display purposes. (The mergeintstring() function takes a string containing a space-separated list of ints, sorted in ascending order, and inserts the int into it unless the string already contains it. The function returns the list as a space-separated string.)

The END rule loops through each date. If a user did not pass on a given day, only "Out" is printed for that user. Otherwise, the sorted list of pass times is processed. If there is an unpaired pass, it is ignored in the work-time calculation, though it is still shown; an unpaired pass will of course always be the last event of that day. The last printf() in the script adds an empty line between dates; you can freely remove it (and of course change the output formats) as you wish.

The output for your example input is
Code:

2011-08-15 ID 42052: 15:23 - 15:23 for 00:00 ! unpaired pass at 15:23
2011-08-15 ID 43676: 09:00 - 16:19 for 07:10
2011-08-15 ID 44246: 08:56 - 15:44 for 06:31
2011-08-15 ID 45756: 09:13 - 16:14 for 07:01
2011-08-15 ID 54545: 08:52 - 16:20 for 07:28


rollyah 08-18-2011 04:24 AM

Thank you for your help. Both of your solutions proved useful, and I'm using them.

sundialsvcs 08-18-2011 09:00 AM

Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.

When records are sorted, you know two things for certain:
  1. All occurrences of a given key-value are adjacent.
  2. If a gap appears in the key sequence, you know that no records exist within it.
You can process arbitrarily large data-sets (whups, there goes my 1960s-era batch-mainframe terminology again...) while considering only two records at a time. And that, my good friend, is huge.
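In awk terms, that two-records-at-a-time idea is the control-break pattern grail's scripts use above; a minimal sketch, assuming input sorted on the key in field 2:
Code:

BEGIN { FS = " *, *" }
$2 != prev {                        # key changed: the previous group is complete
    if (NR > 1) print prev, first, last
    prev = $2; first = $4
}
{ last = $4 }                       # always remember the most recent time
END { print prev, first, last }     # flush the final group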

Furthermore, "external sorting" is one of the most heavily-studied algorithms out there ... hence the title of Dr. Knuth's book is Sorting and Searching. You can sort multiple millions of records today in mere seconds, using external programs that are lovingly crafted for the purpose.

Think back to the days of punched cards, long before digital computers existed. Those were the techniques that were used back then. That's what "all those spinning reels of tape" were for. It worked then, and it still works now.

Nominal Animal 08-18-2011 02:20 PM

Quote:

Originally Posted by sundialsvcs (Post 4446675)
Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.

A semi-rant follows :p

I disagree, for two reasons:
  1. There is usually no predetermined order in the data.
    Preferred data order depends on the algorithm used to process the data. Imposing a set order to the data limits algorithm choices, and therefore hinders modularity.
    In this specific case, do you sort by user ID, by date, or by string (as lines)?
  2. There usually is a variant of the algorithm that is order insensitive.
    Either there is no sorting done at all, or sorting can be done as a by-product of another operation, at a smaller cost than is otherwise possible.

In my gawk script, the unique dates and unique IDs are sorted as lists (for output purposes only, to get the output ordered by date and ID), but the individual events are not. For each ID on each date, a separate sorted list of pass events is kept. Due to gawk limitations, that list is a string, and the insert operation is less than optimal, but no actual sort of all events is ever done. As an example, you can switch the output order from dates to IDs by just swapping the outer loops in the END rule.
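For instance, the swapped loops would look like this (the per-key processing inside stays exactly the same):
Code:

for (iid = 1; iid <= ids; iid++) {
        id = idlist[iid]
        for (iday = 1; iday <= days; iday++) {
                day = daylist[iday]
                key = day " " id
                # ... the same "if (key in pass)" processing as above ...
        }
}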

I'd say my algorithm works a bit like a radix sort on the data. Although quicksort is faster on small data sets, a radix sort is O(N) and will always win on large enough data sets. I do realize sundialsvcs was more concerned about memory usage than CPU time; I just want to point out that there are other, perhaps more important, considerations here. In this case I consider even CPU time secondary; simplicity, modularity, and ease of adaptation were my principal goals.

As to memory use, every sorting method requires access to every element. External sorting methods do this in blocks, with data dynamically grouped into "local" and "external" sets (changing as the algorithm progresses), and the "local" data worked on. gawk arrays are limited to RAM+SWAP, but on most machines that is enough for truly large datasets. In this specific case, a gigabyte of input data contains some 25 million pass records; that is almost 300 pass events per second for a full day.

For a larger system, the algorithm I used in my gawk script could be reimplemented in C99, using the memory mapping techniques I've demonstrated earlier. On a 64-bit platform that implementation would basically be limited only by maximum file size. The example program demonstrates a terabyte data set; that maps to 792 pass events per second for a year, 24/7.

Unless I'm severely mistaken, the access patterns would be very efficient, as long as the input has order. If the input is more or less sorted either by date or by user ID, the kernel would write to each file block just once on average, if there is enough RAM. If not, the kernel will have to do extra I/O, but it'll still work fine; it will just take longer.

In general I very emphatically agree with sundialsvcs that one should not forget all the experience and algorithms developed earlier, specifically those developed to work under severe memory and CPU restrictions. However, I personally feel we should really understand the boundary conditions that applied to them, and recognize that many of those conditions no longer apply. The virtual memory capabilities of 64-bit platforms -- being able to map terabytes of data into memory and letting the kernel worry about which bits to keep in RAM and which on disk -- are one key example.

I work with large datasets, with simulators that typically cycle between being CPU-starved and I/O-starved. They most definitely would not have to be, if more efficient algorithms (and workload management) were used. The basic argument against change is typically "but this works, and is stable; we do not want to change it". I can accept that attitude for existing software, but not for new software. I think we should build and improve upon existing knowledge -- and not just rely on it.

I apologise for the semi-rant. I hope sundialsvcs is not too offended; I'm just frustrated because most programmers do not realize that many of the physical limits which once restricted algorithms and their implementations no longer apply.

