LinuxQuestions.org
Old 08-16-2011, 07:59 AM   #1
rollyah
LQ Newbie
 
Registered: Jan 2011
Posts: 15

Rep: Reputation: 1
parsing a text file - to awk or not to awk?


Dear all,

I have a text file with the following content:

Quote:
01 ,42052,2011-08-15 ,15:23:00,Pass
01 ,43676,2011-08-15 ,09:00:00,Pass
01 ,43676,2011-08-15 ,13:19:00,Pass
01 ,43676,2011-08-15 ,13:28:00,Pass
01 ,43676,2011-08-15 ,16:19:00,Pass
01 ,44246,2011-08-15 ,08:56:00,Pass
01 ,44246,2011-08-15 ,09:25:00,Pass
01 ,44246,2011-08-15 ,09:30:00,Pass
01 ,44246,2011-08-15 ,13:32:00,Pass
01 ,44246,2011-08-15 ,13:44:00,Pass
01 ,44246,2011-08-15 ,15:44:00,Pass
01 ,45756,2011-08-15 ,09:13:00,Pass
01 ,45756,2011-08-15 ,16:14:00,Pass
01 ,54545,2011-08-15 ,08:52:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
What I want is the following:

ID 44246 came in at 08:56:00 and left at 15:44:00, totaling 6 hours and 31 mins.

I've started by just parsing through the text file and trying to calculate the total hours, and I got stuck, as you can see below:
Quote:

ID=(52019
30580
2801
54391
53283
44246
45756
43676
58989
22456
22272
54545
22671
50018
5713
15872
54853
29972
44627
2877
6771
30884
42052
29670
2530)

for (( i=0; i<${#ID[@]}; i++ ))
do
    grep "${ID[$i]}" file | awk -F, -v id="${ID[$i]}" \
        '{ total = a[$4]; total = total - a[$4]++ }
         { print "ID " id " came in at " $4 " and left at " $4 " totaling ...... " }'
done
As you can see, I have trouble with the logic. Any help would be appreciated.

NB: is it possible to have a two-dimensional array in shell? If yes, that would solve this for me, as I could change the loop and read line by line while adding the $4 time to its respective ID.
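To the 2D-array question: bash has no true two-dimensional arrays, but bash 4's associative arrays (keyed by the ID itself) cover this use case. A minimal sketch, assuming the export is named file.txt, times are in ascending order within each ID, and `first`/`last` are made-up array names:

```shell
#!/bin/bash
# Collect the earliest and latest pass time per ID using bash 4
# associative arrays keyed by the ID (no 2-D array needed).
# Assumption: file.txt is the export, ordered by time within each ID.
declare -A first last
while IFS=, read -r _ id _ tm _; do
    id=${id//[[:space:]]/}                  # strip stray spaces around fields
    tm=${tm//[[:space:]]/}
    [[ -z ${first[$id]} ]] && first[$id]=$tm   # first time seen for this ID
    last[$id]=$tm                              # keeps being overwritten; ends up last
done < file.txt
for id in "${!first[@]}"; do
    echo "ID $id came in at ${first[$id]} and left at ${last[$id]}"
done
```

The loop reads the file once, so there is no need to grep once per ID.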
 
Old 08-16-2011, 08:38 AM   #2
acid_kewpie
Moderator
 
Registered: Jun 2001
Location: UK
Distribution: Gentoo, RHEL, Fedora, Centos
Posts: 43,398

Rep: Reputation: 1964
You don't need a 2D array here - you can use one, but there's really no need.

So the pseudo-awkish code I'd probably use would be something like
Code:
{
  id = $2
  date = $3
  time = $4
  epoch = convert_to_epoch(time)
  # an unset intime[id] compares as 0, so test membership first
  if ( !(id in intime) || intime[id] > epoch ) {
     intime[id] = epoch
  }
  if ( outtime[id] < epoch ) {
     outtime[id] = epoch
  }
}
END {
  for (id in intime)
    print intime[id] " blah blah " outtime[id]
}
http://www.gnu.org/s/gawk/manual/htm...Functions.html

EDIT - the fields are in order, aren't they? I thought there were only two IDs there, but there are more. That's even easier - no need to compare anything; just take the first value and the last for each ID.
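The "first value, last value per ID" idea sketches out like this in awk, assuming the input is sorted by ID and then time, and a file named file.txt:

```shell
# With sorted input, the first line of each ID block is the arrival and
# the line just before the ID changes is the departure.
awk -F' *, *' '
    $2 != id { if (id != "") print "ID " id ": in " first ", out " last
               id = $2; first = $4 }
    { last = $4 }
    END { if (id != "") print "ID " id ": in " first ", out " last }
' file.txt
```

The `-F' *, *'` field separator also eats the stray spaces around the commas in the sample data.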

Last edited by acid_kewpie; 08-16-2011 at 08:49 AM.
 
Old 08-16-2011, 08:43 AM   #3
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903
Is the data always sorted on the second field, evidently 'ID'? Does the time calculation ever have to deal with the likes of a change of date, a change to or from daylight saving, etc.? You posted data, but did not use [code] [/code] tags, so the formatting may have been altered. Please post your sample data in code tags (same for your code). Is a Perl solution acceptable? Time-oriented calculations may be better handled with Perl.

--- rod.
 
Old 08-16-2011, 09:40 AM   #4
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,516

Rep: Reputation: 1896
Here is another just awk alternative:
Code:
#!/usr/bin/awk -f

BEGIN{ FS = " *, *" }

!x && $2 == 44246{
    x=1
    print "Start time: "$4
    
    split($3,d,"-")
    split($4, t, ":")
    
    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])

    st = mktime(tmp)
}

x && $2 != 44246{
    print "End time: "lastt
    
    split(lasty, d, "-")
    split(lastt, t, ":")
    
    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])
    
    ft = mktime(tmp)

    mod = ((tot = ft - st) > 86400?int(tot/86400)" day(s) ":"")"%T"

    print strftime("The time spent is "mod, tot, 1)

    exit
}

x{ lasty = $3
    lastt = $4 }
This also assumes the data is sorted by ID and then date / time.
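If the export isn't already in that order, a pre-sort satisfies the assumption; file.txt and sorted.txt are assumed names:

```shell
# Sort by ID (field 2, numeric), then date (field 3), then time (field 4);
# the comma is the field delimiter.
sort -t, -k2,2n -k3,3 -k4,4 file.txt > sorted.txt
```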
 
Old 08-17-2011, 12:43 AM   #5
rollyah
LQ Newbie
 
Registered: Jan 2011
Posts: 15

Original Poster
Rep: Reputation: 1
Code:
01 ,42052,2011-08-15 ,15:23:00,Pass
01 ,43676,2011-08-15 ,09:00:00,Pass
01 ,43676,2011-08-15 ,13:19:00,Pass
01 ,43676,2011-08-15 ,13:28:00,Pass
01 ,43676,2011-08-15 ,16:19:00,Pass
01 ,44246,2011-08-15 ,08:56:00,Pass
01 ,44246,2011-08-15 ,09:25:00,Pass
01 ,44246,2011-08-15 ,09:30:00,Pass
01 ,44246,2011-08-15 ,13:32:00,Pass
01 ,44246,2011-08-15 ,13:44:00,Pass
01 ,44246,2011-08-15 ,15:44:00,Pass
01 ,45756,2011-08-15 ,09:13:00,Pass
01 ,45756,2011-08-15 ,16:14:00,Pass
01 ,54545,2011-08-15 ,08:52:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
01 ,54545,2011-08-15 ,16:20:00,Pass
Code:
ID=(52019
30580
2801
54391
53283
44246
45756
43676
58989
22456
22272
54545
22671
50018
5713
15872
54853
29972
44627
2877
6771
30884
42052
29670
2530)

for (( i=0; i<${#ID[@]}; i++ ))
do
    grep "${ID[$i]}" file | awk -F, -v id="${ID[$i]}" \
        '{ total = a[$4]; total = total - a[$4]++ }
         { print "ID " id " came in at " $4 " and left at " $4 " totaling ...... " }'
done
Thanks for all of your replies. I'll answer each question below:



Chris,
There are more than 100 IDs, with different dates.

Rod,
Yes, the output is always the same: the fields are always ordered as shown above. The date will change, as the file is exported every day at midnight, and the calculation will happen then.
Daylight saving shouldn't be taken into consideration; the most important part for me is the number of hours the person spent INSIDE. The rest of the info is secondary.
While searching for a way to solve this, I came across a number of threads confirming your advice (using Perl), though I'm not familiar with it yet.

grail,
May I bother you to explain the steps above? I'm currently reading the "sed & awk" book to get my head around awk, and what you posted is way too advanced for me to understand on my own.


NOTE: I'll have to calculate the hours spent for every ID, not just 44246.

Thank you all for your help so far.

Best,

--Roland

Last edited by rollyah; 08-17-2011 at 12:45 AM. Reason: added Note
 
Old 08-17-2011, 03:47 AM   #6
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,516

Rep: Reputation: 1896
This one seems to take care of the data for each ID:
Code:
#!/usr/bin/awk -f

BEGIN{ FS = " *, *" }    # sets the delimiter as a comma with zero or more spaces before or after

x != $2{    # Check for when the ID changes

    if( lastt ){    # if last time is set, ie only after getting first ID
        print "End time: "lastt

        ft = gettime(lasty, lastt)    # Get final time in a format that awk can understand

        tot = ft - st    # Total seconds between first and last pass for this ID

        print strftime("The time spent is %T", tot, 1)    # print the output in HH:MM:SS format
    }

    x = $2    # Set current ID
    print "ID:",x,"Start time: "$4

    st = gettime($3, $4)    # Get start time in a format that awk can understand
}

{   # Store date and time to be used on changing ID
    lasty = $3
    lastt = $4
}

END{    # The last ID will need to be processed here (same as if above)
    print "End time: "lastt

    ft = gettime(lasty, lastt)

    tot = ft - st

    print strftime("The time spent is %T", tot, 1)
}

function gettime(date, time,    tmp, d, t)
{
    split(date, d, "-")    # Split the date into array d based on dashes
    split(time, t, ":")    # Split the time into array t based on colons

    tmp = sprintf("%d %d %d %d %d %d",d[1],d[2],d[3],t[1],t[2],t[3])    # mktime function requires the data in this order

    return mktime(tmp)    # return time based on fields of the combined arrays
}
Let me know if you need more explanation. You can get all the details for this from the gawk manual.

Last edited by grail; 08-17-2011 at 03:49 AM.
 
1 member found this post helpful.
Old 08-17-2011, 11:45 AM   #7
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
Here's the gawk script I'd probably use:
Code:
#!/usr/bin/gawk -f

BEGIN {
	RS="[\n\t\v\f\r ]*[\n\r][\n\t\v\f\r ]*"
	FS="[\t ]*[,][\t ]*"

	split("", idlist)
	split("", daylist)
	split("", pass)
}

# Add an int value to a sorted string of space-separated ints,
# unless the int is already in the list.
# Returns the list as a space-separated string.
function mergeintstring(string, value) {
	result = ""
	v = int(value)
	n = split(string, list, " ")
	i = 1;

	while (i <= n && int(list[i]) < v)
		result = result " " int(list[i++])

	while (i <= n && int(list[i]) == v)
		i++

	result = result " " v

	while (i <= n)
		result = result " " int(list[i++])

	return substr(result, 2)
}

($5 == "Pass") {
	id = int($2)

	# Convert date to numeric value, YYYYMMDD
	day = $3
	gsub(/[^0-9]+/, "", day)
	day = int(day)

	# List key is a combination of the date and the user ID
	key = day " " id

	# Add user ID and date to known lists
	idlist[id] = id
	daylist[day] = day

	# Convert time to seconds
	split($4, list, ":")
	sec = int(60 * 60 * list[1] + 60 * list[2] + list[3])

	# Add the pass time in seconds to the sorted list
	# for this user and this date
	pass[key] = mergeintstring(pass[key], sec)
}

END {
	# Sort the known users and dates in ascending order
	ids = asort(idlist)
	days = asort(daylist)

	for (iday = 1; iday <= days; iday++) {
		day = daylist[iday]
		daystr = sprintf("%d-%02d-%02d", int(day / 10000), int(day / 100) % 100, int(day % 100))

		for (iid = 1; iid <= ids; iid++) {
			id = idlist[iid]
			key = day " " id

			if (key in pass) {

				# Split the list of pass seconds, and calculate total
				n = split(pass[key], list, " ")
				enter = list[1]
				leave = list[1]
				total = 0
				for (i = 1; i < n; i += 2) {
					leave = list[i + 1]
					total += leave - list[i]
				}

				# Convert to minutes (round, not truncate)
				enter = int(enter / 60 + 0.5)
				leave = int(leave / 60 + 0.5)
				total = int(total / 60 + 0.5)
				final = int(list[n] / 60 + 0.5)

				if (n % 2)
					printf("%s ID %s: %02d:%02d - %02d:%02d for %02d:%02d ! unpaired pass at %02d:%02d\n",
					       daystr, id,
					       int(enter / 60), int(enter % 60),
					       int(leave / 60), int(leave % 60),
					       int(total / 60), int(total % 60),
					       int(final / 60), int(final % 60) )
				else
					printf("%s ID %s: %02d:%02d - %02d:%02d for %02d:%02d\n",
					       daystr, id,
					       int(enter / 60), int(enter % 60),
					       int(leave / 60), int(leave % 60),
					       int(total / 60), int(total % 60) )
			} else
				printf("%s ID %s: Out\n", daystr, id)
		}

		# Add an empty line after each date
		printf("\n")
	}
}
The order of the input does not matter to this script; it does not need to be sorted at all. The script also ignores duplicate input records. (I am assuming a successful "Pass" takes at least a second, i.e. the same user cannot pass twice in the same second.)

It first reads in the entire file, storing each pass event in a sorted list (a string), keyed by date and ID. All dates and IDs seen are also kept in separate lists for display purposes. (The mergeintstring() function takes a string containing a space-separated list of ints, sorted in ascending order, and inserts the int into it unless the string already contains it. The function returns the list as a space-separated string.)

The END rule loops through each date. If the user did not pass that day, only "Out" is printed for that user. Otherwise the sorted list of pass times is processed. If there is an unpaired pass, it is ignored in the work-time calculation, but it is still shown; an unpaired pass is, of course, always the last event of that day. The last printf() in the script adds an empty line between dates; you can freely remove it (and of course change the output formats) as you wish.

The output for your example input is
Code:
2011-08-15 ID 42052: 15:23 - 15:23 for 00:00 ! unpaired pass at 15:23
2011-08-15 ID 43676: 09:00 - 16:19 for 07:10
2011-08-15 ID 44246: 08:56 - 15:44 for 06:31
2011-08-15 ID 45756: 09:13 - 16:14 for 07:01
2011-08-15 ID 54545: 08:52 - 16:20 for 07:28
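For anyone without gawk, the same pairing logic can be sketched in plain awk by converting each HH:MM:SS to seconds by hand; the hard-coded 44246 and the file name file.txt are assumptions taken from the thread's example, and the times are assumed to be in ascending order:

```shell
# Sum consecutive in/out pairs for one ID, portable awk (no gawk mktime).
awk -F' *, *' '$2 == 44246 {
    split($4, t, ":")
    sec[++n] = t[1]*3600 + t[2]*60 + t[3]   # HH:MM:SS -> seconds since midnight
}
END {
    for (i = 1; i + 1 <= n; i += 2)         # pair entry i with exit i+1
        total += sec[i+1] - sec[i]
    printf "ID 44246 total inside: %02d:%02d\n", total/3600, (total%3600)/60
}' file.txt
```

On the sample data this prints `ID 44246 total inside: 06:31`, matching the paired totals above.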
 
1 member found this post helpful.
Old 08-18-2011, 04:24 AM   #8
rollyah
LQ Newbie
 
Registered: Jan 2011
Posts: 15

Original Poster
Rep: Reputation: 1
Thank you for your help. Both of your suggestions proved useful, and I'm using them.
 
Old 08-18-2011, 09:00 AM   #9
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,377

Rep: Reputation: 1108
Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.

When records are sorted, you know two things for certain:
  1. All occurrences of a given key-value are adjacent.
  2. If a gap appears in the key sequence, you know the missing records do not exist.
You can process arbitrarily large data-sets (whups, there goes my 1960's-era batch mainframe terminology again...) while considering only two records at a time. And that, my good friend, is huge.

Furthermore, "external sorting" is one of the most heavily studied problems out there ... hence the title of Dr. Knuth's volume, Sorting and Searching. You can sort many millions of records today in mere seconds, using external programs that are lovingly crafted for the purpose.

Think back to the days of punched cards, long before digital computers existed. Those were the techniques that were used back then. That's what "all those spinning reels of tape" were for. It worked then, and it still works now.
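As a sketch of that "two records at a time" idea on this thread's data (file name file.txt is an assumption): once sorted, duplicate records collapse with a plain uniq, and grouping by ID reduces to comparing each line with the previous one:

```shell
# Sort, drop adjacent duplicates, then stream through once: because all
# lines for an ID are adjacent, one saved "prev" value is all the state needed.
sort -t, -k2,2n -k3,3 -k4,4 file.txt | uniq |
awk -F' *, *' '$2 != prev { groups++; prev = $2 }
               END { print groups " distinct IDs" }'
```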

Last edited by sundialsvcs; 08-18-2011 at 09:02 AM.
 
Old 08-18-2011, 02:20 PM   #10
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
Quote:
Originally Posted by sundialsvcs View Post
Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.
A semi-rant follows

I disagree, for two reasons:
  1. There is usually no single predetermined order in the data.
    The preferred order depends on the algorithm used to process the data. Imposing a set order on the data limits algorithm choices, and therefore hinders modularity.
    In this specific case, do you sort by user ID, by date, or by string (as lines)?
  2. There is usually a variant of the algorithm that is order-insensitive.
    Either no sorting is done at all, or sorting can be done as a by-product of another operation, at a smaller cost than is otherwise possible.

In my gawk script, the unique dates and unique IDs are sorted as lists (for output purposes only -- to get the output ordered by date and ID), but the individual events are not. For each ID on each date, a separate list of pass events is kept sorted. Due to gawk limitations, that list is a string, and the insert operation is less than optimal... but no actual sort of all events is done. As an example, you can switch the output order from dates to IDs by just swapping the outer loops in the END rule.

I'd say my algorithm works a bit like a radix sort on the data. Although quicksort is faster on small data sets, a radix sort is O(N) and will always be faster on large enough data sets. I do realize sundialsvcs was more concerned about memory usage than CPU time; I just want to point out that there are other, perhaps more important, considerations here. In this case I consider even CPU time secondary; simplicity, modularity, and ease of adaptation were my principal goals.

As to memory use, every sorting method requires access to every element. External sorting methods do this in blocks, with data dynamically grouped to "local" and "external" (changing as the algorithm progresses), with "local" data worked on. gawk arrays are limited to RAM+SWAP, but on most machines, that is enough for truly large datasets. In this specific case, a gigabyte of input data contains 25 million pass records; that is almost 300 pass events per second for a full day.

For a larger system, the algorithm I used in my gawk script could be reimplemented in C99, using the memory mapping techniques I've demonstrated earlier. On a 64-bit platform that implementation would basically be limited only by maximum file size. The example program demonstrates a terabyte data set; that maps to 792 pass events per second for a year, 24/7.

Unless I'm severely mistaken, the access patterns would be very efficient, as long as the input has order. If the input is more or less sorted either by date or by user ID, the kernel would write to each file block just once on average, if there is enough RAM. If not, the kernel will have to do extra I/O, but it'll still work fine; it will just take longer.

In general I very emphatically agree with sundialsvcs that one should not forget the experience and algorithms developed earlier, specifically those developed to work under severe memory and CPU restrictions. However, I personally feel we should really understand the boundary conditions that applied to them, and recognize that many no longer apply. The virtual memory capability of 64-bit platforms -- being able to map terabytes of data into memory and let the kernel worry about which bits to keep in RAM and which on disk -- is one key example.

I work with large datasets, with simulators that are typically cyclically CPU-starved, then I/O-starved. They most definitely do not have to be, if more efficient algorithms (and workload management) were used. The basic argument against changes is typically "but this works, and is stable; we do not want to change it". I can accept that attitude for existing software, but not for new software. I think we should build and improve upon existing knowledge -- and not just rely on it.

I apologise for the semi-rant. I hope sundialsvcs is not too offended; I'm just frustrated because most programmers do not realize many of the physical limits that have restricted algorithms and their development and implementations no longer apply.
 
  



Tags
array, awk, date, parse, time




