parsing a text file - to awk or not to awk ?
Dear all,
I have a text file with content like this: Quote:
ID 44246 came in at 08:56:00 and left at 15:44:00 totaling 6 hours and 31 mins.

I've started by just parsing through the text file and trying to calculate the total hours, which is where I got stuck, as you can see below: Quote:
NB: is it possible to have a two-dimensional array in shell? If yes, that would solve this for me, as I could change the loop to read line by line while adding the $4 time to its respective ID.
You don't need a 2D array here; you can use one, but there's really no need.
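For what it's worth, awk can fake a two-dimensional array with a comma subscript (the key parts are joined internally with SUBSEP). A minimal sketch, assuming comma-separated fields with the ID in $2, the date in $3 and the time in $4:
Code:
#!/usr/bin/awk -f
# Collect all pass times per (ID, date) pair using a pseudo-2D array key.
BEGIN { FS = "[ \t]*,[ \t]*" }
{ times[$2, $3] = times[$2, $3] " " $4 }
END {
    for (key in times) {
        split(key, k, SUBSEP)
        printf "ID %s on %s:%s\n", k[1], k[2], times[key]
    }
}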
So the pseudo-awkish code I'd probably use would be something like: Code:
{

EDIT - The fields are in order, aren't they? I thought there were only two IDs there, but there are more. That's even easier: no need to compare anything, just take the first value and the last for each ID.
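A minimal sketch of that first-and-last idea (an illustration only, not the code from this post; the comma-separated layout with the ID in $2 and the time in $4 is an assumption):
Code:
#!/usr/bin/awk -f
# For every ID, remember the first time seen (came in) and keep
# overwriting the last time seen (left), then report the pair.
BEGIN { FS = "[ \t]*,[ \t]*" }
{
    if (!($2 in first))
        first[$2] = $4
    last[$2] = $4
}
END {
    for (id in first)
        printf "ID %s came in at %s and left at %s\n", id, first[id], last[id]
}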
Is the data always sorted on the second field, evidently 'ID'? Does the time calculation ever have to deal with the likes of a change of date, a change to/from daylight saving, etc.? You posted data, but did not use [code] [/code] tags, so the formatting may have been altered. Please post your sample data in code tags (same for your code). Is a Perl solution acceptable? (Time-oriented calculations may be better handled with Perl.)
--- rod.
Here is another awk-only alternative:
Code:
#!/usr/bin/awk -f
Code:
01 ,42052,2011-08-15 ,15:23:00,Pass
Code:
ID=(52019

I'll be answering all questions below:

Chris, there are more than 100 IDs, with different dates.

Rod, yes, the output is always the same, and the fields are always ordered as shown above. The date will change, as the file will be exported every day at midnight, and the calculation will happen then. Daylight saving shouldn't be taken into consideration; the most important part for me is the number of hours the person spent INSIDE, and the rest of the info is secondary. While searching for a way to solve this, I came across a number of threads confirming your advice (using Perl), though I'm not familiar with it yet.

grail, may I bother you to explain the steps above? I'm currently reading the "awk & sed" book to get my head around awk, and what you posted is way too advanced for me to work out on my own.

NOTE: I'll have to calculate the hours spent for every ID, not just 44246.

Thank you all for your help so far.
Best,
--Roland
This one seems to take care of the data for each ID:
Code:
#!/usr/bin/awk -f
Here's the gawk script I'd probably use:
Code:
#!/usr/bin/gawk -f

It first reads in the entire file, storing each pass event in a sorted list (a string), keyed by date and ID. All dates and IDs seen are also kept in separate lists for display purposes. (The mergeintstring() function takes a string containing a space-separated list of ints, sorted in ascending order, and inserts the int into it unless the string already contains it. The function returns the list as a space-separated string.)

The END rule will loop through each date. If the user did not pass that day, then only "Out" is printed for that user. Otherwise the sorted list of pass times is processed. If there is an unpaired pass, it will be ignored in the work time calculations; it will be shown, though. If an unpaired pass occurs, it will of course always be the last event for that day.

The last printf() in the script adds an empty line between dates; you can freely remove it (and of course change the output formats) as you wish. The output for your example input is Code:
2011-08-15 ID 42052: 15:23 - 15:23 for 00:00 ! unpaired pass at 15:23
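A minimal sketch of the mergeintstring() idea and the per-date, per-ID keying described above (an illustration only, not the script from this post; the field layout with the ID in $2, the date in $3, the time in $4 and "Pass" in $5 is assumed from the sample data):
Code:
#!/usr/bin/gawk -f
# Insert value into a space-separated, ascending list of ints,
# unless it is already there; return the (possibly unchanged) list.
function mergeintstring(list, value,    n, i, parts, out, done) {
    n = split(list, parts, " ")
    for (i = 1; i <= n; i++) {
        if (parts[i] + 0 == value + 0)
            return list                              # already present
        if (!done && parts[i] + 0 > value + 0) {
            out = (out == "" ? "" : out " ") value   # insert before larger entry
            done = 1
        }
        out = (out == "" ? "" : out " ") parts[i]
    }
    if (!done)
        out = (out == "" ? "" : out " ") value       # largest so far: append
    return out
}

BEGIN { FS = "[ \t]*,[ \t]*" }

$5 == "Pass" {
    split($4, t, ":")                                # store seconds since midnight
    key = $3 SUBSEP $2                               # keyed by date and ID
    passes[key] = mergeintstring(passes[key], t[1]*3600 + t[2]*60 + t[3])
    dates[$3]; ids[$2]                               # remember dates and IDs seen
}
Pairing the times in each list (in, out, in, out, ...) in an END rule then gives the total time spent inside, and any trailing unpaired time can be reported separately.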
Thank you for your help.
Both of your suggestions proved useful, and I'm using them.
Categorically speaking, it is always most desirable to sort the file before processing, and to write algorithms that insist upon (and that check for...) this sorted-order requirement.
When records are sorted, you know two things for certain: all of the records for a given key are adjacent to one another, and once the key value changes you have seen the last record for that key, so you can emit its result and forget about it.
Furthermore, "external sorting" is one of the most heavily-studied algorithms out there ... hence the title of Dr. Knuth's book is Sorting and Searching. You can sort multiple millions of records today in mere seconds, using external programs that are lovingly crafted for the purpose. Think back to the days of punched cards, long before digital computers existed. Those were the techniques that were used back then. That's what "all those spinning reels of tape" were for. It worked then, and it still works now. |
Quote:
I disagree, for two reasons:
In my gawk script, the unique dates and unique IDs are sorted as lists (for output purposes only -- to get the output sorted by dates and IDs), but the individual events are not. For each ID on each date a separate list of pass events is kept in sorted order. Due to gawk limitations, that list is a string, and the insert operation is less than optimal... but there is no actual sort of all events done. As an example, you can switch the output order from dates to IDs by just swapping the outer loops in the END rule.

I'd say my algorithm works a bit like a radix sort on the data. Although quicksort is faster on small data sets, a radix sort is O(N) and will always be faster on large enough data sets. I do realize sundialsvcs was more concerned about memory usage than CPU time used; I just want to point out that there are other, perhaps more important considerations here. In this case I consider even CPU time secondary; simplicity, modularity, and ease of adaptation were my principal goals.

As to memory use, every sorting method requires access to every element. External sorting methods do this in blocks, with data dynamically grouped into "local" and "external" sets (changing as the algorithm progresses), and only the "local" data worked on. gawk arrays are limited to RAM+swap, but on most machines that is enough for truly large datasets. In this specific case, a gigabyte of input data contains 25 million pass records; that is almost 300 pass events per second for a full day.

For a larger system, the algorithm I used in my gawk script could be reimplemented in C99, using the memory mapping techniques I've demonstrated earlier. On a 64-bit platform that implementation would basically be limited only by maximum file size. The example program demonstrates a terabyte data set; that maps to 792 pass events per second for a year, 24/7. Unless I'm severely mistaken, the access patterns would be very efficient, as long as the input has some order. If the input is more or less sorted either by date or by user ID, the kernel would write to each file block just once on average, provided there is enough RAM. If not, the kernel will have to do extra I/O, but it will still work fine; it will just take longer.

In general I very emphatically agree with sundialsvcs that one should not forget all the experience and algorithms developed earlier, specifically those developed to work with severe memory and CPU restrictions. However, I personally feel we should really understand the boundary conditions that applied to them, and recognize that many of those conditions no longer apply. The virtual memory capabilities of 64-bit platforms -- being able to map terabytes of data into memory, and letting the kernel worry about which bits to keep in RAM and which bits on disk -- are one key example.

I work with large datasets, with simulators that are typically cyclically CPU-starved, then I/O-starved. They most definitely do not have to be, if more efficient algorithms (and workload management) were used. The basic argument against change is typically "but this works, and is stable; we do not want to change it". I can accept that attitude for existing software, but not for new software. I think we should build and improve upon existing knowledge -- and not just rely on it.

I apologise for the semi-rant. I hope sundialsvcs is not too offended; I'm just frustrated because most programmers do not realize that many of the physical limits which have restricted algorithms, their development, and their implementations no longer apply.
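For concreteness, the END-rule structure being described looks roughly like this (a sketch only, reusing the array names from the sketch earlier in the thread: dates[], ids[] and passes[date SUBSEP id]; asorti() is a gawk extension):
Code:
END {
    nd = asorti(dates, d)                # sorted copies of the date keys
    ni = asorti(ids, u)                  # sorted copies of the ID keys
    for (i = 1; i <= nd; i++)            # outer loop: dates
        for (j = 1; j <= ni; j++) {      # inner loop: IDs
            key = d[i] SUBSEP u[j]
            if (key in passes)
                printf "%s ID %s: %s\n", d[i], u[j], passes[key]
            else
                printf "%s ID %s: Out\n", d[i], u[j]
        }
    # Swapping the two for loops switches the output grouping from dates to IDs.
}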