Working with a for loop / alternate ways welcome
I have 3 files named log.2018-12-10.gz, log.2018-12-11.gz, and log.2018-12-13.gz.
These 3 files contain records in date/time format (the date matches the file name). The aim is to sum the total records hour-wise (08 AM to 22 PM). I have used the below command in UNIX:

Code:
for i in `ls -1 za.log.2018-12-1[0-3]*`; do zcat $i|grep -i abcd|cut -c 5-6|egrep "0[8-9]|1[0-9]|2[0-2]"|sort|uniq -c;done

Output-
Code:
473 08
765 09
957 10
1085 11
1220 12
1205 13
1143 14
1035 15
920 16
752 17
653 18
526 19
389 20
153 21
130 22
395 08
642 09
877 10
1055 11
1163 12
1130 13
935 14
986 15
929 16
724 17
578 18
537 19
317 20
169 21
119 22

Note- here the first column is the count and the second one is the hour. I want the resulting output arranged column-wise as below (one count/hour column pair per file)-

Code:
473 8    395 8    462 8
765 9    642 9    704 9
957 10   877 10   906 10
1085 11  1055 11  953 11
1220 12  1163 12  1180 12
1205 13  1130 13  628 13
1143 14  935 14   645 14
1035 15  986 15   899 15
920 16   929 16   896 16
752 17   724 17   679 17
653 18   578 18   689 18
526 19   537 19   492 19
389 20   317 20   391 20
153 21   169 21   138 21
130 22   119 22   107 22

## Workaround - I have created separate output files and then used the paste command, but I want to do it in a single command. (There are more than 3 files.) |
Quote:
It's hard to say anything without knowing the input and the expected output, but the script can be improved as follows:
|
Yes, we need to know more (an example input file would probably be better).
It looks like a single awk/perl/python script can do this (and will be faster). |
Just adding a response to the above.
I used ls because there are other files in my directory, and sort|uniq -c to get the count hour-wise; sort -u will not work. Adding more clarity to my question: I want to work on structured output. As you can see, my output is in one-column format (count hour), and I have shown the required output above. There are other records/entries in my file, so I used the cut command to extract the hour field (hh). |
If the files whose names start with za are regular files, and their names contain no funny characters, then this:
Code:
for i in $(ls -1 za*)
is better written as
Code:
for i in za*

I now understand what you want: the sum of all records for hour 8 for file 1, then file 2, then file 3 on the first line; then the same for hour 9 on the second line, and so on. The problem is that the pipeline inside the for loop produces output for file 1, then file 2, then file 3 sequentially, so it cannot create the columns you want. Instead, you need to collect the data for all files and only display the counts once you have reached the last line of file 3. This is not that easy, and I would say awk is the right tool for it, as pan64 hinted. To do this, we still need to know the input format, at least the date/time format in the input. |
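A quick demonstration of why the glob form is safer than word-splitting ls output (the file names below are invented for the test, not part of the thread's data):

```shell
# Compare iterating over a glob vs. the word-split output of ls when a
# file name contains spaces. (Files are created in a scratch directory.)
cd "$(mktemp -d)"
: > 'za file 1'        # a "funny" name containing spaces
: > za2

glob_count=0
for i in za*; do glob_count=$((glob_count + 1)); done

ls_count=0
for i in $(ls -1 za*); do ls_count=$((ls_count + 1)); done

# The glob sees 2 files; the ls output is split on whitespace into 4 words.
echo "$glob_count $ls_count"
```

The glob loop iterates exactly once per file, while the `$(ls ...)` loop splits `za file 1` into three separate words.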
Ok, below is the sample file record data; the hour field starts at the 12th character (in my command above I used cut -c 5-6 for temp; for this data, say, cut -c 12-13).
Sample file log.2018-12-10.gz:
Code:
10-12-2018 00:01:15 abcd ......
10-12-2018 03:12:17 abcd ......
.
.
.
10-12-2018 08:16:14 abcd .....
10-12-2018 10:12:01 abcd .....
.
.
.
10-12-2018 22:12:12 abcd .....
10-12-2018 23:01:01 abcd .....

The same goes for the other files; you can say there are day-wise files for every month. We require the sum of records for hours 08-22 for specific files, hence I have filtered with grep and wildcard characters. My command's output is one column with the hour-wise sums for all files, and I need file 1's output in the 1st column, file 2's in the 2nd, file 3's in the 3rd column, and so on.

Actual output format-
Code:
File1 (Sum hour)
1234 08
1232 09
File2
1243 08
1263 22
File3
5423 08
3456 12
3453 22

Expected output- File1 is in the 1st column, File2 is in the 2nd column, File3 is in the 3rd column. |
How well do you know awk?
This is what I came up with, running the following awk program with three file arguments (log.*): Code:
$ awk '{ count[substr($2,1,2)] += 1 }
       ENDFILE { for (c in count) result[c] = result[c] " " count[c]; delete count }
       END { for (c in result) print c, result[c] }' log.*

The first line collects the lines for each hour. The hour is used as the index of the array count; to get the hour, I use the substr() function to peel the first two characters off the second field ($2) of each line.

Second line: ENDFILE only exists in the GNU version of awk, which is normally the version in Linux distros. It may or may not work on BSD or other UNIXes. ENDFILE signals that the end of a file has been reached. At this point, I go through the count array and append each count to another array named result. c is an hour, and count[c] is the number of times the hour occurred in the file. After that, I delete the count array so that I can start from scratch with the next file.

Third line: END signals the overall end of input. At that point, I dump out the result array. Unfortunately, associative arrays are not sorted in any way.

Exercises: correct sorting (I'd pipe the output into the sort command), and labeling the columns with the file names (this needs to be added to the awk program, I think).

I warmly recommend the awk guide referenced in my signature.

EDIT: I wrote the program based on your comment in the original post: Quote:
|
Postprocess your loop with awk (or perl, or bash 4, which provide associative arrays). The array is indexed by column #2, each element accumulates the string, and the array is printed at the end.
Code:
for i in za.log.2018-12-1[0-3]*.gz; do zcat "$i"|grep -i abcd|cut -c 5-6|egrep "0[8-9]|1[0-9]|2[0-2]"|sort|uniq -c;done | awk '{A[$2]=(A[$2] " " $1)} END {for (i in A) print i,A[i]}' |
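As a quick sanity check of just the awk postprocessing stage (the four sample count lines below are invented, standing in for what sort | uniq -c would emit for two files):

```shell
# Feed fake "count hour" lines (as produced per file by sort | uniq -c)
# into the associative-array merge; each hour accumulates one count per file.
cd "$(mktemp -d)"
printf '%s\n' '473 08' '765 09' '395 08' '642 09' |
awk '{ A[$2] = (A[$2] " " $1) } END { for (i in A) print i, A[i] }' |
sort > merged.txt          # for-in order is unspecified, so sort by hour
cat merged.txt
```

Each output line is the hour followed by one count per input file, which is the row-per-hour shape the thread is after (columns still need padding for files that skip an hour).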
I do not really know if we need that for loop.
Also zcat and grep can be combined, so: Code:
zgrep -i abcd za.log.2018-12-1[0-3]*.gz
# should work. Remember, if there are multiple files, grep will add the filename to the output.
Code:
zgrep -E -i "201[89] (0[8-9]|1[0-9]|2[0-2]):.*abcd" za.log.2018-12-1[0-3]*.gz
# or something similar can work; even cut can be eliminated this way, you just need to use associative arrays as mentioned.
Code:
awk ' BEGIN { FS="[: ]" } # set field separator to something convenient |
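The fragment above is only the opening of the script, so here is one possible completion (my sketch, with invented sample lines mimicking zgrep's file-prefixed output): with FS set to "[: ]", the file name lands in $1 and the hour in $3, so no cut is needed.

```shell
# Count records per file and hour straight from zgrep-style lines
# ("file.gz:dd-mm-yyyy hh:mm:ss abcd ..."), simulated here with printf.
cd "$(mktemp -d)"
printf '%s\n' \
  'log.2018-12-10.gz:10-12-2018 08:16:14 abcd x' \
  'log.2018-12-10.gz:10-12-2018 08:20:00 abcd x' \
  'log.2018-12-10.gz:10-12-2018 10:12:01 abcd x' |
awk '
    BEGIN { FS = "[: ]" }        # set field separator to ":" or " "
    { hits[$1 " " $3]++ }        # $1 = file name, $3 = hour
    END { for (k in hits) print k, hits[k] }
' | sort > hours.txt
cat hours.txt
```

Each output line is file name, hour, and count, ready for a later columnizing step.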
Trying to simplify my original post.
I need the hour-wise record count separated by column (i.e. the record count for each new file should go in the next column), as shown below. The hour is at the 12th character position. My files have the below data-

Code:
File1
2019-01-04 00:00:19
2019-01-04 00:00:19
2019-01-04 00:00:19
2019-01-04 01:07:38
2019-01-04 01:07:38
2019-01-04 01:07:38
2019-01-04 08:00:39
2019-01-04 08:02:27

File2
2019-01-04 00:00:19
2019-01-04 01:00:19
2019-01-04 02:00:19
2019-01-04 02:07:38
2019-01-04 02:07:38
2019-01-04 10:07:38
2019-01-04 10:00:39
2019-01-04 13:02:27

File3
2019-01-04 08:00:19
2019-01-04 09:00:19
2019-01-04 09:00:19
2019-01-04 10:07:38
2019-01-04 12:07:38
2019-01-04 12:07:38
2019-01-04 19:00:39
2019-01-04 19:02:27

Output (hour count; column 1 is File1, column 2 is File2, column 3 is File3)-
Code:
0 3     0 1     8 1
1 3     1 1     9 2
8 2     2 3     10 1
        10 2    12 2
        13 1    19 2

(In the original post the counts were color-coded: red for File1, blue for File2, and black for File3.) |
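Putting the pieces of this thread together, here is one portable-awk sketch run against exactly the File1/File2/File3 data above (it avoids gawk's ENDFILE by flushing on FNR == 1, and pads each cell to 8 characters so each file keeps its own column):

```shell
# Build the three sample files from the post, then print per-file hour
# counts side by side: column 1 = File1, column 2 = File2, column 3 = File3.
cd "$(mktemp -d)"

cat > File1 <<'EOF'
2019-01-04 00:00:19
2019-01-04 00:00:19
2019-01-04 00:00:19
2019-01-04 01:07:38
2019-01-04 01:07:38
2019-01-04 01:07:38
2019-01-04 08:00:39
2019-01-04 08:02:27
EOF
cat > File2 <<'EOF'
2019-01-04 00:00:19
2019-01-04 01:00:19
2019-01-04 02:00:19
2019-01-04 02:07:38
2019-01-04 02:07:38
2019-01-04 10:07:38
2019-01-04 10:00:39
2019-01-04 13:02:27
EOF
cat > File3 <<'EOF'
2019-01-04 08:00:19
2019-01-04 09:00:19
2019-01-04 09:00:19
2019-01-04 10:07:38
2019-01-04 12:07:38
2019-01-04 12:07:38
2019-01-04 19:00:39
2019-01-04 19:02:27
EOF

awk '
    # Move the counts of the file just read into its own column of cell[].
    function flush(   h, r) {
        r = 0
        for (h = 0; h <= 23; h++)
            if (h in cnt) { r++; cell[r, fidx] = h " " cnt[h] }
        if (r > maxr) maxr = r
        for (h in cnt) delete cnt[h]
    }
    FNR == 1 { if (NR > 1) flush(); fidx++ }
    { cnt[substr($0, 12, 2) + 0]++ }     # hour = characters 12-13
    END {
        flush()
        for (r = 1; r <= maxr; r++) {
            line = ""
            for (f = 1; f <= fidx; f++)
                line = line sprintf("%-8s", cell[r, f])
            print line
        }
    }
' File1 File2 File3 > columns.txt
cat columns.txt
```

On the sample data this prints five rows, with File1's three hour/count pairs in column 1 and empty cells where a file has no further hours, matching the ragged layout requested above.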
Did you try any of the posted scripts?
|