LinuxQuestions.org > Forums > Linux Forums > Linux - General
Old 01-04-2019, 08:59 AM   #1
sagardauti123
LQ Newbie
 
Registered: Jan 2019
Posts: 4

Rep: Reputation: Disabled
Question: working with a for loop / alternate ways welcome


I have 3 files named log.2018-12-10.gz, log.2018-12-11.gz, log.2018-12-13.gz.

These 3 files contain records in date/time format (the date matches the file name).

The aim is to get hour-wise record counts (hours 08 to 22).

I have used the command below on UNIX, and the output is:

Command:

for i in `ls -1 za.log.2018-12-1[0-3]*`; do zcat $i|grep -i abcd|cut -c 5-6|egrep "0[8-9]|1[0-9]|2[0-2]"|sort|uniq -c;done


Output:

473 08
765 09
957 10
1085 11
1220 12
1205 13
1143 14
1035 15
920 16
752 17
653 18
526 19
389 20
153 21
130 22
395 08
642 09
877 10
1055 11
1163 12
1130 13
935 14
986 15
929 16
724 17
578 18
537 19
317 20
169 21
119 22

Note: here the first column is the count and the second one is the hour.

I want the resulting output arranged column-wise as below:

473 8 395 8 462 8
765 9 642 9 704 9
957 10 877 10 906 10
1085 11 1055 11 953 11
1220 12 1163 12 1180 12
1205 13 1130 13 628 13
1143 14 935 14 645 14
1035 15 986 15 899 15
920 16 929 16 896 16
752 17 724 17 679 17
653 18 578 18 689 18
526 19 537 19 492 19
389 20 317 20 391 20
153 21 169 21 138 21
130 22 119 22 107 22


Workaround: I have created separate output files and then used the paste command, but I want to do it in a single command.

(There are more than 3 files.)

Last edited by sagardauti123; 01-04-2019 at 09:01 AM.
 
Old 01-04-2019, 09:20 AM   #2
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
Quote:
Originally Posted by sagardauti123 View Post
I have 3 files named- log.2018-12-10.gz,log.2018-12-11.gz,log.2018-12-13.gz.

These 3 files contains records in date/time format. (date is according to file name).

Aim is to sum hourwise (08 AM to 22 PM) of total records.

I have used below command in UNIX and output is as-

Command-

for i in `ls -1 za.log.2018-12-1[0-3]*`; do zcat $i|grep -i abcd|cut -c 5-6|egrep "0[8-9]|1[0-9]|2[0-2]"|sort|uniq -c;done
Not sure what your question is. Is the output incorrect? Do you want a better way?

It's hard to say anything without knowing the input and the expected output, but the script can be improved as follows:
  • Why do you use ls to list the files? Just say for i in za.*.
  • You most likely have a command named zgrep that allows you to remove the zcat.
  • sort has an option -u. No need to pipe the output to uniq.
For more feedback, provide the data. And please, add code tags (see below) to commands, input and output. Much easier to read.
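As a sketch, the first two suggestions combined might look like the following (assuming zgrep is available; the cut range is kept from the original command and must match wherever the hour actually sits in the logs; sort | uniq -c is kept because the per-hour counts are needed):

```shell
# A sketch combining the suggestions above (assumes zgrep is installed and
# that the hour occupies characters 5-6, as in the original command):
for i in za.log.2018-12-1[0-3]*; do       # glob instead of `ls`
    # zgrep replaces zcat | grep; grep -E replaces the deprecated egrep;
    # sort | uniq -c stays, because the per-hour counts are needed
    zgrep -i abcd "$i" | cut -c 5-6 | grep -E "0[8-9]|1[0-9]|2[0-2]" | sort | uniq -c
done
```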
 
Old 01-04-2019, 09:26 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,945

Rep: Reputation: 7325
Yes, we need to know more (an example input file would probably be better).
It looks like a single awk/perl/python script can do this (and will be faster).
 
Old 01-04-2019, 09:45 AM   #4
sagardauti123
LQ Newbie
 
Registered: Jan 2019
Posts: 4

Original Poster
Rep: Reputation: Disabled
Lightbulb

Just adding a response to the points above.

I have used ls because there are other files in my directory, and I used sort | uniq -c to get hour-wise counts; sort -u will not work here.

Just adding more clarity to my question:
I want to work on structured output.
As you can see, my output is in a one-column format (count, hour).
I have shown the required output above.

There are other records/entries in my file, so I have used the cut command to extract the hour field (hh).

Last edited by sagardauti123; 01-04-2019 at 09:47 AM.
 
Old 01-04-2019, 01:11 PM   #5
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
If the files whose names start with za are regular files, and their names contain no funny characters, then this:
Code:
for i in $(ls -1 za*)
is the same as
Code:
for i in za*
If the filenames contain blanks or other characters that the shell interprets as separators, the “ls” solution will not produce the desired result, but the second solution might.

I now understand what you want: Sum of all records for hour 8 for file 1, then file 2, then file 3 in the first line. Then the same for hour 9 in the second line, and so on.

The problem is that the pipeline inside the for loop first produces output for file 1, then file 2, then file 3 sequentially, so it can’t create the columns you want. Instead, you need to collect the data for all files, then display the count when you have reached the last line in file 3. This is not that easy, and I would say awk is the right tool for it, as pan64 hinted.

To do this, we still need to know the input format, at least the date/time format in the input.

Last edited by berndbausch; 01-04-2019 at 01:19 PM.
 
Old 01-04-2019, 09:09 PM   #6
sagardauti123
LQ Newbie
 
Registered: Jan 2019
Posts: 4

Original Poster
Rep: Reputation: Disabled
OK, below is sample file record data, where the hour field starts at the 12th character (in my command I used cut -c 5-6 for temporary data; here it would be cut -c 12-13).

Sample File-
log.2018-12-10.gz :
10-12-2018 00:01:15 abcd ......
10-12-2018 03:12:17 abcd ......
.
.
.
10-12-2018 08:16:14 abcd .....
10-12-2018 10:12:01 abcd .....
.
.
.
10-12-2018 22:12:12 abcd .....
10-12-2018 23:01:01 abcd .....

The same goes for the other files; you can say there are day-wise files for every month.
We require the sum of records for hours 08-22 for specific files, hence I filtered with grep and a wildcard.

My command's output is one column of hour-wise sums for all files, and I need file1's output in the 1st column, file2's in the 2nd, file3's in the 3rd, and so on.

Actual output format-

File1
(Sum hour)
1234 08
1232 09

File2
1243 08
1263 22

File3
5423 08
3456 12
3453 22

Expected output:
File1 in the 1st column,
File2 in the 2nd column,
File3 in the 3rd column.
 
Old 01-04-2019, 10:15 PM   #7
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
How well do you know awk?

This is what I came up with, running the following awk program with three file arguments (log.*):
Code:
$ awk '{count[substr($2,1,2)] += 1 }
ENDFILE { for (c in count) result[c] = result[c] " " count[c] " " c; delete count }
END { for (r in result) print result[r] }' log.*
 1 08 1 08 1 08
 2 00 3 00 2 00
 2 10 2 10 2 10
 1 03 1 03 1 03
 1 22 1 22 1 22
 1 23 1 23 2 23
The program is based on one of awk's most powerful features, associative arrays.

The first line collects the lines for each hour.
The hour is used as the index of the array count; to get the hour, I use the substr() function to peel the first two characters off the second field ($2) in each line.

Second line: ENDFILE only exists in the GNU version of awk, which is normally the version in Linux distros. It may or may not work on BSD or other UNIXes. ENDFILE signals that the end of a file has been reached. At this point, I go through the count array and add the result to another array named result. c is an hour, count[c] is the number of times the hour occurred in a file.
After that, I delete the count array so that I can start from scratch with a new file.

Third line: END signals the overall end of input. At that point, I dump out the result array. Unfortunately, associative arrays are not sorted in any way.

Exercises: Correct sorting (I'd pipe the output into the sort command), and labeling the columns with the file names (needs to be added to the awk program I think).

I warmly recommend the awk guide referenced in my signature.

EDIT: I wrote the program based on your comment in the original post:
Quote:
I want the resulting output in column wise as below-

473 8 395 8 462 8
765 9 642 9 704 9
957 10 877 10 906 10
which is not what you say in post 6.

Last edited by berndbausch; 01-04-2019 at 10:17 PM.
 
Old 01-05-2019, 02:31 AM   #8
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,806

Rep: Reputation: 1207
Postprocess your loop with awk (or perl, or bash 4 builtins), which provide associative arrays. The array is indexed by column 2, and each element collects the string of counts; the array is printed at the end.
Code:
for i in za.log.2018-12-1[0-3]*.gz; do zcat "$i"|grep -i abcd|cut -c 5-6|egrep "0[8-9]|1[0-9]|2[0-2]"|sort|uniq -c;done | awk '{A[$2]=(A[$2] " " $1)} END {for (i in A) print i,A[i]}'
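To see what the appended awk stage does in isolation, it can be run on a fixed two-column sample (count hour), abbreviated here from the output in the original post; sort -n fixes the line order, since for-in iteration order is unspecified in awk:

```shell
# Demonstration of the awk collection stage alone, on an abbreviated
# "count hour" sample (first two hours of two files):
printf '473 08\n765 09\n395 08\n642 09\n' |
    awk '{ A[$2] = A[$2] " " $1 }            # append each count to its hour
         END { for (i in A) print i, A[i] }' |
    sort -n
```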
 
Old 01-05-2019, 03:59 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,945

Rep: Reputation: 7325
I do not really know if we need that for loop.
Also zcat and grep can be combined, so:
Code:
zgrep -i abcd za.log.2018-12-1[0-3]*.gz # should work; remember, with multiple files grep adds the filename to the output
Also, the two greps can probably be combined:
Code:
zgrep -Ei "201[89] (0[8-9]|1[0-9]|2[0-2]):.*abcd" za.log.2018-12-1[0-3]*.gz # or something similar; even cut can be eliminated this way
And now sort and uniq are completely superfluous, because awk can sum up what you need without them;
you just need to use associative arrays, as mentioned:
Code:
awk ' BEGIN { FS="[: ]" }   # set field separator to something convenient
                            # now $1 is the filename and $3 is the hour (if I didn't miss something)
      { A[$3][$1]++ }       # this is the sort/uniq in one (arrays of arrays: gawk 4+ only)
      END {
        for (hour in A) {
           printf "%s:", hour
           for (file in A[hour]) {
               printf " %s", A[hour][file]
           }
           printf "\n"
        }
     }                     # print the required result (something like this)
This double loop on the array A is quite similar to the MULTI used here: https://stackoverflow.com/questions/...nsional-arrays, see the last comment.
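As a hedged aside: the A[$3][$1] construct above needs gawk 4 or later. A portable sketch for any POSIX awk uses a compound (hour, file) key instead; the zgrep invocation and file names are assumed from the earlier posts, and the input is expected to be zgrep's multi-file output, i.e. lines of the form "filename:dd-mm-yyyy hh:mm:ss ...":

```shell
# Portable sketch: a compound (hour, file) key works in any POSIX awk,
# unlike the gawk-4-only A[$3][$1].
zgrep -i abcd za.log.2018-12-1[0-3]*.gz |
    awk 'BEGIN { FS = "[: ]" }              # $1 = filename, $3 = hour
         { A[$3, $1]++ }                    # count per (hour, file) pair
         END { for (k in A) { split(k, p, SUBSEP)
                              print p[1], p[2], A[k] } }' |
    sort -n
```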
 
Old 01-08-2019, 03:30 AM   #10
sagardauti123
LQ Newbie
 
Registered: Jan 2019
Posts: 4

Original Poster
Rep: Reputation: Disabled
Trying to simplify my original post.

I need hour-wise record counts, one column per file (the counts for each new file go in the next column), as shown below. The hour starts at the 12th character position.


My file has below data-

File1
2019-01-04 00:00:19
2019-01-04 00:00:19
2019-01-04 00:00:19
2019-01-04 01:07:38
2019-01-04 01:07:38
2019-01-04 01:07:38
2019-01-04 08:00:39
2019-01-04 08:02:27

File2

2019-01-04 00:00:19
2019-01-04 01:00:19
2019-01-04 02:00:19
2019-01-04 02:07:38
2019-01-04 02:07:38
2019-01-04 10:07:38
2019-01-04 10:00:39
2019-01-04 13:02:27

File3

2019-01-04 08:00:19
2019-01-04 09:00:19
2019-01-04 09:00:19
2019-01-04 10:07:38
2019-01-04 12:07:38
2019-01-04 12:07:38
2019-01-04 19:00:39
2019-01-04 19:02:27


Output

0 3 0 1 8 1
1 3 1 1 9 2
8 2 2 3 10 1
10 2 12 2
13 1 19 2


Here each file's hour/count pairs form one column: File1's in the 1st, File2's in the 2nd, File3's in the 3rd (the original post colour-coded them: red for File1, blue for File2, black for File3).
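A hedged sketch for this simplified example, as a single command (assumes bash for process substitution; count_hours is a hypothetical helper name, not from the thread):

```shell
# count_hours is a hypothetical helper: take the hour from characters 12-13,
# reorder uniq -c output to "hour count", and let $2+0 strip the leading zero.
count_hours() { cut -c 12-13 "$1" | sort | uniq -c | awk '{ print $2+0, $1 }'; }

# paste with process substitution puts each file's pairs in its own column:
paste <(count_hours File1) <(count_hours File2) <(count_hours File3)
```

With uneven hour lists per file, paste simply leaves the exhausted columns empty, which matches the ragged expected output above.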

Last edited by sagardauti123; 01-08-2019 at 03:39 AM.
 
Old 01-08-2019, 03:40 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,945

Rep: Reputation: 7325
Did you try any of the posted scripts?
 
  

