LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 10-06-2015, 11:23 AM   #1
jzoudavy
Member
 
Registered: Apr 2012
Distribution: Ubuntu, SUSE, Redhat
Posts: 188

Rep: Reputation: Disabled
awk to parse time and file transfer size


Hi all

I have an FTP server that I am trying to do some performance monitoring on, and I have the vsftpd logs.

Code:
[root@localhost log]# tail xferlog-20150906
Tue Oct  6 09:07:49 2015 1 192.168.10.1 1170448 /home/15.2.129.tar.gz  
Tue Sep  1 17:49:27 2015 1 192.168.10.2 0 /home/15.3.129.tar.gz  
Wed Sep  2 10:34:01 2015 1 192.168.10.11 0 /home/15.3.129.tar.gz
What I want is to find out the transfer size per day (the 8th field) and per hour. I have worked out how to get the number of transfers per day, but how to break it down by day and hour eludes me.

This is what I got so far:
Code:
## reads number of downloads per day
awk '{count[$3]++} END {for(j in count) print j,"("count[j]" bytes)"}' xferlog*


## reads total transfer size from all available logs, i.e. the last 30 days --> works now thanks to HMW
awk '{sum+=$8} END {print sum}' xferlog*


## reads number of downloads by the hour --> doesn't work
awk '{count[$4]++} END {for(j in count) print j,"("count[j]" bytes)"}' xferlog*
Can anyone help me out?

#added for clarity below
The problem I have is that if I just added all the 01:00 downloads together, I would get the 1 am downloads for every day summed together.
What I am looking for is something like this:

Sept 1st 1am downloads 5GB.
Sept 2nd 1am downloads 2GB.

I am not sure how to structure the array within awk to get that.

I also need to strip out the time from hh:mm:ss format into just hh so I can aggregate the hourly transfers. --> this part I am fairly certain I almost got it. sed to the punishment!
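A minimal sketch of that idea (assuming the field layout in the xferlog sample above): one awk array keyed on month, day, and hour together keeps each day's hours separate, and awk's substr can pull the hh out of hh:mm:ss, so no separate sed pass is needed.

```shell
# key the running sum on month + day-of-month + hour;
# substr($4, 1, 2) takes "hh" out of "hh:mm:ss"
awk '{ key = $2 " " $3 " " substr($4, 1, 2); sum[key] += $8 }
     END { for (k in sum) print k "h:", sum[k], "bytes" }' xferlog*
```

This would print one line per distinct day-and-hour, e.g. `Sep 1 17h: 50 bytes` (the order of `for (k in sum)` is unspecified in awk, so pipe to sort if you want it ordered).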

Thanks

Last edited by jzoudavy; 10-06-2015 at 02:15 PM. Reason: updated the question
 
Old 10-06-2015, 11:43 AM   #2
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369
Well, no expert in awk. But maybe I can help you with this part:
Quote:
reads size of total monthly transfer
So, let's say the log file looks like this:
Code:
Tue Oct  6 09:07:49 2015 1 192.168.10.1 10 /home/15.2.129.tar.gz  
Tue Sep  1 17:49:27 2015 1 192.168.10.2 20 /home/15.3.129.tar.gz  
Wed Sep  2 10:34:01 2015 1 192.168.10.11 30 /home/15.3.129.tar.gz
This awk prints out the expected result (60):
Code:
awk '{ sum+=$8 } END { print sum }' vsftp.log 
60
So, using the same logic (or lack thereof, awk still eludes me most of the time!) you ought to be able to solve your third question.

Best regards,
HMW
 
1 member found this post helpful.
Old 10-06-2015, 12:19 PM   #3
jzoudavy
Member
 
Registered: Apr 2012
Distribution: Ubuntu, SUSE, Redhat
Posts: 188

Original Poster
Rep: Reputation: Disabled
@HMW: thanks for your help, the changes worked!
 
Old 10-06-2015, 12:38 PM   #4
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369
Quote:
Originally Posted by jzoudavy View Post
@HMW: thanks for your help, the changes worked!
Awesome. Please mark the thread as [SOLVED] if you consider this problem thus.

Best regards,
HMW
 
Old 10-06-2015, 01:43 PM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Ummmm ... I am a little confused (easily done sometimes). The OP said :- reads size of total monthly transfer. Now I am not knocking HMW's solution, as it is part way there on what I read that to mean, but the OP has come back and said this is correct. Using the example data from HMW, I would have thought the following would be the correct output:
Code:
# input data
Tue Oct  6 09:07:49 2015 1 192.168.10.1 10 /home/15.2.129.tar.gz  
Tue Sep  1 17:49:27 2015 1 192.168.10.2 20 /home/15.3.129.tar.gz  
Wed Sep  2 10:34:01 2015 1 192.168.10.11 30 /home/15.3.129.tar.gz

# output per month
Oct 10
Sep 50
Please advise exactly what type of data you want, or whether the suggested solution is actually all you wanted.
 
Old 10-06-2015, 02:09 PM   #6
jzoudavy
Member
 
Registered: Apr 2012
Distribution: Ubuntu, SUSE, Redhat
Posts: 188

Original Poster
Rep: Reputation: Disabled
Hi grail

Sorry for the confusion. The current data set I have available is from the past month, Sept 6 till Oct 6th. So that is one month for me, or at least close enough for my purposes. I should have been clearer and said the past 30 days or so.
 
Old 10-06-2015, 02:23 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Quote:
Originally Posted by jzoudavy View Post
Hi grail

Sorry for the confusion. The current data set I have available is from the past month, Sept 6 till Oct 6th. So that is one month for me, or at least close enough for my purposes. I should have been clearer and said the past 30 days or so.
And that is fine, but should you have, say, 2 months' worth in a single file, the current solution will only give you the total of all entries in the file and not break it down.
I just wanted you to be aware.

Plus, as you also mentioned by hour, you will need more code to provide that level of detail.

May I also mention, you can create a complete awk script instead of several single awk lines in a bash script:
Code:
#!/usr/bin/awk -f

<your_code_here>
Then if you make it executable you can run it as you would your bash script.
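For instance, the per-day count from the first post could become a standalone file (daily_counts.awk is a hypothetical name):

```shell
# create the hypothetical standalone awk script
cat > daily_counts.awk <<'EOF'
#!/usr/bin/awk -f
# count transfers per day-of-month (field 3) across all files on the command line
{ count[$3]++ }
END { for (j in count) print "day " j ": " count[j] " transfer(s)" }
EOF
chmod +x daily_counts.awk
# then run it just like a bash script:  ./daily_counts.awk xferlog*
```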
 
Old 10-06-2015, 03:02 PM   #8
jzoudavy
Member
 
Registered: Apr 2012
Distribution: Ubuntu, SUSE, Redhat
Posts: 188

Original Poster
Rep: Reputation: Disabled
Hi grail

Thanks for the tip on awk scripting, I did not know about that.

I am still not sure on the logic of how to modify the code to pick up dates and hours, though.
 
Old 10-06-2015, 04:12 PM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Here is a rough idea of what I would do:
Code:
# using this input file
Tue Oct  6 09:07:49 2015 1 192.168.10.1 11 /home/15.2.129.tar.gz  
Tue Sep  1 17:47:27 2015 1 192.168.10.2 10 /home/15.3.129.tar.gz  
Tue Sep  1 17:48:27 2015 1 192.168.10.2 20 /home/15.3.129.tar.gz  
Tue Sep  1 17:49:27 2015 1 192.168.10.2 30 /home/15.3.129.tar.gz  
Tue Sep  1 18:49:27 2015 1 192.168.10.2 40 /home/15.3.129.tar.gz  
Tue Sep  1 18:49:27 2015 1 192.168.10.2 50 /home/15.3.129.tar.gz  
Wed Sep  2 10:34:01 2015 1 192.168.10.11 60 /home/15.3.129.tar.gz

# and this code
#!/usr/bin/awk -f

NR > 1 && !( $2 in mon_sum ){
	print "Totals for the month of " month ":"

	for( d in per_day_per_hour )
		for( h in per_day_per_hour[d] )
			print " Day :- " d " hour :- " h " has a sum of :- " per_day_per_hour[d][h]

	print "Monthly total is :- " mon_sum[month]

	delete per_day_per_hour
}

{
	month = $2
	mon_sum[month] += $8

	split($4, hour, ":")

	per_day_per_hour[$3][hour[1]] += $8
}

END{
	print "Totals for the month of " month ":"

	for( d in per_day_per_hour )
		for( h in per_day_per_hour[d] )
			print " Day :- " d " hour :- " h " has a sum of :- " per_day_per_hour[d][h]

	print "Monthly total is :- " mon_sum[month]
}

# produces this output (note: per_day_per_hour[d][h] needs GNU awk 4+)
Totals for the month of Oct:
 Day :- 6 hour :- 09 has a sum of :- 11
Monthly total is :- 11
Totals for the month of Sep:
 Day :- 1 hour :- 17 has a sum of :- 60
 Day :- 1 hour :- 18 has a sum of :- 90
 Day :- 2 hour :- 10 has a sum of :- 60
Monthly total is :- 210
I would of course put the repeated stuff in a function, but you get the idea
 
1 member found this post helpful.
Old 10-07-2015, 01:41 AM   #10
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369
Quote:
Originally Posted by grail View Post
ummmm ... I am a little confused (easily done some times). OP said :- reads size of total monthly transfer. Now I am not knocking HMW's solution as it is part way there on what I read that
to mean, but the OP has come back and said this is correct.
Ah, yes. Well spotted! For that level of detail I would probably have used a Python script instead. Nice awk script there grail!

Best regards,
HMW
 
Old 10-07-2015, 03:46 AM   #11
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369
Ok... it turns out I managed to do this in Bash (with just a little awk in there). But the logic was trickier than I expected, or maybe it's the fact that I only got four hours of sleep last night. Anyway, with this script:
Code:
#!/bin/bash

MONTH=""
HOUR=""

while read line; do
    CURR_MONTH=$(echo "$line" | awk '{ print $2 }')
    CURR_HOUR=$(echo "$line" | awk '{ print $8 }')
    if [[ $CURR_MONTH == $MONTH ]]; then
        ((HOUR=$HOUR+$CURR_HOUR))
    else # Update $MONTH and reset & add new hour(s) to new $MONTH
        if [[ $MONTH != "" ]]; then
            echo "Month $MONTH total $HOUR hours"
        fi  
        MONTH=$(echo "$line" | awk '{ print $2 }')
        HOUR=$CURR_HOUR
    fi  
done < vsftp.log

# If we have reached EOF, print out the final month
echo "Month $MONTH total $HOUR hours"

With the same infile as grail, I get this output:
Code:
$ ./parselog.sh 
Month Oct total 11 hours
Month Sep total 210 hours
My script is not as fancy as grail's, but I just wanted to give it a go based on the month variable.

Best regards,
HMW

Last edited by HMW; 10-07-2015 at 03:52 AM.
 
Old 10-07-2015, 06:25 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
I like the idea, HMW, but I am generally not a fan of using outside commands in my bash scripts unless I really have to, or unless it really gives a huge speed boost.
Also, a quick nitpick: you are totaling the downloaded data, not the hours.

So here are a couple of quick variants to show what I would do in bash (again just for downloads per month):
Code:
#!/usr/bin/env bash

MONTH=""
# Below you have 2 alternative while/read combos

# option 1
#while read _ c_month _ _ _ _ _ c_size _; do

# option 2 (if you want to try option 1, comment the next 3 lines)
while read -a data; do
	c_month=${data[1]}
	c_size=${data[7]}

	if [[ $c_month == $MONTH ]]; then
		(( size += c_size ))
	else # Update $MONTH and reset the running size for the new $MONTH
		[[ $MONTH ]] && echo "Month $MONTH total $size size"

		MONTH=$c_month
		size=$c_size
	fi  
done < "$1"

# If we have reached EOF, print out the final month
echo "Month $MONTH total $size size"
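As an aside on the `(( size += c_size ))` line, since it differs from a plain assignment: inside `(( ... ))` bash evaluates integer arithmetic, and variables do not need a `$`. Outside it, `+` between variables just builds a string:

```shell
size=10; c_size=5
(( size += c_size ))      # arithmetic context: integer addition
echo "$size"              # prints 15

str=10
str=$str+5                # no (( )): plain string concatenation
echo "$str"               # prints 10+5
```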
 
1 member found this post helpful.
Old 10-07-2015, 06:53 AM   #13
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369
Quote:
Originally Posted by grail View Post
Also, quick nit pick, you are totaling the downloaded data and not the hours.
Yes, I know. Your awk was more complete in that regard. I ran out of gas.

Thanks for your version, appreciate it. Especially the approach to read data (line) into an array with the -a option. Gonna save that in memory!

Thanks again for taking the time buddy!

Best regards,
HMW

Zzzzzzz...
 
Old 10-07-2015, 03:55 PM   #14
jzoudavy
Member
 
Registered: Apr 2012
Distribution: Ubuntu, SUSE, Redhat
Posts: 188

Original Poster
Rep: Reputation: Disabled
Hey HMW and grail,

Thanks for all your help on this. I have a question:

why the double brackets?

(( size += c_size ))
((HOUR=$HOUR+$CURR_HOUR))
 
Old 10-07-2015, 05:21 PM   #15
jzoudavy
Member
 
Registered: Apr 2012
Distribution: Ubuntu, SUSE, Redhat
Posts: 188

Original Poster
Rep: Reputation: Disabled
Actually, I have a follow-up question.
I have sanitized the input a bit more to make life easier:
for the DL rate I am simply skipping everything after the decimal.

It was like this: 67402.46Kbyte/sec, but arithmetically that doesn't work, so I just used sed to separate out the decimal and the Kbyte/sec and used just 67402.

Code:
 
Oct 7 13 36 42 208430626 bytes, 67402 46 Kbyte/sec
Oct 7 13 36 53 7004609 bytes, 55082 20 Kbyte/sec
Oct 7 13 36 53 7004596 bytes, 38641 45 Kbyte/sec
Oct 7 13 36 53 7004326 bytes, 53266 48 Kbyte/sec
Oct 7 13 36 53 7003780 bytes, 48976 23 Kbyte/sec
Oct 7 13 37 23 11188721 bytes, 57261 59 Kbyte/sec
Oct 7 13 37 23 11187409 bytes, 49023 38 Kbyte/sec
Oct 7 13 38 12 2013066706 bytes, 45416 61 Kbyte/sec
Oct 7 13 38 15 2344883553 bytes, 48741 71 Kbyte/sec
Oct 6 09 07 49 1170448 bytes, 28916 61 Kbyte/sec
and modified/experimented with the script that you have both provided, shown below, but I keep getting errors at lines 16 and 21 about the brackets.
Code:
#!/bin/bash
#declare everything
MONTH=""
DAY=""
HOUR=""
DL_SIZE=0
DL_RATE=0


# read in via an array; if I use awk then DL_SIZE gets treated as a string and the whole thing becomes string concatenation
while read -a line; do 
    CURR_MONTH=${line[1]}
    CURR_DAY=${line[2]}
    CURR_HOUR=${line[3]}
    CURR_DL_SIZE=${line[6]}
    CURR_DL_RATE=${line[8]}
	
#if same month, same day and same hour, add the DL size and DL rate, DL rate for avg hourly transfer rate, which will be implemented later

    if [ [ $CURR_MONTH == $MONTH ] && [ $CURR_DAY == $DAY ] && [ $CURR_HOUR == $HOUR ] ] 
	then
        DL_SIZE=$DL_SIZE+$CURR_DL_SIZE
	DL_RATE=$DL_RATE+$CURR_DL_RATE
		
    else # Update $MONTH and $DAY and $HOUR  
        if [ [  $MONTH != "" ] ]
	then
            echo "$MONTH $Day $HOUR total $DL_SIZE bytes " # introduce average DL rate later
        fi  
        MONTH=${line[1]}
	DAY=${line[2]}
	HOUR=${line[3]}
        
    fi  
done < simplified.vsftpd.log

# If we have reched EOF, print out the final month
echo "$MONTH $Day $HOUR total $DL_SIZE bytes "
When I run it I get this output and I have no idea why... lines 16 and 32 are my if statements.

Code:
 
./ftp_analyzer.sh: line 16: [: 7: binary operator expected
./ftp_analyzer.sh: line 22: [: too many arguments
./ftp_analyzer.sh: line 16: [: too many arguments
./ftp_analyzer.sh: line 22: [: too many arguments
./ftp_analyzer.sh: line 16: [: too many arguments
./ftp_analyzer.sh: line 22: [: too many arguments
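For reference, those `[: too many arguments` errors are what bash produces when `[[` is split into `[ [`: the shell then runs the `[` command with a stray `[` as an argument. A sketch of what the corrected test and sum lines could look like (the sample values below are hypothetical stand-ins for one parsed log line):

```shell
# hypothetical sample values standing in for one parsed log line
MONTH=Sep; DAY=1; HOUR=13; DL_SIZE=0; DL_RATE=0
CURR_MONTH=Sep; CURR_DAY=1; CURR_HOUR=13
CURR_DL_SIZE=7004609; CURR_DL_RATE=55082

# [[ ]] is one token (no space between the two brackets), and several
# tests can share a single [[ ]] joined with &&
if [[ $CURR_MONTH == $MONTH && $CURR_DAY == $DAY && $CURR_HOUR == $HOUR ]]; then
    # arithmetic context; DL_SIZE=$DL_SIZE+$CURR_DL_SIZE would concatenate strings
    (( DL_SIZE += CURR_DL_SIZE ))
    (( DL_RATE += CURR_DL_RATE ))
fi

echo "$MONTH $DAY $HOUR total $DL_SIZE bytes"
```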
 
  

