[SOLVED] Display lines before and after string until blank line

cosminel · 12-16-2013, 01:38 AM

Hi druuna, I just tested your latest command and it works flawlessly.

Maybe I can achieve the same results by slightly modifying grail's version. I will try this myself with the very limited awk knowledge I have. The syntax is a bit intimidating but I will study the manual.

Thank you both for the help!

druuna · 12-16-2013, 03:01 AM

@cosminel: Maybe these links will help you with understanding awk a bit better:

- Awk by example, Part 1
- Awk by example, Part 2
- The GNU Awk User's Guide

cosminel · 12-16-2013, 06:34 AM

Thank you for the links.

awk is quite powerful and useful for the things I want to implement at work. Nobody asked me to do it, it's not even in my job description but I really enjoy writing scripts that make everybody's work a bit easier and take advantage of the tools we already have.

Happy holidays!

cosminel · 12-16-2013, 02:06 PM

Ok... I decided that it could be easier to define the record separator with the packets timestamps from tcpdump. It seems trivial but after reading awk documentation for more than 3 hours I am beginning to get frustrated. Timestamp is in this format: hh:mm:ss.xxxxxxxxx. So I want this input:

Code:

19:23:40.439349638
line of text
line of text
line of text

19:23:41.969359154
line of text
line of text
STRING
line of text
line of text

19:23:42.269329771
line of text
line of text
etc

...to become this through awk:

Code:

19:23:41.969359154
line of text
line of text
STRING
line of text
line of text

Basically I only need to define the RS in such a way so that awk will understand that anything in the timestamp format that I've shown represents the record separator. Oh and to include it in the line above the STRING but not in the line below it.

At this point, reading through the documentation provided and what I found on Google did not enlighten me in the slightest. Only after a lot of staring and reading did I find that defining RS as something longer than one character is only possible in gawk. OK, installed gawk, tried defining RS like this:

Code:

RS="[:digit:][:digit:]:[:digit:][:digit:]:[:digit:][:digit:].[:digit:][:digit:][:digit:][:digit:][:digit:][:digit:][:digit:][:digit:][:digit:]"

Obviously (to more experienced people), this did not work.

druuna · 12-16-2013, 02:55 PM

Quote:

Originally Posted by cosminel

Basically I only need to define the RS in such a way so that awk will understand that anything in the timestamp format that I've shown represents the record separator. Oh and to include it in the line above the STRING but not in the line below it.

To my knowledge you cannot print the RS itself.

You still haven't provided an answer to post #5 (provide a valid example of the input). The example in post #19 can be tackled like this:

Code:

awk 'BEGIN{ RS="\n\n" } $0 ~ /STRING/' input

BTW: You would need to use [[:digit:]] and not [:digit:]
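
A portable variant of the blank-line approach, as a sketch: setting RS to the empty string puts awk in POSIX "paragraph mode", where one or more blank lines end a record. A multi-character RS such as "\n\n" is treated as a regex by gawk and mawk, but POSIX leaves that case unspecified. The file name sample.txt is made up for illustration, and the sketch assumes the separating lines are truly empty (no stray carriage returns).

```shell
# Build a small sample shaped like the input from the earlier post
# (sample.txt is a hypothetical file name).
cat > sample.txt <<'EOF'
19:23:40.439349638
line of text
line of text

19:23:41.969359154
line of text
STRING
line of text

19:23:42.269329771
line of text
EOF

# RS = "" selects paragraph mode: records are separated by blank lines.
# The bare pattern prints every record that contains STRING,
# timestamp line included.
awk 'BEGIN { RS = "" } /STRING/' sample.txt
```

Because the timestamp is simply the first line of each paragraph, it is printed along with the rest of the matching block.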

cosminel · 12-16-2013, 03:36 PM

Hi druuna, indeed after using double square brackets I could successfully define the timestamps as RS. I only found this as [:digit:] in the documentation.

Wonder if it would be possible to compress the RS definition length by instructing awk to look for (2 digits):(2 digits):(2 digits).(9 digits)

As to the valid example request, I cannot paste the actual data that I want to process at this time but here is the full command:

Code:

tcpdump -nqt -s 0 -A -i any vlan | fgrep -B 6 -A 20 STRING | awk 'BEGIN{ RS="[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]].[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]" } $0 ~ /STRING/'

So basically tcpdump is outputting a lot of packets with timestamps which I then further filter out with fgrep by defining a "data window" before and after the STRING match (-B, -A) and this output is further filtered by awk.

I chose to use fgrep as an intermediary filtering tool because awk is quite slow and I am working with live packets. And there's a lot of them, leaving a tcpdump for 10 seconds yields ~ 18k packets.

Now that I've established how to properly filter out the data in order to have the desired output, I only need to find a way to include the RS in some way. I could live with timestamps printed before and after STRING, but no timestamps at all is not yet good enough for me.

grail · 12-16-2013, 06:40 PM

Firstly, RS, like all other internal variables, is happily printed using its name, i.e. print RS
The other nice point in awk is that a computed regex value for something like RS or FS (these are common ones)
will return the match for each record

As to the new requirement, I fail to see how the addition of the timestamp helps?
The data is still separated by blank lines and the timestamp for each block will be shown as part of the block.

Of course it is your script so you can change it as you see fit

As to the point about shortening the RS definition, assuming v4+ you can simply use curly braces to identify
the number of items required:

Code:

RS="[[:digit:]]{2}"

Lastly, the reason your notes will show only [:digit:] is because this is the character class and these are only ever used inside []
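
Both points can be tried outside awk, since grep -E uses the same POSIX extended regex syntax. This is only a sketch with invented sample lines; the variable name ts_re is made up.

```shell
# [:digit:] is only valid inside a bracket expression, hence [[:digit:]].
# {2} and {9} are interval expressions: "exactly this many repetitions".
ts_re='^([[:digit:]]{2}:){2}[[:digit:]]{2}\.[[:digit:]]{9}$'

# Only the first sample line has the hh:mm:ss.xxxxxxxxx shape.
printf '%s\n' '19:23:40.439349638' 'line of text' '9:23:40.4' |
    grep -E "$ts_re"
```

As far as I know, gawk releases before 4.0 only honour interval expressions when started with --re-interval (or --posix), which is why the "v4+" assumption matters here.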

cosminel · 12-16-2013, 07:16 PM

Thank you for your input grail.

Started from the simplest form for the awk syntax that works for my data:

Code:

awk '/STRING/' RS="[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]].[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]"

I tried adding {print RS} like you said:

Code:

awk '/STRING/{print RS}' RS="[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]].[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]"

The returned result is this:

Code:

[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]].[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]

To clarify the timestamp aspect, I found them to be the only reliable RS source. Some strange things happen when trying to define RS with single/double blank lines. I won't expand on this right now.

Right now the big issue is that on the actual server where I want to implement this command, there's only mawk 1.3.3 installed in which defining RS as above returns no results (and gives no funky error).

LE: It seems mawk doesn't support POSIX character classes.
LE2: Managed to define the timestamps as RS via mawk like this:

Code:

RS="[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"

grail · 12-16-2013, 08:13 PM

Bingo ... there are probably those that disagree but personally mawk, nawk and most other variants are generally rubbish
as they remove too many good standard features. Consequently it is always my first investigation in a distro
and the first thing I change to gawk if not already there.

I am guessing you are not allowed to have gawk installed?

You will find there are a number of other options that also will not work

Druuna may be able to assist more but I personally am not familiar with the other derivations.

You may also need to verify my previous comment about the variables being able to be printed but I would have thought
it should still support that even if not the computed regex separators.

As you have probably guessed, you may wish to include that information in further posts when raising questions
due to the possible incompatibilities that may arise from answers supplied

cosminel · 12-16-2013, 08:35 PM

I might be able to install gawk but I'm refraining from modifying anything on that server.

Anyway as I previously said I solved the issue by using:

Code:

RS="[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"

...which works as expected.

Regarding the printable variables issue... I tried enforcing the print RS in the form I showed in my previous post on my Ubuntu test machine (on which I do all my initial script/syntax testing) where gawk is installed (version 3.1.8).

Maybe there is another way to instruct awk to print the RS? I will try to find an answer for this but your input is welcomed and appreciated.

grail · 12-16-2013, 09:31 PM

Something you need to consider with this approach is that the RS with the time stamp you want will actually be the
previous record's RS.

Oh, and I am sorry but I misled you a little

Yes, RS is a regex, but the text it actually matched is stored in RT

So putting this together and using your previous example:

Code:

19:23:40.439349638
line of text
line of text
line of text

19:23:41.969359154
line of text
line of text
STRING
line of text
line of text

19:23:42.269329771
line of text
line of text

Once RS has matched timestamp and you are printing the record and RT you will see:

Code:

#Record 1
<this will be a blank as it is the first record prior to the RS>
19:23:40.439349638   # this is RT value
#Record 2
line of text
line of text
line of text

19:23:41.969359154   # this is RT value
#Record 3
line of text
line of text
STRING
line of text
line of text

19:23:42.269329771   # this is RT value
#Record 4
line of text
line of text
                     # this is RT value, blank as there was no match

As you can see this would not be what you want

As for reducing RS, you could do:

Code:

RS="([0-9]{2}:){2}[0-9]{2}[.][0-9]+\n"
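
A sketch of how this could be put together in gawk (RT is a gawk extension, so this will not run under mawk): since RT holds the separator that ended the current record, the timestamp belonging to a record is the RT saved from the previous one. The function name blocks_with_string is made up, and the repetitions are spelled out as [0-9][0-9] so that no interval support is needed.

```shell
# Hypothetical helper: print each timestamped block that contains STRING,
# with its own timestamp line in front. Requires gawk because of RT.
blocks_with_string() {
    gawk '
        BEGIN { RS = "[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9]+\n" }
        # prev is the separator seen just before this record,
        # i.e. the timestamp line of the current block.
        /STRING/ { printf "%s%s", prev, $0 }
        # RT is the text that RS matched at the end of this record.
        { prev = RT }
    ' "$@"
}
```

It can be fed a file (blocks_with_string input) or the end of a pipeline. Note that RS is not anchored to the start of a line, so a data line that happens to end in something timestamp-shaped would also split a record.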

druuna · 12-17-2013, 01:30 AM

You and grail have been busy since my last post and some valuable info has come to light.

About the generated output: Are you sure you posted the correct command (tcpdump -nqt -s 0 -A -i any vlan)? You use the -t flag which suppresses time stamping on each dump line. I'm going to assume that is a typo....

You also pipe the tcpdump command to fgrep, which could make life a bit easier for you. When using -A, -B and -C with fgrep (and grep) a group separator is put between matches:

Code:

$ fgrep -A1 -B1 "foo" input
line
foo
line
--
line2
foo
line2
--
line3
foo
line3

You could use this separator to get the individual blocks. If you do there won't be any need for using the time-stamp as RS in awk.

About the blank lines: Depending on what has been used to write the code you are grabbing, the lines could end with ^M (a carriage return, as used in DOS/Windows line endings), which is not part of a Linux/Unix line ending and might not be interpreted correctly or might show up as ^M. If the written code ends with a blank line, that last, not-so-blank line will contain ^M.

If the output is put in a file and you cat that file, you will not see the ^M; you will see a blank line instead (2 blank lines if counting the blank line that is created by tcpdump):

Code:

$ cat example
08:24:43.038749 IP 10.0.100.1.32786 > 239.255.255.250.1900: UDP, length 324
NT: upnp:rootdevice
NTS: ssdp:alive
USN: uuid:28802880-2880-1880-a880-000cf692e008::upnp:rootdevice


08:24:43.141931 IP 10.0.100.1.32786 > 239.255.255.250.1900: UDP, length 324
NT: upnp:rootdevice
NTS: ssdp:alive
USN: uuid:28802880-2880-1880-a880-000cf692e008::upnp:rootdevice


$ vi example
08:24:43.038749 IP 10.0.100.1.32786 > 239.255.255.250.1900: UDP, length 324
NT: upnp:rootdevice^M
NTS: ssdp:alive^M
USN: uuid:28802880-2880-1880-a880-000cf692e008::upnp:rootdevice^M
^M

08:24:43.141931 IP 10.0.100.1.32786 > 239.255.255.250.1900: UDP, length 324
NT: upnp:rootdevice^M
NTS: ssdp:alive^M
USN: uuid:28802880-2880-1880-a880-000cf692e008::upnp:rootdevice^M
^M

You could strip the ^M, but as I understand you are dealing with a lot of data and process speed might be negatively influenced by adding extra pipes/commands.
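
If extra pipe stages are a concern, the carriage returns could be stripped inside an awk program that is already part of the pipeline rather than with a separate command. A minimal sketch with made-up sample data:

```shell
# \r is the carriage return that vi shows as ^M.
# sub() removes it from the end of each line before the line is used.
printf 'NT: upnp:rootdevice\r\n\r\nNTS: ssdp:alive\n' |
    awk '{ sub(/\r$/, ""); print }'
```

Once the \r is gone, a line that contained only ^M is truly empty, so blank-line tests behave as expected.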

Hopefully the above will give you some extra insight in how to tackle your problem and why some solutions don't work.

cosminel · 12-17-2013, 05:53 AM

druuna, the reason I use awk in conjunction with fgrep -A -B is that while I can quickly filter out the tcpdump output, I cannot define an ideal "data window" because the number of lines in the data packets varies, so this is the reason I need to further filter it via awk.

Basically, fgrep can only use a fixed data window, while my data stream has a variable number of lines. This makes fgrep output too much data most of the time, but I sized the window that way to make sure I catch the packets with a larger number of lines.

I found the results output by awk with timestamps defined as RS to be more consistent, this way I am 100% filtering out all undesired data output coming from fgrep.

The good thing is that I found a way to define the timestamps as RS in mawk and, as far as I have researched, it seems that mawk is actually much faster than other versions of awk, which is good for me since I am processing live packets. Sure, it has some limitations as highlighted in my previous posts, but thankfully I managed to circumvent them.

Now I only need to find a way to print the RS but I am quite satisfied with the current results as well.

I really appreciate the help you gave me guys, I am truly grateful.

druuna · 12-17-2013, 06:19 AM

@cosminel: I think you misunderstand me.

You still might need to use tcpdump ... | fgrep ... | awk ...; what I'm saying is that fgrep, when using -A, -B or -C (which you do), puts -- between the entries found (what you call "data window"). Thus awk is fed with the data from tcpdump AND the -- from fgrep (see example in post #27).

I'm starting to wonder if your approach is wrong. If I assume this is used (from your post #21):

Code:

tcpdump -nqt -s 0 -A -i any vlan | fgrep -B 6 -A 20 STRING | awk 'BEGIN{ RS="shortened_too_long" } $0 ~ /STRING/'

You already grab the string you want using fgrep (the bold part) and individual entries are separated by --. fgrep also already defines the "data window" by using -B6 and -A20, why is the awk part needed?

Or maybe you haven't told us what it is you exactly(!!) want to do.....

And still no valid examples are posted by you.

cosminel · 12-17-2013, 07:33 AM

Uhm, sorry druuna but I already explained in my previous post. I will retry:
- fgrep defines a fixed data window containing the STRING but it also includes additional lines because...
- the data packets containing the STRING have a variable number of lines, some packets with STRING are shorter (let's say 10 lines) while other packets with STRING are longer (let's say 26)

So you see, I need further filtering in order to discard the unnecessary data which fgrep throws out. At this point awk comes into play with its RS definition.

Example:

Code:

timestamp 1
raw data from tcpdump
raw data from tcpdump
STRING
raw data from tcpdump
raw data from tcpdump

timestamp 2
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump

timestamp 3
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
STRING
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump

Ok, let's take the packet with timestamp 3. In order to filter it out with fgrep I need to use -B 3 -A 6. But when fgrep applies this to the packet with timestamp 1, it will include additional data from the packet with timestamp 2, data which I don't want because it belongs to a packet in which I have no interest, it doesn't even contain my STRING. So the output in this case would be:

Code:

timestamp 1
raw data from tcpdump
raw data from tcpdump
STRING
raw data from tcpdump
raw data from tcpdump

timestamp 2
raw data from tcpdump
raw data from tcpdump
--
timestamp 3
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
STRING
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump
raw data from tcpdump

As you can see, this part:

Code:

timestamp 2
raw data from tcpdump
raw data from tcpdump
--

...is extra data which fgrep outputs, it does not interest me and so I need to filter it out. This of course happens because I have set the fixed fgrep data window according to the packet with the biggest possible number of lines which could contain my STRING, in this example the packet with timestamp 3.

The way I posted these examples precisely reflects the actual data I want to filter. I hope I managed to clarify at this point.
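
One way this filtering could be done without a regex RS, sketched under the assumption that every packet starts with a line beginning hh:mm:ss: read line by line, start a new block at each timestamp line, and print a block only if it contained the STRING. Because the timestamp line is part of the block, it is printed too. The sketch avoids POSIX character classes, interval expressions and RT, so it should not depend on gawk. The names keep_string_blocks and emit_block are hypothetical.

```shell
# Hypothetical helper: keep only the packets that contain STRING.
keep_string_blocks() {
    awk '
        # Print the buffered block if it contained STRING, then reset.
        function emit_block() {
            if (hit) printf "%s", block
            block = ""; hit = 0
        }
        # The grep group separator ends the current (possibly cut) block.
        /^--$/ { emit_block(); next }
        # A timestamp line starts a new packet.
        /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ { emit_block() }
        /STRING/ { hit = 1 }
        { block = block $0 "\n" }
        END { emit_block() }
    ' "$@"
}
```

It would sit at the end of the existing pipeline, for example tcpdump ... | fgrep -B 6 -A 20 STRING | keep_string_blocks. If a packet is longer than the -B window, its timestamp line never reaches awk, so that block would be printed without one.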