[SOLVED] Bash Regular Expression help

Scottish_Jason · 11-25-2014, 05:54 PM

Hey guys
I am new to regular expressions and could use a little help if possible. Below is some output that I want to parse/filter so I have been trying my hand at some regular expressions.

I am trying to return any data samples with:

Size: 5000 to 30000
Bitrate: 320 or higher
Length: Not zero

After filtering everything to these requirements I then need to grab the previous line containing the link. I had this working before using the "-B 1" switch in order to get the previous line but once I changed my regex a bit it stopped working.

The best I have come up with so far for parsing the size is the following, but I am aiming for 5 meg to 30meg rather than 10 meg to 30meg

#Size: [0-9][0-9][0-9][0-9][K0-9][KB] #1 meg to 100meg
#Size: [0-3][0-5][0-9][0-9][0-9]KB #10 meg to 30meg

Code:

Sample data

[702] slsk://tarabusaw/E:/_GROOVESHARK/DMC/Commercial/2001/220/03 - DJ Luck & MC Neat Megamix - Les Adams.mp3
Size: 10607KB Bitrate: 192 Length: 0:00 Queue: 0 Speed: 22322 Free: Y filetype: mp3

Code:

Bash script to regex data ( currently not working )

file=$(  cat result.txt | grep 'Size: [0-3][0-5][0-9][0-9][0-9]KB' | grep 'Bitrate: [34][28][0-9]' | grep -v 'Length: 0:00' | grep -B 1 'slsk')
echo $file

Any help would be appreciated guys

norobro · 11-25-2014, 07:20 PM

First, you don't need to use cat. grep takes a file as input.

In your statement the first grep is only going to pipe one line to the next grep and so on. You can try putting -B 1 with each grep.
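
To illustrate the point about the pipeline, here is a small sketch. The two-line record is made up in the thread's format and written to a temporary file. In the chained version only the attribute line survives the first grep, so the final grep for 'slsk' has nothing left to match.

```bash
#!/usr/bin/env bash
# Hypothetical two-line record modelled on the sample data in the thread.
sample=$(mktemp)
printf '%s\n' \
  '[702] slsk://user/path/03 - Megamix.mp3' \
  'Size: 10607KB Bitrate: 320 Length: 4:10 Queue: 0 Speed: 22322 Free: Y filetype: mp3' \
  > "$sample"

# Chained greps: the link line is dropped by the first filter, so the
# last grep finds no 'slsk' (|| true only hides grep's "no match" status).
chained=$(grep 'Size: [0-9]*KB' "$sample" | grep 'Bitrate: 320' | grep -B 1 'slsk' || true)

# One grep with -B 1: the matching attribute line plus the line before it.
single=$(grep -B 1 'Size: [0-9]*KB.*Bitrate: 320' "$sample")

printf 'chained: [%s]\n' "$chained"
printf 'single:\n%s\n' "$single"
rm -f "$sample"
```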

Quote:

but I am aiming for 5 meg to 30meg rather than 10 meg to 30meg

As long as there is always a space between "Size:" and the first digit the first expression can be [\b,0-3].

One regex would make things a lot cleaner:

Code:

file=$(grep -B 1 'Size: [\b,0-3][0-5][0-9][0-9][0-9]KB.*Bitrate: [34][28][0-9].*Length: 0:00' result.txt | grep -C 1 'slsk')
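
A caveat on `[\b,0-3]`: inside a bracket expression `\b` is not a word boundary, and the class still has to consume exactly one character, so a four-digit size such as 7071KB cannot match a five-position pattern. One possible alternative, sketched below with extended-regex alternation, spells out the 5000 to 30000 range explicitly. The function name `size_in_range` is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the line has a Size between 5000KB and 30000KB.
#   [5-9][0-9]{3}  -> 5000..9999
#   [12][0-9]{4}   -> 10000..29999
#   30000          -> the upper bound itself
size_in_range() {
  printf '%s\n' "$1" | grep -Eq 'Size: ([5-9][0-9]{3}|[12][0-9]{4}|30000)KB'
}

for size in 4999 5000 7071 10607 30000 30001 100000; do
  if size_in_range "Size: ${size}KB Bitrate: 320"; then
    echo "$size: in range"
  else
    echo "$size: out of range"
  fi
done
```

Because the pattern is anchored by the literal "Size: " in front and "KB" behind, a longer number such as 100000 cannot sneak through on a partial match.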

Scottish_Jason · 11-25-2014, 07:26 PM

Quote:

Originally Posted by norobro

First, you don't need to use cat. grep takes a file as input.

In your statement the first grep is only going to pipe one line to the next grep and so on. You can try putting -B 1 with each grep.

As long as there is always a space between "Size:" and the first digit the first expression can be [\b,0-3].

One regex would make things a lot cleaner:

Code:

file=$(grep -B 1 'Size: [\b,0-3][0-5][0-9][0-9][0-9]KB.*Bitrate: [34][28][0-9].*Length: 0:00' result.txt | grep -C 1 'slsk')

Hey thanks a lot for the reply!

It appears that your regex only shows data with a length of 0:00 instead of data with a length not equal to 0:00.
The second line also appears. I'm trying to dump just the slsk: links up to .mp3, but of course only those that match the criteria. Thanks again for the help. Do you have any idea why this is happening?

grail · 11-25-2014, 07:37 PM

May want to be careful there as the current solution provided now includes Length = 0:00, which is what was being asked to exclude

Another interesting thing to factor in would be how well do you know the data prior to running the script?
I ask this because if out of maybe 1000s of lines there are potentially only a handful with 'slsk' in them, the script may be looking at the wrong information first (just a thought)

Another point, from memory bitrate is normally a fixed set of values, i.e. I do not think you could have a bitrate of 100 (could be wrong of course).
Assuming correct, [34][28][0-9] would yield results which cannot exist but may be in the data ... again just a thought
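
To make the reach of `[34][28][0-9]` concrete, the sketch below feeds every three-digit number through the pattern and prints the leading two digits of whatever matches. The function name `matched_prefixes` is made up for illustration.

```bash
#!/usr/bin/env bash
# Print the first two digits of every three-digit bitrate that
# 'Bitrate: [34][28][0-9]' accepts, so the covered ranges are easy to see.
matched_prefixes() {
  n=100
  while [ "$n" -le 999 ]; do
    printf 'Bitrate: %s \n' "$n"
    n=$((n + 1))
  done | grep 'Bitrate: [34][28][0-9] ' | cut -c10-11 | sort -u | tr '\n' ' '
}

matched_prefixes; echo
# -> 32 38 42 48
```

So the pattern accepts 320 to 329, 380 to 389, 420 to 429 and 480 to 489, and rejects everything in between. That is not the same thing as "320 or higher".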

Scottish_Jason · 11-25-2014, 07:41 PM

Quote:

Originally Posted by grail

May want to be careful there as the current solution provided now includes Length = 0:00, which is what was being asked to exclude

Another interesting thing to factor in would be how well do you know the data prior to running the script?
I ask this because if out of maybe 1000s of lines there are potentially only a handful with 'slsk' in them, the script may be looking at the wrong information first (just a thought)

Another point, from memory bitrate is normally a fixed set of values, i.e. I do not think you could have a bitrate of 100 (could be wrong of course).
Assuming correct, [34][28][0-9] would yield results which cannot exist but may be in the data ... again just a thought

Yes you are correct, it displays entries only with 0:00 length.
Also the concerns that you raised are not really a concern as the bitrate seems to consistently work and the slsk link is always the line preceding the attribute line (size etc).

You wouldn't happen to have a solution?

norobro · 11-25-2014, 08:35 PM

Sorry I overlooked the "-v".

I had to actually try my code.

Try this:

Code:

file=$(grep -B 1 -P 'Size: [\b,0-3][0-5][0-9][0-9][0-9]KB.*Bitrate: [1][9][0-9].*Length: (?!0:00)' result.txt | grep 'slsk' -C 1)
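
For anyone unfamiliar with the `(?!0:00)` part, it is a negative lookahead, which needs `grep -P`. The sketch below isolates just that piece. It assumes GNU grep built with PCRE support, which not every system has, and the function name `nonzero_length` is made up. Note also that `[1][9][0-9]` in the line above only matches bitrates 190 to 199.

```bash
#!/usr/bin/env bash
# Assumes GNU grep with -P (PCRE) support; other greps may not have it.
# (?!0:00) is a negative lookahead: the match fails if "0:00" comes
# immediately after "Length: ", and no characters are consumed either way.
nonzero_length() {
  printf '%s\n' "$1" | grep -Pq 'Length: (?!0:00)'
}

for len in 0:00 4:30 10:00; do
  if nonzero_length "Bitrate: 192 Length: $len Queue: 0"; then
    echo "$len: kept"
  else
    echo "$len: dropped"
  fi
done
```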

Scottish_Jason · 11-25-2014, 08:40 PM

Quote:

Originally Posted by norobro

Sorry I overlooked the "-v".

I had to actually try my code.

Try this:

Code:

file=$(grep -B 1 -P 'Size: [\b,0-3][0-5][0-9][0-9][0-9]KB.*Bitrate: [1][9][0-9].*Length: (?!0:00)' result.txt | grep 'slsk' -C 1)

hmmm I appear to get no results with that. I will walk through it step by step tomorrow when I am a bit more awake... Thanks for the help guys

syg00 · 11-25-2014, 08:47 PM

You need to show us more data.
Not the way I would have done it, but we can always learn - thank you all folks, including the OP.

norobro · 11-25-2014, 09:04 PM

@Scottish_Jason - Note that I changed the bit rate expressions to match the one line of data that you supplied.

Scottish_Jason · 11-25-2014, 10:06 PM

Quote:

Originally Posted by syg00

You need to show us more data.
Not the way I would have done it, but we can always learn - thank you all folks, including the OP.

results.txt

Search: mc neat Results from: User: dudu77
[711] slsk://dudu77/c:/users/eduardo/desktop/slsk/musics..................................()/balkan neat/03-(dunkelbunt)_feat_raf_mc_and_fanfare_ciocarlia-asfalt_tango.mp3
Size: 7071KB Bitrate: 96 Length: 10:03 Queue: 35 Speed: 9421 Free: N filetype: mp3

[712] slsk://dudu77/c:/users/eduardo/desktop/slsk/musics..................................()/balkan neat/06-(dunkelbunt)_feat_raf_mc_and_fanfare_ciocarlia-the_chocolate_butterfly.mp3
Size: 5066KB Bitrate: 96 Length: 7:12 Queue: 35 Speed: 9421 Free: Y filetype: mp3

[713] slsk://dudu77/c:/users/eduardo/desktop/slsk/musics..................................()/balkan neat/09-(dunkelbunt)_feat_stblocket-rauk_cocek_(dunkelbunt_rmx_feat_raf_mc).mp3
Size: 7258KB Bitrate: 96 Length: 10:19 Queue: 35 Speed: 9421 Free: Y filetype: mp3

---------
Search: mc neat Results from: User: shoom55
[714] slsk://shoom55/f:/albums/cd1/03 - nng ft kallahan & mc neat - right before my eyes.mp3
Size: 5275KB Bitrate: 160 Length: 4:30 Queue: 26 Speed: 16184 Free: N filetype: mp3

---------
Search: mc neat Results from: User: KiLLaBeeZ
[715] slsk://KiLLaBeeZ/d:/music/[=-various_artists-=]/va-pure_rnb_2-(retail)-2cd-2001-h3x/208-dj_luck_and_mc_neat_feat_jj-aint_no_stoppin_us_now.mp3
Size: 7574KB Bitrate: 180 Length: 5:44 Queue: 343 Speed: 34843 Free: N filetype: mp3

Scottish_Jason · 11-25-2014, 11:01 PM

Quote:

Originally Posted by norobro

@Scottish_Jason - Note that I changed the bit rate expressions to match the one line of data that you supplied.

Yes I see that, but it still should have returned 192k samples in that case

edit: Actually I am getting results now that I have changed the bitrate back to the previous one.. great!
only problem left is that it displays both lines. While writing this I think I just remembered about a switch that prints only one line? will go and check

edit: ohhh -C 1 .... and it is already implemented, hmm...
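
On the "both lines" problem: `-C 1` asks for one line of context on each side of a match, so the second grep puts the attribute line straight back. One possible way to keep only the links, sketched below, is to drop `-C` and use `-o`, which prints only the matched part of each line. The records are made up, and the plain 'Bitrate: 320' pattern is just a stand-in for the full filter.

```bash
#!/usr/bin/env bash
# Two hypothetical records in the thread's format; only the first has Bitrate 320.
data=$(mktemp)
printf '%s\n' \
  '[1] slsk://userA/music/track one.mp3' \
  'Size: 7071KB Bitrate: 320 Length: 4:10 Queue: 0 Speed: 9421 Free: Y filetype: mp3' \
  '' \
  '[2] slsk://userB/music/track two.mp3' \
  'Size: 5066KB Bitrate: 96 Length: 7:12 Queue: 35 Speed: 9421 Free: Y filetype: mp3' \
  > "$data"

# -B 1 keeps the link line above each matching attribute line;
# -o then prints only the part of a line that matches slsk://...mp3.
links=$(grep -B 1 'Bitrate: 320' "$data" | grep -o 'slsk://.*\.mp3')
echo "$links"
rm -f "$data"
```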

norobro · 11-25-2014, 11:07 PM

Try this:

Code:

grep -B 1 -P '[0-9][0-9][0-9][0-9][0-9K][KB][\s,B].*[0-9][0-9][\s,0-9].*(?!0:00)' result.txt | grep 'slsk' -C 1

syg00 · 11-25-2014, 11:24 PM

Hmmm, lots of possible corner cases.
If it must be done in bash, I'd probably extract the numeric values into an array, and do real arithmetic tests on the values. sed or grep can do the extraction easily.
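
A minimal sketch of that bash approach, assuming every attribute line starts with "Size:" and directly follows its link line, as in the posted data. The function name `filter_links` is made up; bash's `=~` operator does the extraction and `(( ))` does the arithmetic.

```bash
#!/usr/bin/env bash
# Hypothetical function: reads records on stdin, prints links that pass all tests.
filter_links() {
  local line prev='' re size rate len
  re='^Size: ([0-9]+)KB Bitrate: ([0-9]+) Length: ([0-9:]+)'
  while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
      size=${BASH_REMATCH[1]} rate=${BASH_REMATCH[2]} len=${BASH_REMATCH[3]}
      # Integer comparisons replace the digit-by-digit regex (10# forces base 10).
      if (( 10#$size >= 5000 && 10#$size <= 30000 && 10#$rate >= 320 )) && [[ $len != 0:00 ]]; then
        # Strip the leading "[NNN] " index so only the slsk:// link is printed.
        printf '%s\n' "${prev#*] }"
      fi
    fi
    prev=$line
  done
}

filter_links <<'EOF'
[1] slsk://userA/music/keep me.mp3
Size: 7071KB Bitrate: 320 Length: 4:10 Queue: 0 Speed: 9421 Free: Y filetype: mp3
[2] slsk://userB/music/too small.mp3
Size: 4000KB Bitrate: 320 Length: 3:00 Queue: 0 Speed: 9421 Free: Y filetype: mp3
[3] slsk://userC/music/zero length.mp3
Size: 10607KB Bitrate: 320 Length: 0:00 Queue: 0 Speed: 22322 Free: Y filetype: mp3
EOF
```

With a real file this would be called as `filter_links < result.txt`.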

Better option might be a language with regex and proper logic idioms. Perl or awk might be a good start.
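
As a rough idea of what the awk route could look like, here is a sketch under the same assumption that each attribute line starts with "Size:" and follows its link line. The wrapper name `awk_filter` is made up. awk splits the line into fields, so the values can be compared as numbers directly.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around awk; reads records on stdin.
awk_filter() {
  awk '
    /^Size:/ {
      size = $2 + 0     # "7071KB" + 0 gives 7071 (leading digits are used)
      rate = $4 + 0
      len  = $6
      if (size >= 5000 && size <= 30000 && rate >= 320 && len != "0:00") {
        sub(/^\[[0-9]+\] /, "", prev)   # drop the "[NNN] " index
        print prev
      }
    }
    { prev = $0 }       # remember every line so the link is available next time
  '
}

printf '%s\n' \
  '[1] slsk://userA/music/keep me.mp3' \
  'Size: 7071KB Bitrate: 320 Length: 4:10 Queue: 0 Speed: 9421 Free: Y filetype: mp3' \
  '[2] slsk://userB/music/low bitrate.mp3' \
  'Size: 5275KB Bitrate: 160 Length: 4:30 Queue: 26 Speed: 16184 Free: N filetype: mp3' \
  | awk_filter
```

With a real file this would be called as `awk_filter < result.txt`.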

Scottish_Jason · 11-25-2014, 11:28 PM

Quote:

Originally Posted by norobro

Try this:

Code:

grep -B 1 -P '[0-9][0-9][0-9][0-9][0-9K][KB][\s,B].*[0-9][0-9][\s,0-9].*(?!0:00)' result.txt | grep 'slsk' -C 1

thanks again, but that never worked either... it just spat the whole text file out by the looks of it

Scottish_Jason · 11-25-2014, 11:29 PM

Quote:

Originally Posted by syg00

Hmmm, lots of possible corner cases.
If it must be done in bash, I'd probably extract the numeric values into an array, and do real arithmetic tests on the values. sed or grep can do the extraction easily.

Better option might be a language with regex and proper logic idioms. Perl or awk might be a good start.

I was thinking about doing it that way but came to the conclusion it might be over my head. I am fairly new to bash and regex and have never used perl etc. Only C+