[SOLVED] Using Sed to recognise second value in TSV text file... remove all others

hector00 · 08-10-2013, 05:35 AM

Hi Gurus,
I've been at sed again and am not getting very far.
I have loads of text files (which represent dictionaries of inverted text indexes), the content of which looks like this:

Code:

475470
#term	doc freq	idx
carbendacime	1	114569
carbendacime35	1	114570
carbendazim	1	114571
carbene	5	114572
carbeni	5	114573
carbenicillin	4	114574
carbenoxolone	1	114575
carbethoxypsoralen	1	114576

Here I only care about the first and second tokens, which are term and doc freq, e.g. carbendacime and 1, carbendacime35 and 1, carbendazim and 1, etc.

I would like to use sed to identify all terms which have a doc freq value of >= 10, and then print each such (term, doc freq) tuple to a new text file.

Any advice on whether to even use sed as opposed to awk would be greatly appreciated.
Thank you
Lewis

Firerat · 08-10-2013, 09:54 AM

awk would be much easier.

The fields look tab-separated, so use a tab as the field separator (FS):

Code:

awk -F $'\t' '($2 >= 10 ) {printf "%s\t%s\n",$1,$2}' Input

Note:
your sample data will only return the header, since none of the sample terms has a doc freq of 10 or more.
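The header gets through because "doc freq" is not a number, so awk compares it to 10 as a string, and "d" sorts after "1". Here is a small sketch of that behaviour. The file name sample.tsv and the row with a count of 14 are made up for illustration, and -F '\t' is used instead of $'\t' so it also runs in shells without $'...' quoting. The second command forces a numeric comparison with $2 + 0, which is an alternative not taken from this thread.

```shell
# Hypothetical sample file, tab-separated, same layout as in the question.
# The carbenicillin count of 14 is invented so that one data row qualifies.
printf '475470\n#term\tdoc freq\tidx\ncarbene\t5\t114572\ncarbenicillin\t14\t114574\n' > sample.tsv

# Plain comparison: the non-numeric header field is compared as a string
# ("doc freq" >= "10" is true), so the header line is printed too.
result=$(awk -F '\t' '($2 >= 10) {printf "%s\t%s\n", $1, $2}' sample.tsv)
printf '%s\n' "$result"

# Adding 0 forces a numeric comparison: "doc freq" + 0 is 0, so the header drops out.
numeric=$(awk -F '\t' '($2 + 0 >= 10) {printf "%s\t%s\n", $1, $2}' sample.tsv)
printf '%s\n' "$numeric"
```

The first command prints the header plus the qualifying row; the second prints only the qualifying row.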

If you don't want the header:

Code:

awk -F $'\t' '(!/#term/ && $2 >= 10) {printf "%s\t%d\n",$1,$2}' Input

If you are not fussed about the output having tabs, then:

Code:

awk -F $'\t' '(!/#term/ && $2 >= 10) {print $1" "$2}' Input
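To cover the original request (many input files, results written to a new text file), the header-skipping filter can be wrapped in a shell loop with output redirection. This is only a rough sketch: the dicts directory, the .txt and .filtered suffixes and the sample rows are all hypothetical, so adjust them to the real layout.

```shell
# Hypothetical input files; names and the rows with counts of 10 and 14 are invented.
mkdir -p dicts
printf '475470\n#term\tdoc freq\tidx\ncarbene\t5\t114572\ncarbenicillin\t14\t114574\n' > dicts/a.txt
printf '2\n#term\tdoc freq\tidx\nzinc\t10\t1\nzircon\t9\t2\n' > dicts/b.txt

# For each dictionary, write the qualifying term/doc freq pairs to a sibling
# file with a .filtered suffix (e.g. dicts/a.txt -> dicts/a.filtered).
for f in dicts/*.txt; do
    awk -F '\t' '(!/#term/ && $2 >= 10) {printf "%s\t%d\n", $1, $2}' "$f" > "${f%.txt}.filtered"
done

# Show what was written.
cat dicts/*.filtered
```

Writing one output file per input keeps the results traceable to their source dictionary; redirecting with >> to a single file would merge them instead.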

hector00 · 08-11-2013, 05:10 PM

This is absolutely dynamite and exactly what I was after.
I am amazed at how powerful awk is, but the syntax throws me off every time.
Thank you so much for the verbose answer... really appreciated.
Best
Lewis

chrism01 · 08-11-2013, 08:57 PM

There's a good awk HOWTO here: http://www.grymoire.com/Unix/Awk.html

hector00 · 08-11-2013, 09:20 PM

Thanks chrism01.