[SOLVED] limiting grep's return

atjurhs · 09-06-2012, 11:05 AM

Hi guys,

I have a bunch of files that have many columns of space-separated data. I need to search through all those files and find which ones have a specific numerical string. That's easy to do with the awk command, but awk will list out all those files that have that string in any column or is a part of a larger number. Well, that doesn't work for me.

So I'm trying to write an awk script that uses grep to search only a user-specified column of data, where the match must not be part of a larger number. For that second requirement, I'm trying a condition that looks for a space both before and after the numerical string.

I'm thinking this should be fairly simple, and done many times before, but it's not for me.

thankz, Tabitha

kabamaru · 09-06-2012, 11:20 AM

I'm not an AWK guru by any stretch but you can specify the field (column) and exact numerical string like this:

Code:

awk '$NUMBER_OF_COLUMN == NUMERICAL_STRING' file1 file2 file3

example (5th column, number "546"):

Code:

awk '$5 == 546' file1 file2 file3

This will only match lines where the 5th column is equal to that number.
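
To see what that test does and does not match, here is a small sketch on made-up data (the file name sample.txt and its contents are assumptions for illustration only). One thing worth knowing: `$5 == 546` is a numeric comparison, so a field such as 546.0 should also match. Quoting the value as "546" should force an exact string match instead.

```shell
# Hypothetical sample data: five space-separated columns per line.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
a b c d 546
a b c d 5460
a b c d 1546
a 546 c d 7
a b c d 546.0
EOF

# Numeric comparison: larger numbers and other columns are ignored,
# but 546.0 counts as equal to 546.
numeric=$(awk '$5 == 546' "$tmp/sample.txt")

# String comparison: only a field that is exactly "546" matches.
exact=$(awk '$5 == "546"' "$tmp/sample.txt")

printf '%s\n' "$numeric"
echo "---"
printf '%s\n' "$exact"
rm -rf "$tmp"
```

If the data only ever contains plain integers, the two forms behave the same and either one will do.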

kabamaru · 09-06-2012, 12:46 PM

And, to just print the files that match the criteria:

Code:

awk '$5 == 546 { print FILENAME }' file1 file2 file3 | uniq

...or the filenames and the matching lines:

Code:

awk '$5 == 546 { print FILENAME, ":", $0 }' file1 file2 file3
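
As an alternative to piping through uniq, the de-duplication can be done inside awk itself. This is only a sketch on made-up files (one.txt, two.txt, three.txt and the array name seen are assumptions, not something from the posts above):

```shell
# Hypothetical sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'a b c d 546\na b c d 546\n' > "$tmp/one.txt"
printf 'a b c d 1\n'                > "$tmp/two.txt"
printf 'a b c d 546\n'              > "$tmp/three.txt"

# seen[FILENAME]++ is 0 (false) the first time a file matches,
# so !seen[FILENAME]++ is true only once per file.
names=$(cd "$tmp" && awk '$5 == 546 && !seen[FILENAME]++ { print FILENAME }' one.txt two.txt three.txt)

printf '%s\n' "$names"
rm -rf "$tmp"
```

Here one.txt has two matching lines but should be printed only once, and two.txt should not appear at all.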

atjurhs · 09-06-2012, 01:53 PM

To loop over a hundred-plus files, I can't list them out, so can I do a while/do/done loop like this:

Code:

i=1
numfiles=10000
# choosing a numfiles value that will be much much greater than the actual number of files in the directory
while ((i<=numfiles))
do
   awk '$5 == 546 { print FILENAME }' file1 file2 file3 | uniq   
((i=i+1))
done

I'm not able to test this right now, maybe not till Friday or Monday, so that's why I'm asking and not testing it out

kabamaru · 09-06-2012, 03:16 PM

No, that loop will execute the same command 10000 times on the same files (file1, file2, file3).

Why not use bash's filename expansion through wildcards? That way awk is only called once.
If you want all the files in a directory to be processed:

Code:

awk '$5 == 546 { print FILENAME }' * | uniq

or all the *.txt files in the directory:

Code:

awk '$5 == 546 { print FILENAME }' *.txt | uniq

If you want to put this in a script:

Code:

#!/bin/sh
# Filename: myscript.sh

awk '$5 == 546 { print FILENAME }' "$@" | uniq

make it executable:

Code:

chmod +x myscript.sh

and run the script like this (examples):

Code:

./myscript.sh file_1 file_2 file_3 file_n

or (all files)

Code:

./myscript.sh *

or (all .txt, .log, and .conf files):

Code:

./myscript.sh *.{txt,log,conf}
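
Since the original question asked for a user-specified column, here is one possible variation that takes the column and the value as arguments rather than hard-coding 5 and 546. The script name findcol.sh, the variable names col and val, and the sample files are all hypothetical. The sketch passes the shell values into awk with -v.

```shell
tmp=$(mktemp -d)

# Hypothetical script: argument 1 is the column, argument 2 the value,
# and everything after that is a file to search.
cat > "$tmp/findcol.sh" <<'EOF'
#!/bin/sh
col=$1
val=$2
shift 2
awk -v col="$col" -v val="$val" '$col == val { print FILENAME }' "$@" | uniq
EOF

# Made-up sample files.
printf '1 2 3 4 546\n1 2 3 4 546\n' > "$tmp/a.txt"
printf '1 546 3 4 5\n'              > "$tmp/b.txt"

in_col5=$(cd "$tmp" && sh ./findcol.sh 5 546 a.txt b.txt)
in_col2=$(cd "$tmp" && sh ./findcol.sh 2 546 a.txt b.txt)

echo "column 5: $in_col5"
echo "column 2: $in_col2"
rm -rf "$tmp"
```

Inside awk, `$col` means "the field whose number is stored in col", so the same program works for any column.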

atjurhs · 09-06-2012, 08:41 PM

very cool! thanks!!!

David the H. · 09-07-2012, 03:05 AM

Could you please remove the long, unbroken "###" lines from your post? They do nothing but make my browser window side-scroll. Thanks.

When posting questions about processing text files, it usually helps to provide an actual example of the input text, and what the output needs to be. Also post any commands you've already tried, so that we can see what you're thinking.

I say this because awk is a full scripting language of its own and capable of doing very exact matches. When you say something like "awk will list out all those files that have that string in any column or is a part of a larger number", that usually just means you haven't used the right awk command.
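
As a rough illustration of the difference (the file data.txt and its contents are invented for this sketch), an unanchored pattern matches the string anywhere on the line, while a pattern anchored to one field does not:

```shell
tmp=$(mktemp -d)
cat > "$tmp/data.txt" <<'EOF'
10 20 30 40 546
10 546 30 40 50
10 20 30 40 15468
EOF

# Unanchored regex on the whole line: counts every line containing "546",
# in any column or inside a larger number.
loose=$(awk '/546/ { n++ } END { print n + 0 }' "$tmp/data.txt")

# Regex anchored to the start and end of field 5: only an exact "546" there.
strict=$(awk '$5 ~ /^546$/' "$tmp/data.txt")

echo "loose matches: $loose"
echo "strict match:  $strict"
rm -rf "$tmp"
```

The loose form should report all three lines, the strict form only the first.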

atjurhs · 09-11-2012, 10:36 AM

Kabamaru, I ended up using:

Code:

#!/bin/sh

awk '$5 == 546 { print FILENAME }' *.txt | uniq > out.txt

and it works really really well, thanks soooo much!

I'd like to add one more piece to the code, a line counter. Right now the output file just lists:

fileABC.txt
fileDEF.txt
fileIJK.txt
fileQRS.txt
fileXYZ.tkt

Is there a way to run this script with a line counter, so that the number of occurrences of matching lines in each file is also listed in the output file? Something like this:

fileABC.txt 5938
fileDEF.txt 13
fileIJK.txt 19
fileQRS.txt 3984
fileXYZ.tkt 6105

This way I know not to waste my time on fileDEF.txt and fileIJK.txt, and can spend more time analyzing the other files because they have more data for me to use.

kabamaru · 09-11-2012, 11:39 AM

How about this one? Put this in a file e.g. "myscript.awk":

Code:

#!/usr/bin/awk -f

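# For each matching line: when the file changes, print the name and
# count of the finished file, then start counting again.
$5 == 546 { if (FILENAME != last && last != "") {
                print last, count
                count = 0
            }
            count++
            last = FILENAME
          }

# Print the count for the last file (skipped if nothing matched at all).
END { if (last != "") print last, count }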

First make it executable, and then run it like this:

Code:

./myscript.awk *.txt

Replace the column number and the search value (5 and 546 here) appropriately.
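
Another way to get the same per-file counts is an associative array indexed by file name. This is only a sketch, not the script above: the array name count is reused for convenience, and the files a.txt, b.txt and c.txt are made up. Because `for (f in count)` visits the names in no guaranteed order, the output is passed through sort.

```shell
tmp=$(mktemp -d)
printf 'x x x x 546\nx x x x 546\nx x x x 9\n' > "$tmp/a.txt"
printf 'x x x x 9\n'                           > "$tmp/b.txt"
printf 'x x x x 546\n'                         > "$tmp/c.txt"

# count[FILENAME] accumulates the matches for each file;
# the END block prints one "name count" line per file that matched.
counts=$(cd "$tmp" && awk '
$5 == 546 { count[FILENAME]++ }
END       { for (f in count) print f, count[f] }
' a.txt b.txt c.txt | sort)

printf '%s\n' "$counts"
rm -rf "$tmp"
```

Files with no matching lines (b.txt here) never get an array entry, so they are left out of the output, just as with the script above.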

atjurhs · 09-11-2012, 03:52 PM

It ran successfully with:

Code:

 ./myscript.awk *.txt | uniq > out.txt

I'm wondering if there is a way to put the

Code:

 *.txt | uniq > out.txt

inside the myscript.awk ?

I tried putting it at the end, much like you did in your 5th post, but that didn't work.

Tabitha

kabamaru · 09-11-2012, 04:29 PM

That's easy. You can put all this in a shell script with your awk program enclosed within single quotes:

Code:

#!/bin/sh

awk '
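# For each matching line: when the file changes, print the name and
# count of the finished file, then start counting again.
$5 == 546 { if (FILENAME != last && last != "") {
                print last, count
                count = 0
            }
            count++
            last = FILENAME
          }

# Print the count for the last file (skipped if nothing matched at all).
END { if (last != "") print last, count }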
' *.txt > out.txt

Btw you don't need 'uniq' anymore, as every output line will inevitably be unique ;-)

atjurhs · 09-11-2012, 04:52 PM

aaaargh!!!

I almost got that, only I placed the single quote after

Code:

last = FILENAME
}'

thinking it should close before the END. I'm almost getting this stuff, just barely missing

thanks so so much, here's a virtual hug

kabamaru · 09-12-2012, 04:18 AM

And you can sort the results by number of occurrences (descending order) by replacing

Code:

> out.txt

with

Code:

| sort -k2nr > out.txt

Output:

Code:

fileXYZ.tkt 6105
fileABC.txt 5938
fileQRS.txt 3984
fileIJK.txt 19
fileDEF.txt 13
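
For anyone who wants to try the sort step on its own, here is a small sketch that feeds the example counts from earlier in the thread through sort. The variable names unsorted and sorted are just for illustration, and `-k2,2nr` is a slightly stricter spelling of the key that stops at the end of the second field.

```shell
# Unsorted counts copied from the example output earlier in the thread.
unsorted='fileABC.txt 5938
fileDEF.txt 13
fileIJK.txt 19
fileQRS.txt 3984
fileXYZ.tkt 6105'

# -k2,2 uses only the second field as the key; n = numeric, r = largest first.
sorted=$(printf '%s\n' "$unsorted" | sort -k2,2nr)

printf '%s\n' "$sorted"
```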

Cheers.

atjurhs · 09-12-2012, 08:52 AM

cool!!!