Old 02-24-2008, 07:55 AM   #1
Clutch2
LQ Newbie
 
Registered: Feb 2008
Location: Northern Michigan
Distribution: Debian on Raspberry Pi's
Posts: 14

Rep: Reputation: 0
Limiting how deep in file grep searches


I have a script that looks for a certain header entry that breaks the statistics script I am running. Unfortunately grep scans each of many files all the way through, even though the offending entry is either in the first 10 or so lines or not in the file at all.

I looked at the man page and didn't see any option to search only the first so many bytes or lines of a file.

Any ideas?

Thanks,

Clutch
 
Old 02-24-2008, 08:01 AM   #2
b0uncer
LQ Guru
 
Registered: Aug 2003
Distribution: CentOS, OS X
Posts: 5,131

Rep: Reputation: Disabled
If you want to search only the first N lines (or the last N), you can use head (or tail) to feed them to grep. For example, with N = 15:
Code:
head -15 myfile | grep yourpattern
tail -15 myfile | grep yourpattern
Haven't tested if it's faster, but could be.
 
Old 02-24-2008, 08:14 AM   #3
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
Code:
sed -n '5q; /word/p' filename

Prints lines containing "word", quitting when it reaches line 5.
 
Old 02-24-2008, 08:21 AM   #4
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Code:
:time grep oo ~/.slrn/var/killfiled* >/dev/null

real    0m0.092s
user    0m0.084s
sys     0m0.008s

:time head -20 ~/.slrn/var/killfiled* | grep oo >/dev/null

real    0m0.008s
user    0m0.000s
sys     0m0.008s


:time sed -n '/oo/p;20q' ~/.slrn/var/killfiled* >/dev/null

real    0m0.004s
user    0m0.000s
sys     0m0.000s
Hardly solid benchmarking, but that might give an idea. (I redirect to /dev/null because otherwise I'd be timing the terminal drawing time of the avalanche of stuff grep spits out.)

-- Crap. Pixellany beat me to it while I was 'benchmarking'. Ah well - at least the timings might be interesting.

Last edited by slakmagik; 02-24-2008 at 08:22 AM.
 
Old 02-24-2008, 08:42 AM   #5
Clutch2
LQ Newbie
 
Registered: Feb 2008
Location: Northern Michigan
Distribution: Debian on Raspberry Pi's
Posts: 14

Original Poster
Rep: Reputation: 0
Sweet!

head -15 * | grep mypattern works! I'll have to do timings to see how much faster.

Since this is a bunch of Usenet messages that leafnode stored, is there a way to start the search from a given point in the incrementing numeric filenames?

Just to be honest, I'm running this on W2k using Cygwin, though I do have an FC7 box.

BTW, I know there is a command to time a job; what is it?

Clutch

Last edited by Clutch2; 02-24-2008 at 08:44 AM.
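One way to start from a given article number, assuming leafnode keeps each article as a plain numeric filename in the group's spool directory (the path and the 5000 cutoff below are made-up placeholders):
Code:
# search only articles numbered 5000 and above, first 15 lines each
for f in /var/spool/news/alt.example/*; do
    n=${f##*/}                                # strip the directory part
    case $n in ''|*[!0-9]*) continue ;; esac  # skip non-numeric names
    [ "$n" -ge 5000 ] || continue
    head -15 "$f" | grep -q 'mypattern' && echo "$f"
done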
 
Old 02-24-2008, 12:31 PM   #6
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
Would you believe.........time!!

e.g.:
Code:
time find / -name rumplestiltskin

will tell you how long it takes your computer to learn that that name is nowhere in the system. What a waste, because you already knew that....
 
Old 02-24-2008, 07:33 PM   #7
JWPurple
Member
 
Registered: Feb 2008
Posts: 67

Rep: Reputation: 17
Remember that the first run is probably taking the biggest hit just to load the file data into cache. Subsequent runs of the same command are likely to take much less time because the data is already in cache. So ignore the first run.
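On Linux you can also empty the caches between runs instead of rebooting. A rough sketch (not from this thread; requires root and a kernel that provides /proc/sys/vm/drop_caches):
Code:
sync                               # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes
time grep mypattern myfile > /dev/null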
 
Old 02-25-2008, 08:22 AM   #8
Clutch2
LQ Newbie
 
Registered: Feb 2008
Location: Northern Michigan
Distribution: Debian on Raspberry Pi's
Posts: 14

Original Poster
Rep: Reputation: 0
Well, using head actually increased my times, since I was grepping files from a text-based newsgroup. If it had been a binary group, it would likely have rocked.

On the sed example, can the filename be a wildcard?

Clutch
 
Old 02-25-2008, 10:20 AM   #9
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Not so far as I know. You could do a loop but that would probably kill any time savings. I don't understand how head increased your times, though, or even why text vs. binary would relate to that.

(Based on more crappy benchmarks, a for loop with sed is way slower than a full grep, and a full grep is slower than a partial grep with head.)
 
Old 02-25-2008, 11:34 AM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Based on more crappy benchmarks, a for loop with sed is way slower than a full grep, and a full grep is slower than a partial grep with head.
What about a find with sed?
Code:
time find ~/.slrn/var -type f -name 'killfiled*' -exec sed -n '/oo/p;20q' {} \;
[edit]
Btw, if the directory structure had a set depth you COULD use sed with wildcards ...

With e.g. ~/news/altlinuxos/2001/, ~/news/slackwareos/2003/, ~/news/awk/2005/ and so on, you could simply do
Code:
sed -n '/oo/p;20q' ~/news/*/*/*
[/edit]


Cheers,
Tink

Last edited by Tinkster; 02-25-2008 at 11:42 AM. Reason: [edit]
 
Old 02-25-2008, 11:37 AM   #11
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by digiot View Post
Not so far as I know. You could do a loop but that would probably kill any time savings. I don't understand how head increased your times, though, or even why text vs. binary would relate to that.
Well .. if it WAS binary data head would try to find n occurrences of a LF .... which MAY be few and far between in binaries. With some bad luck it may need to grep through the whole 1 MB.



Cheers,
Tink
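If binary files were the concern, capping by bytes instead of lines would sidestep the LF issue. GNU head can limit by bytes rather than lines; the 8 KB figure here is an arbitrary placeholder:
Code:
head -c 8192 myfile | grep mypattern   # scan at most the first 8 KB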
 
Old 02-25-2008, 11:51 AM   #12
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Quote:
Originally Posted by Tinkster View Post
What about a find with sed?
Wow. Time for me to take a break and get some rest.

Quote:
Originally Posted by Tinkster View Post
Well .. if it WAS binary data head would try to find n occurrences of a LF .... which MAY be few and far between in binaries. With some bad luck it may need to grep through the whole 1 MB.
Yeah, good point. So it does relate, but in the reverse sense. 'Course, grep and sed would be in the same boat, I think. Unless that's just the tired talking again.
 
Old 02-25-2008, 02:18 PM   #13
Clutch2
LQ Newbie
 
Registered: Feb 2008
Location: Northern Michigan
Distribution: Debian on Raspberry Pi's
Posts: 14

Original Poster
Rep: Reputation: 0
find -mtime 8 seems to find the files I'm interested in, but it only reports the file names.
That part takes 28 seconds.

How do I connect it to grep in order to get grep to scan each filename output by the find command?

Grepping every file takes about 11 minutes with grep alone, and 15 minutes using head | grep.


Thanks,
Clutch
 
Old 02-25-2008, 02:33 PM   #14
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743Reputation: 743Reputation: 743Reputation: 743Reputation: 743Reputation: 743Reputation: 743
Look at the -exec option to find.

"man find" for more than you ever wanted to know...
 
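For instance, a sketch combining -exec with the head trick from earlier in the thread (untested; 'mypattern' and the 8-day age are placeholders):
Code:
# print names of files modified 8 days ago whose first 15 lines match
find . -mtime 8 -type f -exec sh -c \
    'head -15 "$1" | grep -q mypattern && echo "$1"' sh {} \;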
Old 02-25-2008, 03:27 PM   #15
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
If the "depth" of the match is indeterminate at the beginning of the run, maybe use perl.
I did some tests on scanning big files a while back, and perl ran faster than sed with quit (rebooting between every run to avoid cache effects).
If the record count isn't really high (potentially in the hundreds of thousands to millions), it's probably not worth the effort.
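Something along these lines, perhaps; a sketch only, with the pattern and the 20-line cutoff as placeholders (closing ARGV resets $. and skips to the next file):
Code:
perl -ne 'print "$ARGV: $_" if /mypattern/;
          close ARGV if $. >= 20 || eof;' *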
 
  

