Old 04-05-2023, 06:30 PM   #1
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684
Blog Entries: 31

Rep: Reputation: 176
grepping millions of files faster


I want to search among millions of files for ones that contain a specific string. If a file contains that string in an interesting way, it will be within the first 512 bytes of the file. So what I would like to find is a grep program that can be told (by option, environment variable, configuration file, whatever) to give up after 512 bytes (or some higher number if the choices are limited).

I have already considered copying a limited-size piece of each file that is larger than 512 bytes, but I've ruled that out for performance reasons (performance being the reason I'm looking for this in the first place).

The grep command does have a limit feature, the -m option, but that limits the number of matches; it does not stop grep from reading all of a file that never matches.

So I am looking for a better form of grep with this ability.
 
Old 04-05-2023, 07:02 PM   #2
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,278

Rep: Reputation: 1694
If I am understanding all of this correctly, this will work:

Code:
#!/bin/bash

SEARCH_STRING="search string"
MAX_BYTES=512

# Use find -print0 with a null-delimited read so filenames containing
# spaces or newlines survive; for file in $(find ...) would split on them.
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
  if head -c "$MAX_BYTES" "$file" | grep -q "$SEARCH_STRING"; then
    echo "$file contains $SEARCH_STRING"
  fi
done
As for performance? I have no idea. There is probably a more 'one-liner-y' way to do this. If you need that, just reach back out.
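
Something like this, maybe (an untested sketch using the same head/grep idea; the directory and string are placeholders, and it still forks two processes per file, so no performance promises):

Code:
# Print the name of each file whose first 512 bytes contain the string.
find /path/to/directory -type f -exec sh -c \
  'head -c 512 "$1" | grep -q "search string" && printf "%s\n" "$1"' sh {} \;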
 
Old 04-05-2023, 07:04 PM   #3
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,699

Rep: Reputation: 5895
binwalk might work for your needs. It can search for a raw byte sequence, and you can limit the number of bytes it scans.
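
If your version's flags match, the invocation might look something like this (a sketch; -R/--raw scans for a raw byte sequence and -l/--length limits how many bytes are scanned, but check binwalk --help first):

Code:
# Scan only the first 512 bytes of each file for the raw sequence.
find /path/to/directory -type f -exec binwalk -R "search string" -l 512 {} +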
 
1 member found this post helpful.
Old 04-05-2023, 07:11 PM   #4
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,278

Rep: Reputation: 1694
binwalk is a cool program. Nice suggestion.

FWIW: when I installed it on my Debian system just now, it had to install 491 MB of other packages to make it work!
 
Old 04-05-2023, 07:17 PM   #5
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,699

Rep: Reputation: 5895
I've never used the program, but it is designed to scan binary files looking for specific signatures, so it makes sense that there are a bunch of other dependencies.
 
Old 04-05-2023, 07:47 PM   #6
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,222

Rep: Reputation: 5320
At this scale? It's time to feed them into ElasticSearch.
 
Old 04-05-2023, 11:15 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
What's wrong with dd? Use it to read a single sector.
No way to stop all this polluting the system, though.

One-off or regular/intermittent requirement?
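
A sketch of that approach (path and pattern are placeholders):

Code:
# Read just the first 512-byte block of each file and grep only that.
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
  dd if="$file" bs=512 count=1 2>/dev/null | grep -q "search string" &&
    printf '%s\n' "$file"
done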
 
Old 04-06-2023, 12:14 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,836

Rep: Reputation: 7308
Anyway, if there are really millions of files to scan, better to forget the shell and grep run one file at a time.
Theoretically you can use dd to read only 512 bytes, but that will be slow again (because you have to fork dd for every file).
A much faster solution would be a language that can read the files directly and check the content without any external tool,
like Python or Perl.
But simply grep -m 1 -r <pattern> <dir> might work for you.
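
Spelled out (a sketch): with -l, grep prints just each matching file's name and stops reading that file at its first match, so -m 1 is effectively implied.

Code:
# Recurse, print only the names of matching files,
# and stop reading each file at its first match.
grep -r -l -m 1 -- 'pattern' /path/to/dir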
 
Old 04-06-2023, 08:31 AM   #9
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,222

Rep: Reputation: 5320
Also: solid-state drive or platter drive? If it's a platter drive, then the seeks are going to be a huge bottleneck.
 
Old 04-06-2023, 07:53 PM   #10
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by dugan
Also: solid-state drive or platter drive? If it's a platter drive, then the seeks are going to be a huge bottleneck.
Right, and it is a few spinning platters for now. I'm hoping to get this onto solid state next year.
 
Old 04-06-2023, 07:53 PM   #11
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by syg00
What's wrong with dd? Use it to read a single sector.
No way to stop all this polluting the system, though.

One-off or regular/intermittent requirement?
Nothing really wrong with it, but it's kind of like head: it involves piping the data between two processes, which appears to be the best solution short of hacking grep.
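
One way to avoid the pipe entirely might be the bash builtin read -N (a sketch, with caveats: read -N counts characters, so run it under LC_ALL=C to make that bytes, and bash drops NUL bytes from binary data):

Code:
#!/bin/bash
# Read up to 512 bytes with the builtin read -N and test with a glob match;
# no head/dd/grep process is forked per file.
export LC_ALL=C
pattern="search string"
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
  read -r -N 512 chunk < "$file"
  [[ $chunk == *"$pattern"* ]] && printf '%s\n' "$file"
done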
 
Old 04-06-2023, 07:55 PM   #12
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by pan64
But simply grep -m 1 -r <pattern> <dir> might work for you.
Maybe with a larger -m.
 
Old 04-07-2023, 01:19 PM   #13
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,222

Rep: Reputation: 5320
Is this your use case?

https://www.linuxquestions.org/quest...9/#post6422884
 
Old 04-08-2023, 02:07 PM   #14
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by dugan
Is this your use case?
No.

The comments thing involves generating a string that others add to source files that are to be managed; when this key is present as a comment, the file is treated as managed. The keys can actually be inserted anywhere, but it would likely be easiest to add one as a comment at the front or back of the file. For some languages like Python, the key could also be held in a large string literal that is discarded or otherwise does not affect the program.

The grep thing is for finding files in my personal archive, which holds all kinds of files, of which maybe 5% is source code. I happen to know that what I am looking for is either in short files (easy to filter) or at the beginning of larger files (not at the end). For this I usually have to check files manually to see whether they are what I need, but I only poorly remember which strings are involved. I recently needed to find someone's name that I could not remember, but I could remember their street (the same as my own back then); in that case the files could have had lots of data appended. I could have spent a few hours working out a way to find it, but the long scan gave me a list of about 50 files, which was small enough to check manually.
 
Old 04-09-2023, 02:04 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,836

Rep: Reputation: 7308
Anyway, you can use that grep command alone, or you can implement a more suitable tool for yourself. Just remember that forking a new process (or more) for each and every file will slow this search down enormously, so better to avoid that. You ought to use a language like C, Perl, or Python, which can recognize file types, can limit the search to the beginning of each file, and lets you implement any kind of filter. bash is not really suitable for this (and awk is probably usable, but I would rather try something else).
(From my side, I don't know what's wrong with that grep; it will list all the files where the pattern was found much faster than any other solution posted here. Anyway, just tell us if you found something better.)
 
  

