Linux - General
This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
Anyway, you can use that grep command alone, or you can build a more suitable tool for yourself. Just remember that forking a new process (or more) for each file will slow the search down enormously, so it is better to avoid that. You ought to use a language like C, Perl or Python, which can recognize file types, can limit the search to the beginning of each file, and lets you implement any kind of filter. bash is not really suitable for this (awk is probably usable, but I would rather try something else).
(For my part, I don't know what's wrong with that grep; it will list all the files where the pattern was found much faster than any other solution posted here. Anyway, just tell us if you find something better.)
I had been thinking of making my own like that: prototyping it in Python, with the final version in C. But I still need to work out which grep features will be needed in the future, among those I could implement.
For now, I am on spinning platters, so head seeks at unknown points in time will make performance harder to evaluate (they will dominate the timings) and will just plain be slow.
If the bottleneck is the drive, you can use Python. But I suspect it is the extremely inefficient code you are using. Either way, if your disk is that slow you cannot speed the search up, because you still have to read those files. In that case you ought to build a database or something similar to make it significantly faster.
I use Python for almost everything these days. If something is too slow in Python, I treat that version as a prototype and redo it in C. I have needed to do that only once in the past 10 years of writing Python.
Personally, I would move to "the true programming language of your choice." Perl, PHP, Ruby, whatever.
The first line of your script is a "shebang" ... such as #!/usr/bin/perl. And, off you go. When the script is executed, the kernel reads this line, launches the named interpreter, and hands your source code over to it. The end user is none the wiser. (Nor does he even care.)
Write a program that navigates through the file hierarchy, starting with the location that you provide as the first program argument. (The directory-navigation logic is provided by the language, and every language has one ... each its own.) Your program attempts to open each file – graciously handling any refusals. Then, it reads the first 512 (or whatever) bytes from it, and then performs a regular-expression match, printing the name of every file that qualifies.
The "performance" of your program will be constrained by how fast it can navigate through the directory tree, and I would argue that you really can't improve upon this because, in the end, you are dealing with a physical device. Therefore, I see no productive benefit from "multi-threading and so forth."
In any "real [interpreted ...] programming language" that I can now think of, this task should require only a couple of days to perfect. It will get the job done, and it should run very acceptably fast. "Problem solved."
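The walk-open-read-match recipe described above can be sketched in Python roughly as follows. This is a minimal illustration, not anyone's actual code from this thread: the names scan and HEAD_BYTES, the 512-byte limit, and the command-line handling are all assumptions made for the example.

```python
import os
import re
import sys

HEAD_BYTES = 512  # how much of each file to examine; adjust to taste


def scan(root, pattern):
    """Walk the tree under root, yielding paths of files whose first
    HEAD_BYTES bytes match pattern (a compiled bytes regex)."""
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:   # binary mode: no decode errors
                    head = f.read(HEAD_BYTES)  # read only the beginning
            except OSError:                    # refusal (permissions etc.): skip
                continue
            if pattern.search(head):
                yield path


if __name__ == "__main__" and len(sys.argv) > 2:
    for hit in scan(sys.argv[1], re.compile(sys.argv[2].encode())):
        print(hit)
```

Because it reads at most HEAD_BYTES per file in a single process, it avoids both the per-file fork cost and the full-file reads that make naive shell loops slow.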
Last edited by sundialsvcs; 04-13-2023 at 07:21 PM.
I don't know. Without seeing your code we cannot say anything, but scanning millions of files will definitely take some time. You can check, for example:
Code:
time find <dir> -type f >/dev/null # just finding the files
time find <dir> -type f -exec cat {} \; >/dev/null # reading those files, this means a huge amount of cat execution
# or
time grep -r -m 1 . <dir> # . is the pattern here
to see the absolute minimal execution time. There is no way to be faster (especially on a spinning drive).
Been 8 days since the thread was launched - I wonder how many files could be grepped in that time ... ???
maybe a quarter million :-)
I was hoping there was some feature I had overlooked, or some not-so-well-known alternative implementation. But it appears I need to consider other alternatives.
The first thing I'll probably do is get the grep source and see how hard or easy it would be to add an extent feature, letting the user specify how much of each file (in bytes or larger units) grep should search. If I am successful, I would send a patch to the maintainers. Suggestions for a syntax?
grep already has -r, so I don't need to add that. For "make my own", I already have working code in both C and Python that does recursive flattening (i.e. just call it to get the next file); Python's os.walk API is rather clunky, so I never use it.
Otherwise, I'll make my own. It may be integrated with the file recursion or not. I may do the prototype in Python and the final version in C.
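The "just call to get the next file" flattening mentioned above might look roughly like this in Python. The function name next_file and the details are guesses for illustration, not the poster's actual code; it uses os.scandir with an explicit stack instead of os.walk's (dirpath, dirnames, filenames) triples.

```python
import os


def next_file(root):
    """Yield every regular file under root, one path at a time, so the
    caller sees a flat iterator instead of os.walk's nested triples."""
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            entries = os.scandir(current)
        except OSError:        # unreadable directory: skip it
            continue
        with entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)   # descend later
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Usage is then simply `files = next_file("/some/tree")` followed by repeated `next(files)` calls, or a plain for loop.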
You have given us nothing: no measurements, no sample code, no facts, no log files, no tests, just a few words about your plans.
That is completely fine by me; I just can't see any progress.
Based on this last post, I think you have little more than a wish and no idea how to realize it.
When I am planning a project that involves writing some code, I have never understood why so many people expect to see sample code before I have even decided how I will do it.
No, we do not expect that. We simply can't help improve the code if we can't examine it.
(I've given you some tips on how to measure things so you know what the expected execution time might be)