Linux - General
This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
Anyway, you can use that grep command alone, or you can build a more suitable tool for yourself. Just remember that forking a new process (or more) for each file will slow the search down enormously, so it is better to avoid that. You ought to use a language like C, Perl or Python, which can recognize file types, can limit the search to the beginning of each file, and lets you implement any kind of filter. bash is not really suitable for this (awk is probably usable, but I would rather try something else).
(For my part, I don't know what's wrong with that grep; it will list all the files where the pattern was found much faster than any other solution posted here. Anyway, just tell us if you find something better.)
I had been thinking of making my own like that: prototyping it in Python, with the final version in C. But I still need to work out which grep features will be needed in the future, among those I could implement.
For now, I am on spinning platters, so head seeks at unknown points in time will make performance harder to evaluate (they will dominate the timings) and will just plain be slow.
If the bottleneck is the drive, you can use Python. But I suspect it is the extremely inefficient code you are using. Either way, if your disk is that slow you cannot speed the search up, because you still have to read those files. In that case you ought to build a database or something similar to make it significantly faster.
I use Python for almost everything these days. If something is too slow in Python, I treat that version as a prototype and redo it in C. I have needed to do that only once in the past 10 years of writing Python.
Personally, I would move to "the true programming language of your choice." Perl, PHP, Ruby, whatever.
The first line of your script is a "shebang" ... such as #!/usr/bin/perl. And, off you go. When the script is executed, the kernel reads this line, launches the named interpreter, and hands your source code over to it. The end user is none the wiser. (Nor does he even care.)
Write a program that navigates through the file hierarchy, starting with the location that you provide as the first program argument. (The directory-navigation logic is provided by the language, and every language has one ... each its own.) Your program attempts to open each file – graciously handling any refusals. Then, it reads the first 512 (or whatever) bytes from it, and then performs a regular-expression match, printing the name of every file that qualifies.
The "performance" of your program will be constrained by how fast it can navigate through the directory tree, and I would argue that you really can't improve upon this because, in the end, you are dealing with a physical device. Therefore, I see no productive benefit from "multi-threading and so forth."
In any "real [interpreted ...] programming language" that I can now think of, this task should require only a couple of days to perfect. It will get the job done, and it should run very acceptably fast. "Problem solved."
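The walk-open-read-match recipe described above can be sketched in Python roughly as follows. This is a minimal illustration, not anyone's actual code from this thread: the names scan and HEAD_BYTES, the 512-byte limit, and the command-line handling are all assumptions made for the example.

```python
import os
import re
import sys

HEAD_BYTES = 512  # how much of each file to examine; adjust to taste


def scan(root, pattern):
    """Walk the tree under root, yielding paths of files whose first
    HEAD_BYTES bytes match pattern (a compiled bytes regex)."""
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:   # binary mode: no decode errors
                    head = f.read(HEAD_BYTES)  # read only the beginning
            except OSError:                    # refusal (permissions etc.): skip
                continue
            if pattern.search(head):
                yield path


if __name__ == "__main__" and len(sys.argv) > 2:
    for hit in scan(sys.argv[1], re.compile(sys.argv[2].encode())):
        print(hit)
```

Because it reads at most HEAD_BYTES per file in a single process, it avoids both the per-file fork cost and the full-file reads that make naive shell loops slow.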
Last edited by sundialsvcs; 04-13-2023 at 07:21 PM.
I don't know. Without seeing your code we cannot say anything, but scanning millions of files will definitely take some time. You can check, for example:
Code:
time find <dir> -type f >/dev/null # just finding the files
time find <dir> -type f -exec cat {} \; >/dev/null # reading those files, this means a huge amount of cat execution
# or
time grep -r -m 1 . <dir> # . is the pattern here
to see the absolute minimal execution time. There is no way to be faster (especially on a spinning drive).
Been 8 days since the thread was launched - I wonder how many files could be grepped in that time ... ???
maybe a quarter million :-)
I was hoping there was some feature I had overlooked, or some not-so-well-known alternative implementation. But it appears I need to consider other alternatives.
The first thing I'll probably do is get the grep source and see how hard or easy it would be to add an extent feature, letting the user specify how much of each file (in bytes or larger units) grep should search. If I am successful, I would send a patch to the maintainers. Suggestions for a syntax?
grep already has -r, so I don't need to add that. For "make my own", I already have working code in both C and Python that does recursive flattening (i.e. just call it to get the next file); Python's os.walk API is rather clunky, so I never use it.
Otherwise, I'll make my own. It may be integrated with the file recursion or not. I may do the prototype in Python and the final version in C.
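The "just call to get the next file" flattening mentioned above might look roughly like this in Python. The function name next_file and the details are guesses for illustration, not the poster's actual code; it uses os.scandir with an explicit stack instead of os.walk's (dirpath, dirnames, filenames) triples.

```python
import os


def next_file(root):
    """Yield every regular file under root, one path at a time, so the
    caller sees a flat iterator instead of os.walk's nested triples."""
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            entries = os.scandir(current)
        except OSError:        # unreadable directory: skip it
            continue
        with entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)   # descend later
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Usage is then simply `files = next_file("/some/tree")` followed by repeated `next(files)` calls, or a plain for loop.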
You have given us nothing: no measurements, no sample code, no facts, no log files, no tests, just a few words about your plans.
That is completely fine by me; I just can't see any progress.
Based on this last post, I think you have little more than a wish and no idea how to realize it.
When I am planning a project that involves writing some code, I have never understood why so many people expect to see sample code before I have even decided how I will do it.
No, we do not expect that. We simply can't help improve the code if we can't examine it.
(I've given you some tips on how to measure things so you know what the expected execution time might be)