i want to search among millions of files for the ones that contain a specific string. if a file contains that string in an interesting way, it will be within the first 512 bytes of the file. so what i would like to find is a grep program that can be told (by option, environment variable, configuration file, whatever) to give up after 512 bytes (or some higher number if choices are limited).
i have already considered the idea of copying a limited-size piece of the file if it is larger than 512 bytes. i've ruled this out for performance reasons (performance is why i'm looking for this in the first place).
the grep command does have a limit feature, option -m, but it limits how many matches are found, not how far grep reads before giving up.
so, i am looking for a better form of grep with this ability.
If I am understanding all of this correctly, this will work:
Code:
#!/bin/bash
SEARCH_STRING="search string"
MAX_BYTES=512

# -print0 with read -d '' keeps filenames containing spaces or newlines intact
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
    # feed grep only the first 512 bytes of each file
    if head -c "$MAX_BYTES" "$file" | grep -q "$SEARCH_STRING"; then
        echo "$file contains $SEARCH_STRING"
    fi
done
As for performance? I have no idea. There is probably a more 'one-liner-y' way to do this. If you need that, just reach back out.
I've never used the program, but it is designed to scan binary files looking for specific signatures, so it makes sense that it has a bunch of other dependencies.
Anyway, if there are really millions of files to scan, better to forget shell and grep'ing them one by one.
Theoretically you can use dd to read only 512 bytes, but that will be slow again (because you need to fork dd for every file).
A much faster solution would be to use a language which can read the files directly and check the content without any external tool, like python or perl.
But simply grep -m 1 -r <pattern> <dir> might work for you.
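To make the python suggestion concrete, here is a minimal sketch (the pattern, byte limit, and starting directory are placeholders, not from the thread): it walks the tree and reads at most the first 512 bytes of each file in binary mode, all in a single process, so nothing is forked per file.
Code:
#!/usr/bin/env python3
import os
import sys

SEARCH = b"search string"  # pattern as bytes, so binary files are handled too
MAX_BYTES = 512            # give up after this many bytes, per the question
ROOT = sys.argv[1] if len(sys.argv) > 1 else "."

for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                head = f.read(MAX_BYTES)  # one read(), no pipe, no fork
        except OSError:
            continue  # unreadable file: just skip it
        if SEARCH in head:
            print(path)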
What's wrong with dd? Use it to read a single sector (e.g. dd if=file bs=512 count=1).
No way to stop all this from polluting the system, tho'.
Is this a one-off or a regular/intermittent requirement?
nothing really wrong with it, but it's much the same as head: it involves piping the data between two processes, which appears to be the best solution short of hacking grep.
the comments thing involves generating a key string that other people add to source files that are to be managed; when that key is present as a comment, the file is treated as being managed. the keys can actually be inserted anywhere, but it would likely be easiest to add one as a comment at the front or back of the file. for some languages, like Python, the key could instead be held in a large string literal that is discarded or otherwise does not affect the program.
the grep thing is for finding files in my personal archive of all kinds of files, of which maybe about 5% is source code. i happen to know that what i am looking for is either in short files (easy to filter) or at the beginning of larger files (never at the end). usually i have to check files manually to see if they are what i need, because i can only poorly remember the strings involved. for example, i recently needed to find someone's name that i could not remember, but i did know the street this person lived on (the same as my own back then), and in that case the files could have had lots of data appended. i could have spent a few hours working out a way to find it, but the long scan gave me a list of about 50 files, which was small enough to check manually.
anyway, you can use that grep command alone, or you can implement a more suitable tool for yourself. Just remember that forking a new process (or more) for each and every file will slow this search down enormously, so better to avoid that. You ought to use a language like c, perl or python for this, which can recognize file types, can limit the search to the beginning of files, and lets you implement any kind of filter (a filter of that kind is sketched below). bash is not really suitable for this, and awk is probably usable, but I would rather try something else.
(from my side I don't know what's wrong with that grep; it will list all the files where the pattern was found much faster than any other solution posted here. anyway, just tell us if you found something better.)
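For illustration, filters are easy to layer onto the earlier sketch. This hypothetical variant (the size cap and extension list are invented for the example, not from the thread) skips anything too large or with the wrong name before reading the first 512 bytes, which matches the "short files, easy to filter" case described above:
Code:
#!/usr/bin/env python3
import os
import sys

SEARCH = b"search string"
MAX_BYTES = 512
MAX_SIZE = 1 << 20            # hypothetical cap: ignore files over 1 MiB
EXTS = (".c", ".py", ".txt")  # hypothetical allowlist of extensions

def wanted(entry):
    # cheap name check first, then one stat() per surviving file
    return entry.name.endswith(EXTS) and entry.stat().st_size <= MAX_SIZE

def scan(root):
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from scan(entry.path)   # recurse, ignoring symlinked dirs
        elif entry.is_file(follow_symlinks=False) and wanted(entry):
            yield entry.path

for path in scan(sys.argv[1] if len(sys.argv) > 1 else "."):
    try:
        with open(path, "rb") as f:
            if SEARCH in f.read(MAX_BYTES):
                print(path)
    except OSError:
        pass  # unreadable file: skip it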