Old 04-05-2023, 06:30 PM   #1
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684
Blog Entries: 31

Rep: Reputation: 176
grepping millions of files faster


I want to search among millions of files for ones that contain a specific string. If a file contains that string in an interesting way, it will be within the first 512 bytes of the file. So what I would like to find is a grep program that can be told (by option, environment variable, configuration file, whatever) to give up after 512 bytes (or some higher number if the choices are limited).

I have already considered copying a limited-size piece of each file that is larger than 512 bytes, but I've ruled that out for performance reasons (performance being the reason I'm looking for this in the first place).

The grep command does have a limit feature, the -m option, but that limits the number of matches; it does not stop grep from reading all of a file that never matches.

So I am looking for a better form of grep with this ability.
 
Old 04-05-2023, 07:02 PM   #2
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,278

Rep: Reputation: 1694
If I am understanding all of this correctly, this will work:

Code:
#!/bin/bash

SEARCH_STRING="search string"
MAX_BYTES=512

# Use find -print0 with a null-delimited read so filenames containing
# spaces or newlines survive; for file in $(find ...) would split on them.
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
  if head -c "$MAX_BYTES" "$file" | grep -q "$SEARCH_STRING"; then
    echo "$file contains $SEARCH_STRING"
  fi
done
As for performance? I have no idea. There is probably a more 'one-liner-y' way to do this. If you need that, just reach back out.
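
Something like this, maybe (an untested sketch using the same head/grep idea; the directory and string are placeholders, and it still forks two processes per file, so no performance promises):

Code:
# Print the name of each file whose first 512 bytes contain the string.
find /path/to/directory -type f -exec sh -c \
  'head -c 512 "$1" | grep -q "search string" && printf "%s\n" "$1"' sh {} \;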
 
Old 04-05-2023, 07:04 PM   #3
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,699

Rep: Reputation: 5895
binwalk might work for your needs. It can search for a raw byte sequence, and you can limit the number of bytes it scans.
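
If your version's flags match, the invocation might look something like this (a sketch; -R/--raw scans for a raw byte sequence and -l/--length limits how many bytes are scanned, but check binwalk --help first):

Code:
# Scan only the first 512 bytes of each file for the raw sequence.
find /path/to/directory -type f -exec binwalk -R "search string" -l 512 {} +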
 
1 member found this post helpful.
Old 04-05-2023, 07:11 PM   #4
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,278

Rep: Reputation: 1694
binwalk is a cool program. Nice suggestion.

FWIW: when I installed it on my Debian system just now, it had to install 491 MB of other packages to make it work!
 
Old 04-05-2023, 07:17 PM   #5
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,699

Rep: Reputation: 5895
I've never used the program, but it is designed to scan binary files looking for specific signatures, so it makes sense that there are a bunch of other dependencies.
 
Old 04-05-2023, 07:47 PM   #6
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,222

Rep: Reputation: 5320
At this scale? It's time to feed them into ElasticSearch.
 
Old 04-05-2023, 11:15 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
What's wrong with dd? Use it to read a single sector.
No way to stop all this polluting the system, though.

One-off or regular/intermittent requirement?
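
A sketch of that approach (path and pattern are placeholders):

Code:
# Read just the first 512-byte block of each file and grep only that.
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
  dd if="$file" bs=512 count=1 2>/dev/null | grep -q "search string" &&
    printf '%s\n' "$file"
done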
 
Old 04-06-2023, 12:14 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,836

Rep: Reputation: 7308
Anyway, if there are really millions of files to scan, better to forget the shell and grep run one file at a time.
Theoretically you can use dd to read only 512 bytes, but that will be slow again (because you have to fork dd for every file).
A much faster solution would be a language that can read the files directly and check the content without any external tool,
like Python or Perl.
But simply grep -m 1 -r <pattern> <dir> might work for you.
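
Spelled out (a sketch): with -l, grep prints just each matching file's name and stops reading that file at its first match, so -m 1 is effectively implied.

Code:
# Recurse, print only the names of matching files,
# and stop reading each file at its first match.
grep -r -l -m 1 -- 'pattern' /path/to/dir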
 
Old 04-06-2023, 08:31 AM   #9
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,222

Rep: Reputation: 5320
Also: solid-state drive or platter drive? If it's a platter drive, then the seeks are going to be a huge bottleneck.
 
Old 04-06-2023, 07:53 PM   #10
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by dugan
Also: solid-state drive or platter drive? If it's a platter drive, then the seeks are going to be a huge bottleneck.
Right, and it is a few spinning platters for now. I'm hoping to get this onto solid state next year.
 
Old 04-06-2023, 07:53 PM   #11
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by syg00
What's wrong with dd? Use it to read a single sector.
No way to stop all this polluting the system, though.

One-off or regular/intermittent requirement?
Nothing really wrong with it, but it's kind of like head: it involves piping the data between two processes, which appears to be the best solution short of hacking grep.
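
One way to avoid the pipe entirely might be the bash builtin read -N (a sketch, with caveats: read -N counts characters, so run it under LC_ALL=C to make that bytes, and bash drops NUL bytes from binary data):

Code:
#!/bin/bash
# Read up to 512 bytes with the builtin read -N and test with a glob match;
# no head/dd/grep process is forked per file.
export LC_ALL=C
pattern="search string"
find /path/to/directory -type f -print0 |
while IFS= read -r -d '' file; do
  read -r -N 512 chunk < "$file"
  [[ $chunk == *"$pattern"* ]] && printf '%s\n' "$file"
done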
 
Old 04-06-2023, 07:55 PM   #12
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by pan64
But simply grep -m 1 -r <pattern> <dir> might work for you.
Maybe with a larger -m.
 
Old 04-07-2023, 01:19 PM   #13
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,222

Rep: Reputation: 5320
Is this your use case?

https://www.linuxquestions.org/quest...9/#post6422884
 
Old 04-08-2023, 02:07 PM   #14
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by dugan
Is this your use case?
No.

The comments thing involves generating a string that others add to source files that are to be managed; when this key is present as a comment, the file is treated as managed. The keys can actually be inserted anywhere, but it would likely be easiest to add one as a comment at the front or back of the file. For some languages like Python, the key could also be held in a large string literal that is discarded or otherwise does not affect the program.

The grep thing is for finding files in my personal archive, which holds all kinds of files, of which maybe 5% is source code. I happen to know that what I am looking for is either in short files (easy to filter) or at the beginning of larger files (not at the end). For this I usually have to check files manually to see whether they are what I need, but I only poorly remember which strings are involved. I recently needed to find someone's name that I could not remember, but I could remember their street (the same as my own back then); in that case the files could have had lots of data appended. I could have spent a few hours working out a way to find it, but the long scan gave me a list of about 50 files, which was small enough to check manually.
 
Old 04-09-2023, 02:04 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,836

Rep: Reputation: 7308
Anyway, you can use that grep command alone, or you can implement a more suitable tool for yourself. Just remember that forking a new process (or more) for each and every file will slow this search down enormously, so better to avoid that. You ought to use a language like C, Perl, or Python, which can recognize file types, can limit the search to the beginning of each file, and lets you implement any kind of filter. bash is not really suitable for this (and awk is probably usable, but I would rather try something else).
(From my side, I don't know what's wrong with that grep; it will list all the files where the pattern was found much faster than any other solution posted here. Anyway, just tell us if you found something better.)
 
  

