LinuxQuestions.org
Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.

04-10-2023, 01:57 PM   #16
Skaperen
Senior Member

Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176

Quote:
Originally Posted by pan64
Anyway, you can use that grep command alone, or you can implement a more suitable tool yourself. Just remember that forking a new process (or more) for each and every file will slow this search down enormously, so it is better to avoid that. You ought to use a language like C, Perl, or Python for this, which can recognize file types, can limit the search to the beginning of files, and lets you implement any kind of filter. bash is not really suitable for this (awk is probably usable, but I would rather try something else).
(From my side, I don't know what's wrong with that grep: it will list all the files where the pattern was found much faster than any other solution posted here. Anyway, just tell us if you find something better.)
i had been thinking to make my own like that, prototyping it in Python with final in C. but i still need to contemplate what grep features will be needed in the future, among those i could implement.

for now, i am on spinning platters, so head seeks at unknown points in time will make performance harder to evaluate (because it will dominate timings) and just plain be slow.
 
04-11-2023, 12:51 AM   #17
pan64
LQ Addict

Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,849

Rep: Reputation: 7309
Quote:
Originally Posted by Skaperen
i had been thinking to make my own like that, prototyping it in Python with final in C. but i still need to contemplate what grep features will be needed in the future, among those i could implement.

for now, i am on spinning platters, so head seeks at unknown points in time will make performance harder to evaluate (because it will dominate timings) and just plain be slow.
If the drive is the bottleneck, Python will be fast enough. But I suspect the real problem is the extremely inefficient code you are using. Either way, if your disk is that slow you cannot speed up the scan itself, because those files have to be read. In that case you ought to build a database (an index) or something similar to make repeated searches significantly faster.
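
To make the "database or something similar" idea concrete, here is a minimal Python sketch, not anything posted in this thread. It assumes only the first 512 bytes of each file matter and that a plain substring search is enough. The table name `heads` and the functions `build_head_index` and `search_index` are hypothetical names chosen for illustration. File names that are not valid UTF-8 are not handled.

```python
import os
import sqlite3
import stat


def build_head_index(db_path, root, extent=512):
    """Store the first `extent` bytes of every regular file under root."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS heads "
        "(path TEXT PRIMARY KEY, mtime REAL, head BLOB)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks, FIFOs, devices
                row = con.execute(
                    "SELECT mtime FROM heads WHERE path = ?", (path,)
                ).fetchone()
                if row and row[0] == st.st_mtime:
                    continue  # unchanged since the last indexing run
                with open(path, "rb") as fh:
                    head = fh.read(extent)
            except OSError:
                continue  # unreadable or vanished file
            con.execute(
                "INSERT OR REPLACE INTO heads VALUES (?, ?, ?)",
                (path, st.st_mtime, head),
            )
    con.commit()
    return con


def search_index(con, needle):
    """Return indexed paths whose stored head contains the bytes `needle`."""
    query = "SELECT path FROM heads WHERE instr(head, ?) > 0 ORDER BY path"
    return [path for (path,) in con.execute(query, (needle,))]
```

The first run still has to touch every file, so it is no faster than a plain scan. The gain, if any, comes from later searches reading one database file instead of seeking all over the disk.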
 
04-13-2023, 06:24 PM   #18
Skaperen
Senior Member

Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
i use Python for almost everything these days. if it's too slow in Python, i consider that to be a prototype and do it over in C. i've needed to do that only once in the past 10 years of coding Python.
 
04-13-2023, 07:13 PM   #19
sundialsvcs
LQ Guru

Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
Personally, I would move to "the true programming language of your choice." Perl, PHP, Ruby, whatever.

The first line of your script is a "shebang" ... such as #!/usr/bin/perl. And, off you go. When the script is executed, the kernel reads this line, starts the named interpreter, and hands it the script to run. The end-user is none the wiser. (Nor does he even care.)

Write a program that navigates through the file hierarchy, starting with the location that you provide as the first program argument. (The directory-navigation logic is provided by the language, and every language has one ... each its own.) Your program attempts to open each file – graciously handling any refusals. Then, it reads the first 512 (or whatever) bytes from it, and then performs a regular-expression match, printing the name of every file that qualifies.

The "performance" of your program will be constrained by how fast it can navigate through the directory tree, and I would argue that you really can't improve upon this because, in the end, you are dealing with a physical device. Therefore, I see no productive benefit from "multi-threading and so forth."

In any "real [interpreted ...] programming language" that I can now think of, this task should require only a couple of days to perfect. It will get the job done, and it should run very acceptably fast. "Problem solved."
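
As a rough illustration of the program described above, here is a minimal Python sketch. It is one possible shape, not a finished tool: the function name `files_matching_head` and the default of 512 bytes are assumptions for the example, and the pattern is matched as bytes so that binary files do not need decoding.

```python
import os
import re
import stat


def files_matching_head(root, pattern, extent=512):
    """Yield paths under root whose first `extent` bytes match `pattern`.

    `pattern` is a bytes regular expression, e.g. rb"^#!/usr/bin/perl".
    """
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue  # skip symlinks, FIFOs, devices
                with open(path, "rb") as fh:
                    head = fh.read(extent)  # never reads past `extent` bytes
            except OSError:
                continue  # open refused or file vanished: move on
            if regex.search(head):
                yield path
```

Printing the result is then a loop such as `for p in files_matching_head("/some/dir", rb"search string"): print(p)`, where `/some/dir` stands in for the starting directory given as the first program argument.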

Last edited by sundialsvcs; 04-13-2023 at 07:21 PM.
 
04-14-2023, 12:25 AM   #20
pan64
LQ Addict

Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,849

Rep: Reputation: 7309
Quote:
Originally Posted by Skaperen
i use Python for almost everything these days. if it's too slow in Python, i consider that to be a prototype and do it over in C. i've needed to do that only once in the past 10 years of coding Python.
I don't know. Without checking your code we cannot say anything, but scanning millions of files will definitely take some time. You can check for example:
Code:
time find <dir> -type f >/dev/null                 # just finding the files
time find <dir> -type f -exec cat {} \; >/dev/null # reading those files; this runs one cat process per file
# or
time grep -r -m 1 . <dir>   # . is the pattern here
to see the absolute minimal execution time. There is no way to be faster (especially on a spinning drive).
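
For comparison with a Python prototype, the same two baselines can be taken from inside Python, so that the interpreter's own overhead is included. This is only a sketch: `time_listing_and_heads` is a hypothetical helper name, and the 512-byte extent is an assumption. Note that the second pass benefits from whatever the first pass left in the cache, so cold-cache numbers on a spinning drive need the caches dropped between runs.

```python
import os
import time


def time_listing_and_heads(root, extent=512):
    """Time two passes over root: listing regular files, then reading heads.

    Returns (file_count, bytes_read, list_seconds, read_seconds).
    """
    start = time.perf_counter()
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
        if os.path.isfile(os.path.join(dirpath, name))
    ]
    list_seconds = time.perf_counter() - start

    start = time.perf_counter()
    bytes_read = 0
    for path in paths:
        try:
            with open(path, "rb") as fh:
                bytes_read += len(fh.read(extent))
        except OSError:
            pass  # unreadable or vanished file: nothing counted for it
    read_seconds = time.perf_counter() - start
    return len(paths), bytes_read, list_seconds, read_seconds
```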
 
04-14-2023, 03:49 AM   #21
syg00
LQ Veteran

Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4120
Been 8 days since the thread was launched - I wonder how many files could be grepped in that time ... ???
 
04-14-2023, 09:09 PM   #22
Skaperen
Senior Member

Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by syg00
Been 8 days since the thread was launched - I wonder how many files could be grepped in that time ... ???
maybe a quarter million :-)

i was hoping there was some feature i had overlooked or some not so well known alternative implementation. but it appears i need to consider some other alternative.

the first thing i'll probably do is get the grep source and see how hard or easy it is to add an extent feature allowing the user to specify the extent (in bytes or larger units) of the file to grep in. if i am successful, then i would send a patch to the author. suggestions for a syntax?
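
One possible answer to the syntax question, sketched in Python for a prototype: accept a decimal count with an optional K, M, or G suffix (powers of 1024), similar in spirit to the size suffixes GNU head accepts for -c. This is purely a hypothetical spelling. grep has no such option today, and the name `parse_extent` is made up for the example.

```python
import re

# Multipliers for the hypothetical extent suffixes (powers of 1024).
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_extent(text):
    """Convert an extent such as "512", "4K" or "1m" to a byte count."""
    match = re.fullmatch(r"([0-9]+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"invalid extent: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

With this, an imagined invocation like `--extent=4K` would limit matching to the first 4096 bytes of each file.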

grep already has -r so i don't need to add that. i already have working code to do recursive flattening (e.g. just call to get next file) in both C and Python (its walk API is rather clunky, so i never use it) for "make my own"

else, i'll make my own. it may be integrated with file recursion or not. i may do the prototype in Python. i may do the final thing in C.
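
For readers who want to try the "just call to get next file" style without os.walk, a minimal sketch follows. It is not the poster's code, which was not shown. It assumes a generator built on os.scandir with an explicit stack, and `iter_files` is a hypothetical name.

```python
import os


def iter_files(root):
    """Yield regular files under root one at a time, without recursion."""
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)  # visit this directory later
                    elif entry.is_file(follow_symlinks=False):
                        yield entry.path
        except OSError:
            continue  # unreadable directory: skip it
```

Each `next()` call on the generator returns one path, so the file reading and matching can be written as a flat loop that knows nothing about directories.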
 
04-15-2023, 01:36 AM   #23
pan64
LQ Addict

Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,849

Rep: Reputation: 7309
Quote:
Originally Posted by Skaperen
maybe a quarter million :-)

i was hoping there was some feature i had overlooked or some not so well known alternative implementation. but it appears i need to consider some other alternative.

the first thing i'll probably do is get the grep source and see how hard or easy it is to add an extent feature allowing the user to specify the extent (in bytes or larger units) of the file to grep in. if i am successful, then i would send a patch to the author. suggestions for a syntax?

grep already has -r so i don't need to add that. i already have working code to do recursive flattening (e.g. just call to get next file) in both C and Python (its walk API is rather clunky, so i never use it) for "make my own"

else, i'll make my own. it may be integrated with file recursion or not. i may do the prototype in Python. i may do the final thing in C.
You gave us nothing. No measurements, no sample code, no facts, log files, or tests, just a few words about your plans.
That is completely OK from my side; I just can't see any progress.
Based on this last post, I think you have little more than a wish and no idea how to realize it.
 
04-15-2023, 02:32 PM   #24
MadeInGermany
Senior Member

Registered: Dec 2011
Location: Simplicity
Posts: 2,793

Rep: Reputation: 1201
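The following bash code is slow, but it uses minimal I/O:
Code:
find /path/to/directory -type f -exec /bin/bash -c '
  SEARCH_STRING="search string"
  for fn   # loops over the file names that find passes as arguments
  do
    rec=   # reset, so an unreadable file cannot reuse the previous buffer
    read -rN 512 rec < "$fn"   # read at most the first 512 characters
    case $rec in ( *"$SEARCH_STRING"* ) echo "$fn"; esac
  done
' bash.bash {} +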
 
04-16-2023, 01:34 PM   #25
Skaperen
Senior Member

Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
when i am planning a project that involves writing some code, i have never understood why so many people expect to see sample code before i have even decided how i will do it.
 
04-16-2023, 01:50 PM   #26
pan64
LQ Addict

Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,849

Rep: Reputation: 7309
Quote:
Originally Posted by Skaperen
when i am planning a project that involves writing some code, i have never understood why so many people expect to see sample code before i have even decided how i will do it.
No, we do not expect that. We simply can't help improve the code if we can't examine it.
(I've given you some tips on how to measure things so you know what the expected execution time might be)
 
  



Tags
grep


