LinuxQuestions.org
Old 03-26-2020, 06:19 PM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 267

Rep: Reputation: 6
speed up grepping a file for a long list of needles


hi,
i have a long list of needles. i need to know which ones are present inside a file.

e.g.:
Code:
$ wc -l needles.txt
3589 needles.txt

$ head -3 needles.txt
this_string_is_present
this_is_not
and_so_on

$ wc -l hay.txt
756 hay.txt

$ head -3 hay.txt
this file contains a lot of strings: this_string_is_present
some needles are present
and some are not
a simple (and SLOW) solution could be:
Code:
hay=$(< hay.txt) # store hay.txt in a variable to avoid reading the disk thousands of times

needles=()
while IFS= read -r needle; do   # -r and IFS= keep backslashes and whitespace intact
  if grep -q -- "${needle}" <<< "${hay}"; then
    needles+=( "${needle} - verified" )
  else
    needles+=( "${needle}" )
  fi
done < needles.txt
another (still pretty slow) solution could be using grep -of and comm:
Code:
sort needles.txt > sorted-needles.txt
grep -of needles.txt hay.txt | sort > verified-needles.txt

comm -23 sorted-needles.txt verified-needles.txt > unverified-needles.txt

can you suggest a better solution?
thanks!

Last edited by masavini; 03-27-2020 at 04:25 AM.
 
Old 03-26-2020, 07:57 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,589

Rep: Reputation: 3112
A simple first step would be to put all your generated files on tmpfs; on my systems that's as simple as dropping them in /tmp.
For such a small amount of data the needles and hay should stay resident in page cache anyway. If you really need the unverified list, copy it back to disk after you've finished screwing around - er, testing better schemes ...
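
A minimal sketch of that idea, assuming /dev/shm is a tmpfs mount (common on Linux, but worth checking with df -T). The directory name and the sample data are only illustrative stand-ins for the real files:

```shell
#!/usr/bin/env bash
# Sketch: keep intermediate files on tmpfs so they never touch the disk.
# Assumption: /dev/shm is tmpfs here; otherwise fall back to $TMPDIR or /tmp.
base=/dev/shm
[ -d "$base" ] && [ -w "$base" ] || base=${TMPDIR:-/tmp}

workdir=$(mktemp -d "$base/needles.XXXXXX")
trap 'rm -rf "$workdir"' EXIT   # scratch space is removed when the script ends

# Tiny stand-ins for the real needles.txt and hay.txt.
printf '%s\n' this_string_is_present this_is_not and_so_on > "$workdir/needles.txt"
printf '%s\n' 'this file contains a lot of strings: this_string_is_present' \
              'some needles are present' > "$workdir/hay.txt"

# Intermediate results stay in the scratch directory ...
sort "$workdir/needles.txt" > "$workdir/sorted-needles.txt"

# ... and only a final result would be copied back to disk, e.g.:
# cp "$workdir/sorted-needles.txt" ./sorted-needles.txt
```

Whether this helps at all depends on where the time really goes; with files this small it is quite possible that process start-up dominates, not disk access.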
 
Old 03-27-2020, 07:59 PM   #3
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,338

Rep: Reputation: 610
What's the speed of the following?
Code:
needles=()
while IFS= read -r needle
do
  needles+=( "$needle" )
done < needles.txt

needles_v=()
while IFS= read -r line
do
  for needle in "${needles[@]}"
  do
    case $line in
    ( *"$needle"* )
      needles_v+=( "$needle" )
    ;;
    esac
  done
done < hay.txt
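
Note that the loop above appends a needle once per matching line, so a needle that occurs on several lines ends up in needles_v several times. A possible variant that records each needle only once, sketched with a bash 4+ associative array; the seen array and the sample data are illustrative assumptions, not part of the original script:

```shell
#!/usr/bin/env bash
# Sketch only: same nested-loop idea, but each needle is recorded once.
# Requires bash 4+ (mapfile, declare -A).

# Stand-in data; a real run would read needles.txt and hay.txt instead.
needles_file=$(mktemp)
hay_file=$(mktemp)
printf '%s\n' this_string_is_present this_is_not and_so_on > "$needles_file"
printf '%s\n' 'strings: this_string_is_present' \
              'this_string_is_present again' \
              'and some are not' > "$hay_file"

mapfile -t needles < "$needles_file"   # one array element per line

declare -A seen=()   # needle -> 1 once it has been found
needles_v=()         # verified needles, in order of first match
while IFS= read -r line; do
  for needle in "${needles[@]}"; do
    [[ -z $needle ]] && continue              # skip blank needles
    [[ -n ${seen[$needle]} ]] && continue     # already verified
    if [[ $line == *"$needle"* ]]; then
      seen[$needle]=1
      needles_v+=( "$needle" )
    fi
  done
done < "$hay_file"

printf '%s\n' "${needles_v[@]}"
rm -f "$needles_file" "$hay_file"
```

Skipping needles that are already verified also saves a few comparisons, but this is still one pattern test per needle per line, so it is worth timing against the grep-based versions before relying on it.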
 
Old 03-27-2020, 08:35 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,559

Rep: Reputation: 1862
Quote:
Originally Posted by masavini View Post
another (still pretty slow) solution could be using grep -of and comm:
Code:
sort needles.txt > sorted-needles.txt
grep -of needles.txt hay.txt | sort > verified-needles.txt

comm -23 sorted-needles.txt verified-needles.txt > unverified-needles.txt
Using grep -F (aka --fixed-strings, aka fgrep) is a lot faster:

Code:
grep -F -of needles.txt hay.txt | sort -u > verified-needles.txt
grep -F -vf verified-needles.txt needles.txt > unverified-needles.txt
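
One thing to watch for in the second command: without -x it matches substrings, so a needle such as "foobar" would be dropped from the unverified list as soon as "foo" is verified. A small sketch with -x added; the needles and hay below are made-up test data:

```shell
#!/usr/bin/env bash
# Sketch: the -F pipeline with -x (whole-line match) on the second pass.
# File contents are made-up stand-ins for needles.txt and hay.txt.
dir=$(mktemp -d)
printf '%s\n' foo foobar baz > "$dir/needles.txt"
printf '%s\n' 'a line with foo in it' 'nothing else here' > "$dir/hay.txt"

# Pass 1: every needle occurrence found in the hay, de-duplicated.
grep -F -of "$dir/needles.txt" "$dir/hay.txt" | sort -u > "$dir/verified-needles.txt"

# Pass 2: -x makes each verified needle match only a whole line of needles.txt,
# so "foobar" is not discarded just because "foo" was verified.
grep -F -x -vf "$dir/verified-needles.txt" "$dir/needles.txt" > "$dir/unverified-needles.txt"

cat "$dir/unverified-needles.txt"
```

Also keep in mind that -o prints non-overlapping matches, so when one needle is contained in another, an occurrence may be credited to only one of them.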
 
Old 03-28-2020, 12:33 AM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 9,486

Rep: Reputation: 4223
1. Would using ripgrep instead of grep help?
2. Can you use one regular expression for all the needles?
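
For point 2, a rough sketch of what a single regular expression could look like: join the needles into one alternation and scan the hay once. It assumes the needles contain no regex metacharacters and no blank lines, which holds for the sample needles in the first post but may not hold in general:

```shell
#!/usr/bin/env bash
# Sketch for point 2: one alternation built from all needles, one pass over the hay.
# Assumption: needles contain no regex metacharacters (. * [ ] | ...) and no blank lines.
dir=$(mktemp -d)
printf '%s\n' this_string_is_present this_is_not and_so_on > "$dir/needles.txt"
printf '%s\n' 'this file contains a lot of strings: this_string_is_present' \
              'some needles are present' \
              'and some are not' > "$dir/hay.txt"

# paste -s joins the lines of needles.txt with "|": needle1|needle2|...
pattern=$(paste -sd'|' "$dir/needles.txt")

grep -oE -- "$pattern" "$dir/hay.txt" | sort -u > "$dir/verified-needles.txt"
cat "$dir/verified-needles.txt"
```

With a few thousand needles the pattern gets long, and whether this beats a fixed-string search with a pattern file is something to measure, not assume. For point 1, ripgrep accepts the same -F, -o and -f options, so it should be easy to try as a drop-in.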
 
Old 03-28-2020, 10:33 AM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,762

Rep: Reputation: 612
Quote:
Originally Posted by masavini View Post
i have a long list of needles. i need to know which ones are present inside a file.
1) Is the file huge?
2) Is the file invariant (example: the collected works of Leo Tolstoy)?

Daniel B. Martin

Last edited by danielbmartin; 03-28-2020 at 10:39 AM.
 
  


Tags
bash, grep, loop

