LinuxQuestions.org
Old 11-19-2010, 03:38 PM   #1
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Rep: Reputation: 17
[bash] find - filter files matching a list of files


Scenario
A folder contains about 290,000 HTML files, all residing in monthly subfolders and following the same naming convention. IDs are numeric and of varying length.
Code:
data/<yyyy>/<mm>/event_<id>.html
A file idnew.log contains 20,000 IDs, one per line:

Code:
12345
3456
39999
399999
I need to build a list of the files matching this list of IDs.

First attempt (time: 0.44s)
Code:
find data/ -type f |grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
I use sed to wrap each ID as _<id>. so that only filenames containing the whole ID are found.
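To illustrate why the wrapping matters, here is a minimal sketch using three made-up paths (the sample IDs are assumptions, not taken from the real data). A bare ID also matches longer IDs that contain it, while the wrapped form does not. Because grep -F treats the pattern as a fixed string, the trailing dot is a literal dot.

```shell
# Hypothetical sample paths; only the first one has the ID 3456.
paths='data/2010/11/event_3456.html
data/2010/11/event_13456.html
data/2010/11/event_34567.html'

# Bare ID: "3456" is a substring of all three IDs, so every path matches.
unanchored=$(printf '%s\n' "$paths" | grep -F -e '3456')

# Wrapped ID: "_3456." only occurs in the first path.
anchored=$(printf '%s\n' "$paths" | grep -F -e '_3456.')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
```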



I think this is pretty fast. I am just wondering whether there is a more elegant way of achieving the result, particularly since I need to return all files in data/ if idnew.log does not exist.

The only solution I can think of:
Code:
find data/ -type f >fileall.log
cp fileall.log filesome.log
if [[ -e idnew.log ]] ; then
    grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log) fileall.log >filesome.log
fi
Well, this is pretty ugly!

Last edited by hashbang#!; 11-19-2010 at 03:41 PM.
 
Old 11-19-2010, 04:54 PM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,519

Rep: Reputation: 868
If you don't need to generate fileall.log (if you do, just call tee):
Code:
find data/ -type f | {
    if [[ -e idnew.log ]] ; then
        grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
    else
        cat
    fi
} > filesome.log
Although you didn't say exactly what you mean by "elegant" or "ugly".
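For the case where fileall.log is wanted as well, the tee variant might look roughly like the sketch below. The temporary directory and the sample files are assumptions added so that it can be run anywhere. The sed expression 's/.*/_&./' is just a shorter spelling of the one used above.

```shell
#!/usr/bin/env bash
# Hypothetical sandbox with three sample files and two wanted IDs.
work=$(mktemp -d) && cd "$work" || exit 1
mkdir -p data/2010/10 data/2010/11
touch data/2010/10/event_3456.html \
      data/2010/11/event_13456.html \
      data/2010/11/event_39999.html
printf '%s\n' 3456 39999 > idnew.log

# tee saves the unfiltered list to fileall.log and passes it on down the pipe.
find data/ -type f | tee fileall.log | {
    if [[ -e idnew.log ]]; then
        grep -F -f <(sed 's/.*/_&./' idnew.log)
    else
        cat
    fi
} > filesome.log

sort filesome.log
```

With this sample data, fileall.log should list all three files and filesome.log only the two whose IDs appear in idnew.log.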

Last edited by ntubski; 11-19-2010 at 04:56 PM. Reason: mention tee
 
Old 11-20-2010, 04:42 AM   #3
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by ntubski
you didn't say exactly what you mean by "elegant" or "ugly".
How true. I didn't like the temp file (not needed otherwise).

Funny, I had tried something similar but it did not work with the process substitution:

Code:
if [[ -e idnew.log ]] ; then
    FILTER="grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)"
else
    FILTER=cat
fi
find data/ -type f | $FILTER > filesome.log
I did not realize you could have an if construct within the pipe.



Also, are there alternatives to my grep <(sed ...)?
 
Old 11-20-2010, 12:33 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,519

Rep: Reputation: 868
Quote:
Originally Posted by hashbang#!
Funny, I had tried something similar but it did not work with the process substitution:
You would need to call eval, because the contents of a variable are subject only to word splitting (and filename expansion) when expanded; special characters like "<" are taken literally. Shell scripting is just funny that way.
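A possible alternative to eval, not something proposed in this thread, is to put the filter into a shell function instead of a variable. A function body is parsed as shell code, so the process substitution is recognized. In this sketch, filter_ids is a hypothetical name and the sample data is made up:

```shell
#!/usr/bin/env bash
# Hypothetical sandbox with two sample files and one wanted ID.
work=$(mktemp -d) && cd "$work" || exit 1
mkdir -p data/2010/11
touch data/2010/11/event_3456.html data/2010/11/event_13456.html
printf '%s\n' 3456 > idnew.log

# Define the filter once, depending on whether the ID list exists.
if [[ -e idnew.log ]]; then
    filter_ids() { grep -F -f <(sed 's/.*/_&./' idnew.log); }
else
    filter_ids() { cat; }
fi

find data/ -type f | filter_ids > filesome.log
cat filesome.log
```

This keeps the pipeline on one line, as in the $FILTER attempt, without the code going through a second round of parsing.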
 
Old 11-20-2010, 06:38 PM   #5
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Code:
find ... |grep -F -f <(awk '{print "_" $0 "."}' idnew.log)
I am preprocessing the ID list with awk so I can use the lightning-fast grep -F -f.

Is there a way grep can be made to recognize any non-digit as a word boundary (a bit like the shell's $IFS)? Then the list could be used as is.
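One way to use the list as is, sketched here as an assumption rather than a tested replacement for the grep approach, is to let awk do the lookup: load the IDs into an array, split each path on "_" and ".", and compare the second-to-last field against the array. This relies on every filename ending in _<id>.html. The sample files are made up, and I have not timed this against grep -F -f.

```shell
# Hypothetical sandbox: three sample files, ID list as in the first post.
work=$(mktemp -d) && cd "$work" || exit 1
mkdir -p data/2010/11
touch data/2010/11/event_3456.html \
      data/2010/11/event_13456.html \
      data/2010/11/event_399999.html
printf '%s\n' 12345 3456 39999 399999 > idnew.log

# BEGIN: read the unmodified ID list into the array "wanted".
# Main rule: with "_" and "." as separators, the second-to-last field of
# ".../event_<id>.html" is the ID; print the path if that ID is wanted.
find data/ -type f | awk -F'[_.]' -v idfile=idnew.log '
    BEGIN { while ((getline id < idfile) > 0) wanted[id] = 1 }
    ($(NF - 1) in wanted)
' > filesome.log

sort filesome.log
```

Since the comparison is on the whole field, 3456 does not match event_13456.html and 39999 does not match event_399999.html.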

Last edited by hashbang#!; 11-21-2010 at 07:18 AM.
 
Old 11-20-2010, 06:41 PM   #6
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by ntubski
You would need to call eval because variables contents are subject only to word splitting, special characters like "<" are taken literally.
Code:
find data/ -type f | eval "$FILTER" > filesome.log
works - that's a new one to me. Whether this is any prettier (more readable) than your if within the pipe is another question.

Last edited by hashbang#!; 11-20-2010 at 06:48 PM.
 
Old 11-25-2010, 06:01 PM   #7
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
I am using sed/awk to isolate the ID in the filename to exclude partially matching IDs.

If grep's word boundaries could be redefined (in this case to non-digit), this would be superfluous.

Is there something like $IFS for grep?
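For reference, the closest built-in feature is grep -w, but it does not help with this naming convention: grep counts letters, digits and the underscore as word characters, so the "_" in front of the ID prevents a word match. As far as I know there is no grep option for changing that set. A minimal sketch with made-up filenames:

```shell
# "_" is a word character for grep -w, so 12345 is not a whole word here.
if printf '%s\n' 'event_12345.html' | grep -q -w -F '12345'; then
    w_underscore=match
else
    w_underscore=no-match
fi

# With "-" in front instead, the same pattern is a whole word.
if printf '%s\n' 'event-12345.html' | grep -q -w -F '12345'; then
    w_hyphen=match
else
    w_hyphen=no-match
fi

printf 'event_12345.html: %s\nevent-12345.html: %s\n' "$w_underscore" "$w_hyphen"
```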
 
  


All times are GMT -5.