LinuxQuestions.org


hashbang#! 11-19-2010 02:38 PM

[bash] find - filter files matching a list of files
 
Scenario
A folder contains about 290,000 html files, all residing in monthly subfolders and following the same naming convention. Id's are numeric and of varying length.
Code:

data/<yyyy>/<mm>/event_<id>.html
A file idnew.log contains 20,000 id's, one per line:

Code:

12345
3456
39999
399999

I need to build up a list of files matching this list of id's.

First attempt time: 0.44s
Code:

find data/ -type f |grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
I use sed to prefix and suffix the id: _<id>. to ensure only filenames containing the whole id are found.
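
To illustrate why the wrapping matters, here is a small sketch with made-up paths (the three file names and the id 3999 are hypothetical, not taken from the real data). A bare id is matched as a substring of longer ids, while the wrapped pattern only hits the file whose whole id is 3999.

```shell
# Hypothetical sample paths following the data/<yyyy>/<mm>/event_<id>.html convention.
files='data/2010/10/event_3999.html
data/2010/10/event_39999.html
data/2010/11/event_399999.html'

# Bare id: a fixed-string search matches every path containing "3999".
loose=$(printf '%s\n' "$files" | grep -F -e '3999')

# Wrapped id: "_3999." only matches the file whose complete id is 3999.
strict=$(printf '%s\n' "$files" | grep -F -e '_3999.')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose search returns all three paths; the strict one returns only event_3999.html.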



I think this is pretty fast. I am just wondering whether there is a more elegant way of achieving the result, particularly considering that I need to return all files in data/ if idnew.log does not exist.

The only solution I can think of:
Code:

find data/ -type f >fileall.log
cp fileall.log filesome.log
if [[ -e idnew.log ]] ; then
    grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log) fileall.log >filesome.log
fi

Well, this is pretty ugly!

ntubski 11-19-2010 03:54 PM

If you don't need to generate fileall.log (if you do, just put tee fileall.log into the pipe):
Code:

find data/ -type f | {
    if [[ -e idnew.log ]] ; then
        grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
    else
        cat
    fi
} > filesome.log

Although you didn't say exactly what you mean by "elegant" or "ugly".
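
For the case where fileall.log is wanted as well, the tee variant might look like the sketch below. It assumes bash (for [[ ]] and process substitution) and a sed that accepts -r, as in the original commands; the scratch directory and the two event files are made up so the example can run on its own.

```shell
# Scratch tree with two hypothetical event files and a one-line id list.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/2010/11"
touch "$tmp/data/2010/11/event_3456.html" "$tmp/data/2010/11/event_34567.html"
printf '%s\n' 3456 > "$tmp/idnew.log"
cd "$tmp" || exit 1

# tee writes the complete listing to fileall.log and passes it on;
# the brace group then filters it, or copies it through unchanged
# when idnew.log does not exist.
find data/ -type f | tee fileall.log | {
    if [[ -e idnew.log ]]; then
        grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
    else
        cat
    fi
} > filesome.log
```

Here fileall.log ends up with both paths and filesome.log with only the event_3456.html path.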

hashbang#! 11-20-2010 03:42 AM

Quote:

Originally Posted by ntubski (Post 4164772)
you didn't say exactly what you mean by "elegant" or "ugly".

How true. I didn't like the tmp file (it is not needed otherwise).

Funny, I had tried something similar but it did not work with the process substitution:

Code:

if [[ -e idnew.log ]] ; then
    FILTER="grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)"
else
    FILTER=cat
fi
find data/ -type f | $FILTER > filesome.log

I did not realize you could have an if construct within the pipe.



Also, are there alternatives to my grep <(sed ...)?

ntubski 11-20-2010 11:33 AM

Quote:

Originally Posted by hashbang#! (Post 4165178)
Funny, I had tried something similar but it did not work with the process substitution:

You would need to call eval because variable contents are subject only to word splitting; special characters like "<" are taken literally. Shell scripting is just funny that way. :(
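
The same effect shows up with any shell operator stored in a variable. A minimal sketch, using a pipe instead of <(...) so that it also runs in a plain POSIX sh (the CMD string is just an illustration):

```shell
# A command line containing a shell operator, stored as a plain string.
CMD='echo a | tr a b'

# Unquoted expansion: the string is only word-split, so echo receives
# the literal arguments "a", "|", "tr", "a", "b".
plain=$($CMD)

# eval re-parses the string as shell code, so "|" becomes a real pipe.
evaled=$(eval "$CMD")

printf 'plain:  %s\nevaled: %s\n' "$plain" "$evaled"
```

Without eval the operator is printed as text; with eval it is executed. Since eval runs whatever the string contains, it is only safe when the string is fully under the script's control.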

hashbang#! 11-20-2010 05:38 PM

Code:

find ... |grep -F -f <(awk '{print "_" $0 "."}' idnew.log)
I am preprocessing the id list with awk so I can use the lightning-fast grep -F -f.

Is there a way grep can be made to recognize any non-numeral as a word boundary (a bit like the shell's $IFS)? Then the list could be used as is.

hashbang#! 11-20-2010 05:41 PM

Quote:

Originally Posted by ntubski (Post 4165481)
You would need to call eval because variable contents are subject only to word splitting; special characters like "<" are taken literally.

Code:

find data/ -type f | eval $FILTER > filesome.log
works - that's a new one to me. Whether this is any prettier (more readable) than your if within the pipe is another question.

hashbang#! 11-25-2010 05:01 PM

I am using sed/awk to isolate the id in the filename to exclude partially matching id's.

If grep's word boundaries could be redefined (in this case to non-integer), this would be superfluous.

Is there something like $IFS for grep?
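
As far as I know grep has no equivalent of $IFS: the -w option uses a fixed definition of word characters (letters, digits and underscore), so the "_" in front of the id would not count as a boundary. One possible alternative, offered as an untested-on-the-real-data sketch, is to let awk do the splitting with its field separator and look the id up in an array, so idnew.log can be used as is. The sample ids and paths below are made up, and the approach assumes every file name ends in _<id>.html.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)
printf '%s\n' 12345 3456 39999 > "$tmp/idnew.log"

# Hypothetical find output; only the first and last ids are in the list.
paths='data/2010/10/event_12345.html
data/2010/10/event_1234.html
data/2010/11/event_399999.html
data/2010/11/event_39999.html'

# First input (NR == FNR): remember each id as an array key.
# Then, for each path, split on "_" and "." so that the id is the
# second-to-last field, and print the path if that id was listed.
matched=$(printf '%s\n' "$paths" |
    awk -F'[_.]' 'NR == FNR { ids[$1]; next } $(NF-1) in ids' "$tmp/idnew.log" -)

printf '%s\n' "$matched"
rm -rf "$tmp"
```

In a real pipeline the printf would be replaced by find data/ -type f. One caveat: the NR == FNR idiom misbehaves if idnew.log is empty, so that case would need a separate check.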

