LinuxQuestions.org - [bash] find - filter files matching a list of files

Scenario
A folder contains about 290,000 HTML files, all residing in monthly subfolders and following the same naming convention. IDs are numeric and of varying length.

Code:

data/<yyyy>/<mm>/event_<id>.html

A file idnew.log contains 20,000 IDs, one per line:

Code:

I need to build a list of the files whose IDs appear in this list.

First attempt (run time: 0.44s):

Code:

find data/ -type f |grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)

I use sed to wrap each id as _<id>. (underscore prefix, dot suffix) so that only filenames containing the whole id are found.
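To make the effect of the _<id>. wrapping concrete, here is a minimal sketch. The three file names and the single id are invented for illustration; only the grep/sed idiom comes from the post above.

```bash
#!/usr/bin/env bash
# Illustration only: the file names and the id below are made up.
files='data/2010/01/event_12.html
data/2010/01/event_123.html
data/2010/02/event_4123.html'

ids='123'

# Bare id: matches every name that merely contains "123".
loose=$(grep -F -f <(printf '%s\n' "$ids") <<<"$files")

# Wrapped id "_123.": matches only the file whose whole id is 123.
strict=$(grep -F -f <(printf '%s\n' "$ids" | sed -r 's/(.*)/_\1./') <<<"$files")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

With the bare id, event_4123.html is picked up as well; with the wrapped id only event_123.html remains.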

I think this is pretty fast. I am just wondering whether there is a more elegant way of achieving the result, particularly considering that I need to return all files in data/ if idnew.log does not exist:

The only solution I can think of:

Code:

find data/ -type f >fileall.log

cp fileall.log filesome.log

if [[ -e idnew.log ]] ; then

    grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log) fileall.log >filesome.log

fi

Well, this is pretty ugly!

If you don't need to generate fileall.log (if you do, just call tee):

Code:

find data/ -type f | {

    if [[ -e idnew.log ]] ; then

        grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)

    else

        cat

    fi

} > filesome.log

Although you didn't say exactly what you mean by "elegant" or "ugly".
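For the case where fileall.log is wanted after all, the tee variant mentioned above might look roughly like this. It is only a sketch: the sandbox directory, the sample files, and the id 77 are assumptions added so the snippet can be run anywhere.

```bash
#!/usr/bin/env bash
# Sketch only: builds a throwaway sandbox; sample ids and file names are invented.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir -p data/2010/01 data/2010/02
touch data/2010/01/event_7.html data/2010/01/event_77.html data/2010/02/event_770.html
printf '%s\n' 77 > idnew.log

# tee writes the unfiltered list to fileall.log and passes it on down the pipe.
find data/ -type f | tee fileall.log | {
    if [[ -e idnew.log ]]; then
        grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
    else
        cat   # no id list: keep every file name
    fi
} > filesome.log

wc -l fileall.log filesome.log
```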

Quote:

Originally Posted by ntubski (Post 4164772)

you didn't say exactly what you mean by "elegant" or "ugly".

How true. I didn't like the temporary file, which is not needed for anything else.

Funny, I had tried something similar but it did not work with the process substitution:

Code:

if [[ -e idnew.log ]] ; then

    FILTER="grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)"

else

    FILTER=cat

fi

find data/ -type f | $FILTER > filesome.log

I did not realize you could have an if construct within the pipe.

Also, are there alternatives to my grep <(sed ...)?

Quote:

Originally Posted by hashbang#! (Post 4165178)

Funny, I had tried something similar but it did not work with the process substitution:

You would need to call eval, because the contents of an unquoted variable are only subject to word splitting (and filename expansion); special characters like "<" are taken literally. Shell scripting is just funny that way. :(
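A small experiment shows the difference. This is a sketch under made-up assumptions: the temporary ids.txt file and the two sample input lines exist only for the demonstration.

```bash
#!/usr/bin/env bash
# Illustration only: ids.txt and the sample input lines are made up.
tmp=$(mktemp -d)
printf '%s\n' 42 > "$tmp/ids.txt"

FILTER="grep -F -f <(cat '$tmp/ids.txt')"

# Unquoted expansion: the string is only split into words, so grep is
# asked to read its patterns from a file literally named "<(cat" and fails.
plain=$(printf '%s\n' a_42 b_7 | $FILTER 2>/dev/null)

# eval re-parses the string as shell code, so <(...) becomes a real
# process substitution.
evaled=$(printf '%s\n' a_42 b_7 | eval "$FILTER")

printf 'plain=[%s] evaled=[%s]\n' "$plain" "$evaled"
```

The first pipeline produces nothing because grep cannot open its pattern file; the second prints a_42.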

Code:

find ... |grep -F -f <(awk '{print "_" $0 "."}' idnew.log)

I am preprocessing the id list with awk so I can use the lightning-fast grep -F -f.

Is there a way to make grep treat any non-digit as a word boundary (a bit like the shell's $IFS)? Then the list could be used as is.

Quote:

Originally Posted by ntubski (Post 4165481)

You would need to call eval, because the contents of an unquoted variable are only subject to word splitting (and filename expansion); special characters like "<" are taken literally.

Code:

find data/ -type f | eval "$FILTER" > filesome.log

works - that's a new one to me. Whether this is any prettier (more readable) than your if within the pipe is another question.
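A third option, which needs neither eval nor an if inside the pipe, is to put the choice into a shell function. The sketch below is not from the thread: filter_ids is a hypothetical name, and the sandbox with its two sample files is there only to make it runnable.

```bash
#!/usr/bin/env bash
# Sketch only: filter_ids is a hypothetical helper name; the sandbox
# below exists just to make the example runnable.
filter_ids() {
    if [[ -e idnew.log ]]; then
        grep -F -f <(sed -r 's/(.*)/_\1./' idnew.log)
    else
        cat   # no id list: pass every file name through
    fi
}

sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir -p data/2010/01
touch data/2010/01/event_5.html data/2010/01/event_55.html

find data/ -type f | filter_ids > all.log      # idnew.log absent: both files
printf '%s\n' 5 > idnew.log
find data/ -type f | filter_ids > some.log     # idnew.log present: only event_5.html
```

Because the function body is parsed as ordinary shell code, the process substitution works without any re-evaluation of a string.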

I am using sed/awk to isolate the id in the filename so that partially matching IDs are excluded.

If grep's word boundaries could be redefined (in this case to any non-digit), this would be superfluous.

Is there something like $IFS for grep?
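As far as I know grep has no such setting, and grep -w does not help here because GNU grep counts the underscore as a word character, so the id in event_123.html is not seen as a separate word. awk does have a configurable field separator, though. The following is a sketch of that alternative, not something proposed in the thread; the sample file names and ids are invented, and no claim is made about how its speed compares with grep -F -f.

```bash
#!/usr/bin/env bash
# Sketch only: sample file names and ids are invented.
files='data/2010/01/event_12.html
data/2010/01/event_123.html
data/2010/02/event_4123.html'

# 1st input (the id list, used as is): remember every id as an array key.
# 2nd input (stdin, "-"): FS is switched to "_" or ".", so for
# .../event_<id>.html the id is the next-to-last field.
matches=$(printf '%s\n' "$files" |
    awk 'NR == FNR { want[$0]; next }
         $(NF - 1) in want' <(printf '%s\n' 123 12) FS='[_.]' -)

printf '%s\n' "$matches"
```

In the real setting the first input would be idnew.log and stdin would come from find data/ -type f. Note that the NR == FNR idiom assumes the id list is not empty.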