Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux-related and doesn't seem to fit in any other forum, then this is the place.
I have a big stack of .pdf files that I need to reference, and I would like to be able to search them quickly for key words.
For example, it would be great if I could get output that simply lists the filenames of the PDFs containing the word `bananas`.
I would like to do this without creating a mountain of text files, but I am struggling a little as my bash skills are not up to much.
Here is what I have so far:
for i in *.pdf; do pdftotext "$i" -; done
I would like to pipe this through `grep -l` to filter for the files containing my key words, but I am not sure that this will work, as I think it is only the contents of the file, and not the filename, that `pdftotext` passes along. I've tried piping to grep and redirecting standard output, but so far everything has been doomed to failure.
I can do it if I use pdftotext to create text files first, then run grep on those, but I am in search of a more elegant solution!
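Roughly like this, with the intermediate text files being exactly the clutter I want to avoid:

```shell
# Convert every PDF to a sibling .txt file, then let grep -l
# list the text files that contain the keyword.
for f in *.pdf; do
    pdftotext "$f" "${f%.pdf}.txt"
done
grep -l bananas *.txt
```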
It isn't actually impossible: `pdftotext` does write to standard out when you pass `-` as the output filename, which is exactly what your loop already does. The snag is `grep -l`, which on piped input can only ever report `(standard input)`, never the name of the PDF, so you have to test each file's output separately and print the filename yourself.
The best solution I can think of is something like this:
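A minimal sketch, assuming `pdftotext` (from poppler-utils) is on the PATH and `bananas` stands in for your key word:

```shell
# Print the name of every PDF whose extracted text contains the
# keyword. grep -q stays silent and only sets the exit status,
# which && then tests.
for f in *.pdf; do
    pdftotext "$f" - 2>/dev/null | grep -qi 'bananas' && printf '%s\n' "$f"
done
```

`grep -i` makes the match case-insensitive; drop it if you want exact case.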
But don't take the output to be complete! This method won't match words that the PDF hyphenates across a line break, e.g. `ba-` at the end of one line and `nana` at the start of the next. It would be best to strip the hyphen-newline sequences from the pdftotext output before you grep it; otherwise, don't rely on the one-liner too much.
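One way to do that join, sketched with a sed loop that slurps the whole text (the `bananas` keyword is again just a stand-in):

```shell
# Read the entire extracted text into sed's pattern space, delete
# every hyphen-newline pair so "ba-\nnana" becomes "banana",
# then grep as before.
for f in *.pdf; do
    pdftotext "$f" - 2>/dev/null \
        | sed -e ':a' -e 'N' -e '$!ba' -e 's/-\n//g' \
        | grep -qi 'bananas' && printf '%s\n' "$f"
done
```

Note this also glues together any legitimately hyphenated word that happens to break at end of line, so expect a few odd joins.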
[EDIT] Checked it on one paper, and it seems it's not line breaks that matter so much as page breaks, especially when there is header or footer text, which would be very hard to strip with sed.