Bash script to fgrep a large file. With list as source for searching.

the_file · 01-17-2011, 01:02 PM

Hi,
I need to fgrep a list of things which are in a file. The file in which I will do the searching is a large text file, and I need fgrep to output each item from the list as a file, with the item from the list as the file name.

It's kind of like this:

./script list.txt largefile.txt

output would be

jack.txt
screen.txt
blah.txt

I don't know bash all too well since I am still learning it. Can anybody write this kind of thing?

Thanks in advance.

ruario · 01-17-2011, 01:27 PM

Assuming list.txt contains:

Code:

jack
screen
blah

And largefile.txt contained something like:

Code:

hello
jack
screen
dog
john
food
street
blah
corner
clock
bike

Then I think what you are asking for is:

Code:

fgrep -f list.txt largefile.txt | sed "s/$/.txt/"

So your script would therefore look something like:

Code:

#!/bin/sh
fgrep -f "$1" "$2" | sed 's/$/.txt/'

ruario · 01-17-2011, 01:33 PM

If largefile.txt looks more like:

Code:

There was a guy called jack. He liked to watch tv.
But only if the tv had a large screen.

He tried to convince friends that this was the best
way but they found boring and hadly listened. To
them it sounded like blah.

Then you probably want:

Code:

fgrep -of list.txt largefile.txt | sed "s/$/.txt/"

and your script would therefore look something like:

Code:

#!/bin/sh
fgrep -of "$1" "$2" | sed 's/$/.txt/'
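
Note that both versions only print names to standard output; neither creates any files. A small sketch of the difference between the two, using the sample data above. The scratch directory and the variable names are just for the demo, and I use `grep -F`, which is the same thing as `fgrep`:

```shell
#!/bin/sh
# Demo only: build the two sample files in a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' jack screen blah > "$demo_dir/list.txt"
cat > "$demo_dir/largefile.txt" <<'EOF'
There was a guy called jack. He liked to watch tv.
But only if the tv had a large screen.

He tried to convince friends that this was the best
way but they found boring and hadly listened. To
them it sounded like blah.
EOF

# Without -o: every whole line that contains one of the items.
whole_lines=$(grep -F -f "$demo_dir/list.txt" "$demo_dir/largefile.txt")

# With -o: only the matched text itself, one match per line.
matches_only=$(grep -F -o -f "$demo_dir/list.txt" "$demo_dir/largefile.txt")

# Append .txt to each match, which is all the pipeline above does.
printf '%s\n' "$matches_only" | sed 's/$/.txt/'
```

With this sample data the last command prints jack.txt, screen.txt and blah.txt, while `$whole_lines` holds the three full sentences that contain those words.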

the_file · 01-17-2011, 02:13 PM

Unfortunately those scripts didn't work at all =(

I need each item from the list to become a file containing the results from the large file. Essentially, I want to grab every whole line that contains something from the list. And the search items do have white space =/

But nonetheless I think we're getting close.

ruario · 01-17-2011, 02:39 PM

I honestly cannot imagine what you are asking. Do you want to actually create files? Could you provide an example of what list.txt and largefile.txt might look like? And if you are trying to output files, what would you expect the contents of, say, "jack.txt" to look like after your script successfully completed?

ruario · 01-17-2011, 04:01 PM

Ok, I thought about this a little and I think I understand what you want.

Assuming list.txt contains:

Code:

jack
screen
blah

and largefile.txt contained:

Code:

There was a guy called jack. He liked to watch tv
but only if the tv had a large screen.

He tried to convince his friends to join him
but thought this sounded like a load of blah.

You want your command to produce three files.

A jack.txt file that contains:

Code:

There was a guy called jack. He liked to watch tv

A screen.txt file that contains:

Code:

but only if the tv had a large screen.

A blah.txt file that contains:

Code:

but thought this sounded like a load of blah.

Is this what you had in mind??

ruario · 01-17-2011, 04:12 PM

Hmm ... OK, I assume you wanted to make a script because you thought this would be hard, but if it were me I'd probably not bother with a script and just do it as one line with the wonderful GNU Parallel.

Code:

parallel -a list.txt 'fgrep "{}" largefile.txt > "{}.txt"'

P.S. If you don't have parallel, search for it in your distro's repository and install it. It is great for stuff like this and a whole lot more!

ruario · 01-17-2011, 04:38 PM

Re-reading your original request it was actually quite clear. For some reason I had presumed this was just part of some script that you were working on. I hadn't realised that you had summed up your entire requirements. Sorry for the confusion before!

grail · 01-17-2011, 06:06 PM

Well not as clean as parallel (which I don't have either), the following awk can work:

Code:

awk 'NR==FNR{words[i++]=$0;next}{for(x=0;x<i;x++)if(index($0,words[x]))print > (words[x] ".txt")}' list.txt largefile.txt

Or with bash:

Code:

#!/bin/bash

while IFS= read -r word
do
    grep -F -- "$word" largefile.txt > "${word}.txt"
done < list.txt
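
If you want the loop above to take both file names as arguments, the way the original `./script list.txt largefile.txt` request was phrased, a minimal sketch could look like this. The function name `split_by_list` is made up for the example, and it assumes no item contains a `/`, since each item is used directly as a file name.

```shell
#!/bin/bash
# Hypothetical helper: for each line of the list file, write every line of
# the large file that contains it to "<item>.txt" in the current directory.
split_by_list() {
    local list="$1" large="$2" word
    # IFS= and -r keep surrounding spaces and backslashes in an item intact.
    while IFS= read -r word; do
        [ -n "$word" ] || continue      # skip empty lines in the list
        # -F: fixed string, not a regex; --: an item may start with a dash.
        # "|| true": an item with no match just leaves an empty file.
        grep -F -- "$word" "$large" > "${word}.txt" || true
    done < "$list"
}

# When saved as a script, pass the two file names straight through.
if [ "$#" -eq 2 ]; then
    split_by_list "$1" "$2"
fi
```

Saved as `script` and made executable, it would be run as `./script list.txt largefile.txt`. Items with spaces in them end up as file names with spaces, e.g. `large screen.txt`.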

ruario · 01-17-2011, 11:28 PM

Quote:

Originally Posted by grail

Well not as clean as parallel (which I don't have either)

Consider getting it. What I did was just a really simple demo of what is possible. Check out these two introductory videos by the parallel author himself if you really want to get a glimpse of what is possible:

http://www.youtube.com/watch?v=OpaiGYxkSuQ
http://www.youtube.com/watch?v=P40akGWJ_gY

If for some reason your favoured distro does not include parallel, you can always get it from here:

http://www.gnu.org/software/parallel/

There are links to rpms and debs as well as the source. I can't recommend it highly enough.

grail · 01-18-2011, 12:07 AM

@ruario - cheers

will check it out

tange · 01-22-2011, 08:47 PM

Quote:

Originally Posted by ruario

Code:

parallel -a list.txt 'fgrep "{}" largefile.txt > "{}.txt"'

Here is a tiny optimization. GNU Parallel is quite liberal in quoting, so you only need to quote special shell chars (in this case the >):

Code:

parallel -a list.txt fgrep {} largefile.txt \> {}.txt

This will do The Right Thing even if list.txt contains lines with words and spaces.

I know it is hard to get used to, when you are used to xargs' need for quoting everything.

/Ole
PS: Thanks for http://my.opera.com/ruario/blog/2011...h-gnu-parallel

ruario · 01-23-2011, 07:50 AM

@tange: Wow a reply from the Parallel author himself!

Thanks for the quoting tip. Yeah that might take some getting used to but I can see it would make things so much more readable when applied to more complex examples.

I'm glad you read my blog post. I had been meaning to write it for a while and it was actually this thread that reminded me to do it. I only touch on the basic stuff there because the few readers I have tend to be those interested in Opera development, so I wanted to use an example that would mean something to them. Also, you have already covered the more powerful stuff in detail in the documentation you provide.

P.S. Thanks for Parallel. I couldn't live without it now. I just hope a few more distros start to include it by default.

ruario · 01-24-2011, 01:08 AM

@tange: I decided to write another post. Once again my example is recursive unpacking of an archive, but this time I pull apart a deb for the purpose of editing it and then put it back together (both times using Parallel). Hopefully this will be interesting to a wider range of people and hence encourage more people to take a look at your software.

http://my.opera.com/ruario/blog/2011...e-fun-with-gnu

ruario · 01-24-2011, 03:42 PM

Whilst you obviously should use parallel, at a push on a system without it installed you can force xargs to do this as follows:

Code:

xargs -a list.txt -d "\n" -I {} bash -c "fgrep '{}' largefile.txt > '{}.txt'"
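
One thing to watch with the command above: the item is pasted straight into the `bash -c` string, so an item containing a single quote would break it. Here is a sketch that hands the item to the inner shell as a positional parameter instead. It still assumes GNU xargs for `-a` and `-d`, and the function name `split_with_xargs` is made up for the example:

```shell
#!/bin/sh
# Hypothetical wrapper around the xargs approach (GNU xargs assumed).
split_with_xargs() {
    # $1 = list file, $2 = large file.
    # Each list line becomes "$1" inside the inner sh and the large file
    # becomes "$2", so quotes or spaces in an item are never re-parsed.
    # "|| true" keeps an item with no match from failing the whole run.
    xargs -a "$1" -d '\n' -I {} \
        sh -c 'grep -F -- "$1" "$2" > "$1.txt" || true' _ {} "$2"
}
```

You would call it as `split_with_xargs list.txt largefile.txt` from the directory where the output files should go.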