split a file and process the resulting files in parallel

matajaz · 10-28-2013, 10:38 AM

Hi,

I have huge (100 GB) files that I need to post-process after they have been generated.

I wonder if there is an easy way to split a file and then process each of the files produced by the split.
I mean doing it automatically on one command line, without waiting for split to finish before starting the processing commands.

Here is an example where I want to split a file and then count the occurrences of the word McDonald in each part.

split -b 1000m -a 3 sourcefile.txt resultfile | foreach splitted file do "grep -c McDonald"

I hope you understand what I want to do.

Br Mathias
PS. The file server is very fast, so I do not expect I/O wait to be the limiting factor.

pan64 · 10-29-2013, 06:27 AM

Code:

# step 1: split the file into smaller parts
split ....
# step 2: run your grep on all the parts at the same time
for f in <list of splitted parts>
do
grep -c McDonald "$f" > "$f.count"
done
# step 3: wait for the results
wait
# step 4: sum up the results
for f in <list of splitted parts>
do
SUM=$((SUM+$(cat "$f.count")))
done
# but this will probably take longer than a single grep on the unsplit input file.

This is not a runnable script, just a plan for implementing it.

smeezekitty · 10-29-2013, 01:18 PM

That won't actually run in parallel because there is no "&" after the grep command, so each grep finishes before the next one starts.
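
A minimal sketch of the same plan with the greps sent to the background, assuming a POSIX shell. The sample data, the 2-line part size and the `part.` prefix are made up for the demo; for the real file you would keep the original `split -b 1000m -a 3` and your own prefix.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory; use your own sourcefile.txt instead.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'McDonald\nfoo\nMcDonald bar\nbaz\nMcDonald\nqux\n' > sourcefile.txt

# Step 1: split into 2-line parts named part.aaa, part.aab, ...
split -l 2 -a 3 sourcefile.txt part.

# Step 2: start one grep per part; the trailing "&" puts each one in the
# background, so the loop does not wait for it to finish.
for f in part.???; do
    grep -c McDonald "$f" > "$f.count" &
done

# Step 3: block until every background job has finished.
wait

# Step 4: sum the per-part counts.
total=0
for f in part.???.count; do
    total=$((total + $(cat "$f")))
done
echo "matching lines: $total"
```

Note that `grep -c` counts matching lines, not individual occurrences. Also, this starts one grep per part all at once; with around a hundred 1000m parts it may be worth capping the number of concurrent jobs, for example with `xargs -P` where available.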

One thing to watch out for is the unlikely but possible case where split chops the desired word in two, which will cause that instance to be missed.
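
A small sketch of that edge case and one way around it, assuming GNU coreutils split: its `-C SIZE` (`--line-bytes`) option fills each part with whole lines up to SIZE bytes instead of cutting at an exact byte offset. The sample data, the 20-byte size and the `count_parts` helper are made up for the demo.

```shell
#!/bin/sh
# Hypothetical demo data: three 13-byte lines, each containing the word once.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'aaa McDonald\nbbb McDonald\nccc McDonald\n' > sourcefile.txt

# Helper (illustrative only): sum grep -c over all files passed as arguments.
count_parts() {
    total=0
    for f in "$@"; do
        total=$((total + $(grep -c McDonald "$f")))
    done
    echo "$total"
}

# Cut at exact byte offsets: the second "McDonald" straddles two parts.
split -b 20 sourcefile.txt bytes.

# Cut only at line ends (GNU split): no line is divided between parts.
split -C 20 sourcefile.txt lines.

by_bytes=$(count_parts bytes.*)
by_lines=$(count_parts lines.*)
echo "split -b: $by_bytes, split -C: $by_lines"
```

For the real file the same idea would be `split -C 1000m -a 3 sourcefile.txt resultfile`. This only helps if the input has line breaks and no single line is longer than the part size.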