Summing up the counts returned by several bash commands to divide the load
I have a script that executes the following to count all lines that match a regex:
zcat /path/to/logs/today/* | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
It basically looks through every file in the today directory, finds every line that matches that regex, and returns the total number of those lines for a report I run.
This works fine on a directory that contains fewer than 15,000 individual files, but any time I run it on a directory with more than 15K files, I get "argument too long", regardless of the piping that occurs after the zcat.
So the only way I can think of to accomplish this is to run it in stages, for example:
Stage 1: ls | wc -l (this returns total count of files in directory, ex: 45000)
Stage 2: Divide by 3 = 3x15000 sets of files
Stage 3: Run the command on the first 15000 files (listed alphabetically) and return that count: zcat /path/to/logs/today/* | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
Stage 4: Run the command on the second set of 15k and return a count
Stage 5: Run the command on the third set of 15K and return a count
Stage 6: Sum up the counts of all returned in all three sets
Can anyone suggest a way to achieve this? The above command is executed as part of a script that uses variables for the directory.
You can try:
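A minimal sketch of one such approach, assuming a plain shell loop over the files. The function name `count_matches_loop` and its directory argument are illustrative and not from the original post; the grep patterns are the ones from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count matching lines across all gzipped logs in a directory.
count_matches_loop() {
    local dir=$1
    local f
    # The glob is expanded inside the shell and each file is handed to its
    # own zcat, so no single command ever receives the whole file list.
    for f in "$dir"/*; do
        zcat "$f"
    done | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
}

# Usage: count_matches_loop /path/to/logs/today
```

Because the whole loop feeds a single pipeline, `wc -l` already reports the combined total, so no per-batch counts need to be summed.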
Alternatively you could use find to serialize such access:
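A minimal sketch of what a find-based version could look like, assuming the `-exec ... \;` form, which starts one command per file. The function name `count_matches_find` is illustrative, and `-maxdepth 1` is an added assumption (supported by GNU and BSD find) that keeps find from descending into subdirectories, as the original glob would not have.

```shell
#!/usr/bin/env bash
# Hypothetical helper: same count, with find walking the directory.
count_matches_find() {
    local dir=$1
    # find hands each file name to a separate zcat invocation, so the
    # length of the file list never reaches a command line.
    find "$dir" -maxdepth 1 -type f -exec zcat {} \; \
        | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
}

# Usage: count_matches_find "$LOGDIR"   # LOGDIR is a placeholder variable
```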
In both cases, there is one zcat process per file. Otherwise you would have to do something awkward, such as reading file names 1000 at a time into a list and then running one zcat on each list of 1000, and you would still need something like find to build that list.
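For completeness, the batching idea does not have to be done by hand: xargs can group file names into fixed-size batches. This is a sketch under assumptions, not part of the original answer. The function name `count_matches_batched` and the default batch size of 1000 are illustrative, and `-print0` / `-0` are GNU/BSD extensions rather than POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one zcat per batch of files instead of one per file.
count_matches_batched() {
    local dir=$1
    local batch=${2:-1000}   # illustrative default batch size
    # NUL-separated names survive spaces and newlines in file names;
    # xargs starts a new zcat after every $batch names.
    find "$dir" -maxdepth 1 -type f -print0 \
        | xargs -0 -n "$batch" zcat \
        | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
}

# Usage: count_matches_batched /path/to/logs/today 1000
```

This trades the per-file process start for fewer, larger zcat runs; the output still flows into one pipeline, so the count is again a single total.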
You, Sir, are awesome. Thank you so much for taking the time to think about the problem and come up with a great solution. I ended up going with your second solution as it'll fit into my original script much more easily. I owe you a beer!