LinuxQuestions.org


epols 04-23-2013 10:49 AM

summing up count returned by several bash commands to divide load
 
I have a script that executes the following to count all instances with the matching regex:

zcat /path/to/logs/today/* | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l

It basically looks through every file in the today directory, finds every line that matches that regex, and returns the total count of those lines for a report I run.

This works fine on a directory that contains fewer than 15000 individual files, but any time I run it on a directory with more than 15K files, I get "argument too long", regardless of the piping that occurs after the zcat.

So the only way I can think of to accomplish this is to run it in stages, for example:

Stage 1: ls | wc -l (this returns total count of files in directory, ex: 45000)
Stage 2: Divide by 3 = 3x15000 sets of files
Stage 3: Run the command on the first 15000 files (listed alphabetically) and return that count: zcat /path/to/logs/today/* | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
Stage 4: Run the command on the second set of 15k and return a count
Stage 5: Run the command on the third set of 15K and return a count
Stage 6: Sum up the counts of all returned in all three sets

Can anyone suggest a way to achieve this? The above command is executed as part of a script that uses variables for the directory.

jpollard 04-23-2013 02:20 PM

You can try:
Code:

for i in /path/to/logs/today/* ; do
  zcat "$i"
done | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l

The major difference is that the "for i in" loop (and its wildcard expansion) is handled inside bash and is never passed as a parameter list to a new program, which is what happens with "zcat /path/to/logs/today/*". So it doesn't run into the same restriction (the memory allocated for the parameters of an exec).
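
To make the difference concrete, here is a small sketch, assuming a Linux box with getconf, gzip and zcat installed. The scratch directory and the sample log lines are made up for the demo and are not from the original script.

```shell
# Sketch only: the scratch directory and sample files are invented for this demo.
set -eu

# Upper bound, in bytes, on the arguments plus environment given to an exec'd program.
echo "ARG_MAX: $(getconf ARG_MAX)"

demo_dir=$(mktemp -d)

# Three small gzip files: one matching line, one 'Primary ID' line, one plain line each.
for n in 1 2 3; do
  printf '%%FOO_%s-bar\n%%FOO_%s- Primary ID\nplain line\n' "$n" "$n" |
    gzip > "$demo_dir/log$n.gz"
done

# The glob is expanded inside the shell; each zcat is started with a single
# file name, so its argument list stays tiny however many files there are.
loop_count=$(
  for i in "$demo_dir"/*; do
    zcat "$i"
  done | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
)
echo "matching lines: $loop_count"

rm -rf "$demo_dir"
```

The trade-off is one zcat process per file, which is slower than a single zcat but never hits the limit.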

Alternatively you could use find to serialize such access:

Code:

find /path/to/logs/today -type f -name '*' -exec zcat {} ';' | grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l

This works because find itself performs the "readdir" and the filename matching (the -name '*'), then executes zcat on each file it finds. The -type f keeps find from handing the directory itself to zcat.

In both cases, there is one zcat process per file. Otherwise you have to do something awkward like reading 1000 file names at a time to make a list, then executing one zcat for that list of 1000 -- and you still have the issue of creating that list using something like find.
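
For what it's worth, that batching is roughly what xargs does on its own. A sketch, assuming find and xargs versions that support -print0 and -0 (GNU and BSD both do). The scratch files and the batch size of 10 are illustrative only; something like 1000 would be more realistic on a big directory.

```shell
# Sketch only: scratch directory, file names and batch size are invented for this demo.
set -eu

batch_dir=$(mktemp -d)

# 25 small gzip files with one matching line each.
n=1
while [ "$n" -le 25 ]; do
  printf '%%HOST_%s-up\n' "$n" | gzip > "$batch_dir/log$n.gz"
  n=$((n + 1))
done

# find emits NUL-terminated names; xargs starts one zcat per batch of at most
# 10 names, so no single command line grows past the system limit.
batch_count=$(
  find "$batch_dir" -type f -print0 |
    xargs -0 -n 10 zcat |
    grep '%[A-Z0-9_]\+-' | grep -v 'Primary ID' | wc -l
)
echo "single pipeline: $batch_count"

# Same batches, but each batch prints its own count and awk adds them up,
# which is the staged sum the original question describes.
staged_total=$(
  find "$batch_dir" -type f -print0 |
    xargs -0 -n 10 sh -c 'zcat "$@" | grep "%[A-Z0-9_]\+-" | grep -v "Primary ID" | wc -l' sh |
    awk '{ sum += $1 } END { print sum + 0 }'
)
echo "summed per batch: $staged_total"

rm -rf "$batch_dir"
```

If the batch size doesn't matter, "find ... -exec zcat {} +" should do much the same thing, with find packing as many names into each zcat as will fit.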

epols 04-24-2013 07:58 AM

You, Sir, are awesome. Thank you so much for taking the time to think about the problem and come up with a great solution. I ended up going with your second solution as it'll fit into my original script much more easily. I owe you a beer!


All times are GMT -5.