Counting multiple text files based on content

cave_dweller · 02-26-2015, 05:11 AM

Hi All,

I'd be grateful for some guidance as my Bash skills aren't cutting the mustard!

I need to process a large number (thousands) of small text files, to find the total number of files that contain certain combinations of text strings. The files are spread through multiple subdirectories.

The 'variables' in each file are a device number such as [001, 002, 003... 006], a date (mm/yy) string in the format "/10/14" (October 14) and a text string, which will either be present or not. If present it will read 'Settled'.

My need is to produce a simple report that says, for example:

001, /09/14 - Settled: 325
001, /09/14 - Total files: 817
001, /10/14 - Settled: 842
001, /10/14 - Total files: 2812
002, /09/14 - Settled: 665
002, /09/14 - Total files: 1823
etc etc.
(Format isn't important - I just need a count for each category).

I'm guessing I need to do something like:

Code:

For each [001... 006]
    For each [valid month text string]
        find files containing (<device> AND <date>)
        Count total matching files - write number to file
        find files containing (<device> AND <date> AND "Settled")
        Count total matching files - write number to file

I tried using piped grep commands, moving matching files to a directory, grepping them again and then filtering the output list through wc -l, but this got messy fast!

I'm looking for a way to input the device numbers (001, 002 etc) and the valid month dates (from a file maybe?) and loop through them counting the matches.
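A minimal sketch of that loop in plain shell, assuming the device and date can be matched as fixed strings anywhere in a file (so a stray "001" elsewhere in a file would also count) and that file names contain no newlines. The names count_files and report, and the two list files (one value per line), are made up for illustration.

```shell
# Hypothetical helper: count files under ROOT that contain every given string.
# usage: count_files ROOT DEVICE DATE [EXTRA]
count_files() {
    cf_root=$1 cf_dev=$2 cf_date=$3 cf_extra=${4:-}
    # First pass: list the files that contain the device string.
    find "$cf_root" -type f -exec grep -lF -- "$cf_dev" {} + |
    while IFS= read -r cf_file; do
        # Keep a file only if it also holds the date (and EXTRA, if given).
        grep -qF -- "$cf_date" "$cf_file" || continue
        [ -z "$cf_extra" ] || grep -qF -- "$cf_extra" "$cf_file" || continue
        printf '%s\n' "$cf_file"
    done | wc -l | tr -d ' '
}

# Hypothetical driver: loop over every device/date pair read from two files.
# usage: report ROOT DEVICES_FILE DATES_FILE
report() {
    while IFS= read -r dev; do
        while IFS= read -r mon; do
            printf '%s, %s - Settled: %s\n' "$dev" "$mon" \
                "$(count_files "$1" "$dev" "$mon" Settled)"
            printf '%s, %s - Total files: %s\n' "$dev" "$mon" \
                "$(count_files "$1" "$dev" "$mon")"
        done < "$3"
    done < "$2"
}

# Example call (paths are placeholders):
# report /path/to/files devices.txt dates.txt
```

This rescans the whole tree once per device/date pair, so it is simple rather than fast; keep the two list files outside the tree being searched so they are not counted themselves.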

I'm starting to think that Bash isn't the right tool, and something like Perl might be better, but I'm clueless with Perl! Anyone have suggestions please?

Thanks all.

Frustrated from Wales.

pan64 · 02-26-2015, 05:20 AM

Hm, something like this:
grep -rH '[0-9][0-9][0-9], /../..' .
will collect all the files/lines with the required info (the pattern probably needs adjusting).
Pipe the output into awk, where you can easily count devices, device and date, or anything else.
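A rough sketch of the grep-into-awk idea, assuming purely for illustration that device and date sit together on one line as "001, /10/14" and that file names contain no colon. count_by_key is a made-up name; it reads the "file:line" records that grep -rH prints and counts each file once per device/date key.

```shell
# Hypothetical helper: tally distinct files per "NNN, /mm/yy" key.
# stdin: "path:matching line" records as printed by grep -rH
count_by_key() {
    awk '{
        # Pull the "NNN, /mm/yy" key out of the record.
        if (match($0, /[0-9][0-9][0-9], \/[0-9][0-9]\/[0-9][0-9]/)) {
            key = substr($0, RSTART, RLENGTH)
            # Everything before the first colon is taken as the file name.
            file = substr($0, 1, index($0, ":") - 1)
            # Count each file only once per key.
            if (!seen[file, key]++) count[key]++
        }
    }
    END { for (k in count) print k ": " count[k] }' | sort
}

# Example call:
# grep -rH '[0-9][0-9][0-9], /[0-9][0-9]/[0-9][0-9]' . | count_by_key
```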

syg00 · 02-26-2015, 06:24 AM

No need for grep, awk has (regex) pattern matching - likewise for perl.
Depending on the data structure (is it well formatted, with known locations in each record for the data, consistent field separators, a device number always present?), awk would be my weapon of choice. Good logic tests, easy counters.
Likewise for perl of course ....

cave_dweller · 02-26-2015, 07:08 AM

Quote:

Originally Posted by syg00

No need for grep, awk has (regex) pattern matching - likewise for perl.
Depending on the data structure (is it well formatted, with known locations in each record for the data, consistent field separators, a device number always present?), awk would be my weapon of choice. Good logic tests, easy counters.
Likewise for perl of course ....

The files are variable length, but the field separators are consistent and all the information (apart from the 'Settled' string) is always present. Think of a layout like a cash till receipt - header, variable length body, footer. I don't care about the content of the body, but I can't always rely on exactly where the data I want will be, hence my first effort using grep.

I'll brush up on awk - it's been a while!

Thanks for the input folks.

syg00 · 02-26-2015, 07:21 AM

Post some representative records - cleansed if need be, if the data is sensitive.

cave_dweller · 02-26-2015, 08:53 AM

Quote:

Originally Posted by syg00

Post some representative records - cleansed if need be, if the data is sensitive.

Thanks. Here's a sample of one that does NOT contain the 'settled' string:

Code:

Header deleted
Body of File deleted
c0!                      SUBTOTAL     12.00   
c0!  ! T O T A L      12.00! 
c0!  NET TOTAL           VAT  A        7.43   
c0!  VAT                 00.0%         0.00   
c0!  NET TOTAL           VAT  B        3.81   
c0!  VAT                 20.0%         0.76   
c0! *4693 772/080/004/121 30.06.14 08:54 A-00 
Footer

I need to filter/group by a substring within "772/080/004/121" (the '004' could be '001' to '006' or so). I also need the month and year from the datestamp, i.e. the '06.14' in '30.06.14' (other files will have 07.14, 08.14 and so on). (I previously described the date as using '/' separators rather than '.' - sorry!)

In some cases there will be an additional string on the bottom line, like this:

Code:

c0!  VAT                 20.0%         0.76   
c0! *4693 772/080/004/121 30.06.14 08:54 A-00 
Footer
Settled

(The c0! is part of the text file formatting)
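Since the position of the stamp line can't be relied on, one way to pick out the two values is to match the shape of the stamp itself rather than count fields. This is only a sketch under the assumption that every stamp looks like the sample above (four slash-separated groups, then dd.mm.yy); stamp_key is a made-up name.

```shell
# Hypothetical helper: print "DEVICE MM.YY" for each stamp line found.
# stdin: receipt text
stamp_key() {
    awk '
    match($0, /[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+ +[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]/) {
        stamp = substr($0, RSTART, RLENGTH)   # e.g. "772/080/004/121 30.06.14"
        n = split(stamp, part, "[/ ]+")       # 772 080 004 121 30.06.14
        print part[3], substr(part[n], 4)     # third group, then month.year
    }'
}

# Example call:
# stamp_key < some_receipt.txt
```

For the sample stamp line this prints "004 06.14"; lines without a stamp produce no output.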

syg00 · 02-27-2015, 03:28 AM

No way you can do that without some logic capabilities - awk or perl or similar.
You need to find the line with device/date and capture that data - and set a flag. Then go look for "Settled" on subsequent records. Maybe you could use this as a start (lots of presumptions and no manipulation on my part)

Code:

awk -F "[[:space:]/]+" '$5 > 0 && $5 < 7 {a[$5][$7]++} END{for (i in a) {for (j in a[i]) print i,j,"Total files ",a[i][j]}}' files.*

syg00 · 02-27-2015, 06:34 AM

Got bored while watching the soccer - how about something like this

Code:

awk -F "[[:space:]/]+" '/:/ && $5 > 0 && $5 < 7 {sv1=$5 ; sv2 = substr($7,4,5) ; a[sv1][sv2]++ } ; /Settled/ {set[sv1][sv2]++} END{for (i in a) {for (j in a[i]) print i,j,"Total files: ",a[i][j]"\n"i,j,"Settled: "set[i][j]+0}}' files.*
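For files spread over subdirectories, and for an awk without gawk's arrays of arrays, the same idea can be split into two stages. This is a sketch, not a drop-in replacement: scan_tree and tally are made-up names, the field positions ($5 for the device, $7 for the date) are the same presumptions as above, and "[ \t/]+" stands in for "[[:space:]/]+". The first stage prints one "DEVICE MM.YY SETTLED" record per file, the second adds them up, so it still works if find has to start awk more than once.

```shell
# Hypothetical stage 1: one "DEVICE MM.YY 0|1" record per file under ROOT.
# usage: scan_tree ROOT
scan_tree() {
    find "$1" -type f -exec awk -F '[ \t/]+' '
    function emit() {
        if (dev != "") print dev, mon, settled
        dev = mon = ""; settled = 0
    }
    FNR == 1 { emit() }       # a new file starts: flush the previous one
    /:/ && $5 > 0 && $5 < 7 { dev = $5; mon = substr($7, 4, 5) }
    /Settled/ { settled = 1 } # remembered until the file ends
    END { emit() }' {} +
}

# Hypothetical stage 2: add up the per-file records from stdin.
tally() {
    awk '{
        key = $1 " " $2
        total[key]++
        if ($3) settled[key]++
    }
    END {
        for (k in total) {
            print k, "Total files:", total[k]
            print k, "Settled:", settled[k] + 0
        }
    }' | sort
}

# Example call:
# scan_tree . | tally
```

Each file is counted once however many times "Settled" appears in it, and a "Settled" in a file with no stamp line is ignored.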

cave_dweller · 02-27-2015, 07:54 AM

Quote:

Originally Posted by syg00

Got bored while watching the soccer - how about something like this

Code:

awk -F "[[:space:]/]+" '/:/ && $5 > 0 && $5 < 7 {sv1=$5 ; sv2 = substr($7,4,5) ; a[sv1][sv2]++ } ; /Settled/ {set[sv1][sv2]++} END{for (i in a) {for (j in a[i]) print i,j,"Total files: ",a[i][j]"\n"i,j,"Settled: "set[i][j]+0}}' files.*

Wow. It's gonna take a while to grok that, but many thanks! (I'm gonna assume the soccer wasn't great!).

I'll give it a whirl on some sample data and report back.