LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 02-26-2015, 05:11 AM   #1
cave_dweller
LQ Newbie
 
Registered: Feb 2015
Location: South Wales
Distribution: Various
Posts: 7

Rep: Reputation: Disabled
Counting multiple text files based on content


Hi All,

I'd be grateful for some guidance as my Bash skills aren't cutting the mustard!

I have a need to process a large number (thousands) of small text files, to find the total numbers of files that contain certain combinations of text strings. The files are spread through multiple subdirectories.

The 'variables' in each file are a device number such as [001, 002, 003... 006], a date (mm/yy) string in the format "/10/14" (October 14) and a text string, which will either be present or not. If present it will read 'Settled'.

My need is to produce a simple report that says, for example:

001, /09/14 - Settled: 325
001, /09/14 - Total files: 817
001, /10/14 - Settled: 842
001, /10/14 - Total files: 2812
002, /09/14 - Settled: 665
002, /09/14 - Total files: 1823
etc etc.
(Format isn't important - I just need a count for each category).

I'm guessing I need to do something like:

Code:
For each [001... 006]
For each [valid month text string]
find files containing (<device> AND <date>)
Count total matching files - write number to file
find files containing (<device> AND <date> AND "Settled")
Count total matching files - write number to file
I tried using piped grep commands, moving matching files to a directory, grepping them again and then filtering the output list through wc -l, but this got messy fast!

I'm looking for a way to input the device numbers (001, 002 etc) and the valid month dates (from a file maybe?) and loop through them counting the matches.

I'm starting to think that Bash isn't the right tool, and something like Perl might be better, but I'm clueless with Perl! Anyone have suggestions please?

Thanks all.

Frustrated from Wales.
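
For reference, the loop outlined in the pseudocode above can be sketched in plain POSIX shell. This is only a rough sketch under stated assumptions: the function names `count_files_with_all` and `report` are made up for illustration, the device numbers and month strings are assumed to live one per line in two list files, and every string is matched as fixed text anywhere in a file.

```shell
#!/bin/sh
# Hypothetical helper: count the files under directory $1 that contain
# every one of the remaining arguments (fixed strings, not regexes).
count_files_with_all() {
    dir=$1; shift
    n=0
    # find prints one path per line; paths containing newlines would break this.
    find "$dir" -type f | {
        while IFS= read -r f; do
            ok=1
            for s in "$@"; do
                grep -qF -- "$s" "$f" || { ok=0; break; }
            done
            [ "$ok" -eq 1 ] && n=$((n + 1))
        done
        echo "$n"
    }
}

# Hypothetical driver: $1 = data directory, $2 = file listing device
# numbers, $3 = file listing month strings (both one entry per line).
report() {
    data_dir=$1; devices_file=$2; months_file=$3
    while IFS= read -r dev; do
        while IFS= read -r mon; do
            settled=$(count_files_with_all "$data_dir" "$dev" "$mon" Settled)
            total=$(count_files_with_all "$data_dir" "$dev" "$mon")
            printf '%s, %s - Settled: %s\n' "$dev" "$mon" "$settled"
            printf '%s, %s - Total files: %s\n' "$dev" "$mon" "$total"
        done < "$months_file"
    done < "$devices_file"
}
```

It would be called as `report /path/to/data devices.txt months.txt`, with the two list files kept outside the data directory so they are not counted themselves. Every file is reread for every device/month combination, so this gets slow on thousands of files.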
 
Old 02-26-2015, 05:20 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,840

Rep: Reputation: 7308
hm. something like this:
grep -rH '[0-9][0-9][0-9], \/..\/..'
will collect all the files/lines with the required info (probably the pattern is not ok)
pipe the output into awk, where you can easily count devices, device and date or anything else.
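
A rough sketch of that grep-into-awk pipeline follows. It assumes, as the pattern above does, that device and date sit together on one line in the form `001, /09/14`, which may well not match the real files. The function name `count_with_grep_awk` is made up, and `-r`, `-H` and `-o` are common grep extensions rather than POSIX options.

```shell
#!/bin/sh
# Hypothetical helper: $1 = directory to search recursively.
count_with_grep_awk() {
    # -o prints only the matched text, -H prefixes each match with "path:".
    grep -rHoE '[0-9][0-9][0-9], /[0-9][0-9]/[0-9][0-9]|Settled' "$1" |
    awk -F: '
    {
        # The match is the last colon-separated field, the path is the rest.
        text = $NF
        file = substr($0, 1, length($0) - length(text) - 1)
        if (text == "Settled") settled[file] = 1
        else key[file] = text
    }
    END {
        # Count each file once under its device/date key.
        for (file in key) {
            k = key[file]
            total[k]++
            if (file in settled) nset[k]++
        }
        for (k in total) {
            printf "%s - Settled: %d\n", k, nset[k]
            printf "%s - Total files: %d\n", k, total[k]
        }
    }' | LC_ALL=C sort
}
```

Keying the awk arrays on the file name means a file is counted once however many lines match in it.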
 
Old 02-26-2015, 06:24 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
No need for grep, awk has (regex) pattern matching - likewise for perl.
Depending on data structure - is it well formatted, known location(s) in each record for the data, consistent field separator(s), device number always present ? ... - awk would be my weapon of choice. Good logic tests, easy counters.
Likewise for perl of course ....
 
Old 02-26-2015, 07:08 AM   #4
cave_dweller
LQ Newbie
 
Registered: Feb 2015
Location: South Wales
Distribution: Various
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00
No need for grep, awk has (regex) pattern matching - likewise for perl.
Depending on data structure - is it well formatted, known location(s) in each record for the data, consistent field separator(s), device number always present ? ... - awk would be my weapon of choice. Good logic tests, easy counters.
Likewise for perl of course ....
The files are variable length, but the field separators are consistent and all the information (apart from the 'settled' string) is always present. Think of a layout like a cash till receipt - header, variable length body, footer. I don't care about the content of the body, but I can't always rely on exactly where the data I want will be, hence my first effort using grep.

I'll brush up on awk - it's been a while!

Thanks for the input folks.
 
Old 02-26-2015, 07:21 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Post some representative records - cleansed if need be, if the data is sensitive.
 
Old 02-26-2015, 08:53 AM   #6
cave_dweller
LQ Newbie
 
Registered: Feb 2015
Location: South Wales
Distribution: Various
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00
Post some representative records - cleansed if need be, if the data is sensitive.
Thanks. Here's a sample of one that does NOT contain the 'settled' string:


Code:
Header deleted
Body of File deleted
c0!                      SUBTOTAL     12.00   
c0!  ! T O T A L      12.00! 
c0!  NET TOTAL           VAT  A        7.43   
c0!  VAT                 00.0%         0.00   
c0!  NET TOTAL           VAT  B        3.81   
c0!  VAT                 20.0%         0.76   
c0! *4693 772/080/004/121 30.06.14 08:54 A-00 
Footer
I need to filter/group by a substring within "772/080/004/121" (the '004' could be '001' to '006' or so). I also need the month and year from the datestamp, i.e. the '06.14' in '30.06.14' (06.14, 07.14, 08.14 for example). (I previously described the date as using slashes, as in mm/yy, when it actually uses dots, as in dd.mm.yy - sorry!).

In some cases there will be an additional string on the bottom line, like this:

Code:
c0!  VAT                 20.0%         0.76   
c0! *4693 772/080/004/121 30.06.14 08:54 A-00 
Footer
Settled
(The c0! is part of the text file formatting)
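
To make the layout concrete, here is a small sketch of how awk would split that datestamp line if runs of blanks and slashes are treated as one separator. The function name `parse_stamp` is hypothetical, and `[ \t/]+` is used as the separator simply because it works in older awk implementations too.

```shell
#!/bin/sh
# Hypothetical helper: read datestamp lines on stdin, print "device mm.yy".
# With blanks and slashes as separators the sample line splits into
#   $1=c0!  $2=*4693  $3=772  $4=080  $5=004  $6=121  $7=30.06.14  $8=08:54 ...
parse_stamp() {
    # substr($7, 4, 5) skips the "dd." part and keeps the five characters "mm.yy".
    awk -F '[ \t/]+' '{ print $5, substr($7, 4, 5) }'
}

printf '%s\n' 'c0! *4693 772/080/004/121 30.06.14 08:54 A-00 ' | parse_stamp
# prints: 004 06.14
```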
 
Old 02-27-2015, 03:28 AM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
No way you can do that without some logic capabilities - awk or perl or similar.
You need to find the line with device/date and capture that data - and set a flag. Then go look for "Settled" on subsequent records. Maybe you could use this as a start (lots of presumptions and no manipulation on my part)
Code:
awk -F "[[:space:]/]+" '$5 > 0 && $5 < 7 {a[$5][$7]++} END{for (i in a) {for (j in a[i]) print i,j,"Total files ",a[i][j]}}' files.*
 
Old 02-27-2015, 06:34 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Got bored while watching the soccer - how about something like this (needs gawk 4 or later for the a[i][j] arrays of arrays)
Code:
awk -F "[[:space:]/]+" '/:/ && $5 > 0 && $5 < 7 {sv1=$5 ; sv2 = substr($7,4,5) ; a[sv1][sv2]++ } ; /Settled/ {set[sv1][sv2]++} END{for (i in a) {for (j in a[i]) print i,j,"Total files: ",a[i][j]"\n"i,j,"Settled: "(set[i][j]+0)}}' files.*
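
For anyone without gawk, or with the files spread over subdirectories, the same idea can be sketched in POSIX sh and awk. This is not a drop-in replacement, just one way it might look: the function name `count_receipts` is made up, and it assumes the datestamp line always has a three-digit device as the fifth field and a dd.mm.yy date as the seventh, as in the sample posted above.

```shell
#!/bin/sh
# Hypothetical helper: $1 = top-level directory holding the receipt files.
count_receipts() {
    # find feeds one path per line to a single awk process, which reads
    # each file itself so that every file is counted exactly once.
    find "$1" -type f | awk '
    {
        file = $0
        dev = ""; mon = ""; settled = 0
        while ((getline line < file) > 0) {
            n = split(line, f, "[ \t/]+")
            # Assumed datestamp layout: device in field 5, dd.mm.yy in field 7.
            if (n >= 7 && f[5] ~ /^[0-9][0-9][0-9]$/ &&
                f[7] ~ /^[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/) {
                dev = f[5]
                mon = substr(f[7], 4, 5)
            }
            if (line ~ /Settled/) settled = 1
        }
        close(file)
        if (dev != "") {
            total[dev SUBSEP mon]++
            if (settled) nset[dev SUBSEP mon]++
        }
    }
    END {
        for (k in total) {
            split(k, p, SUBSEP)
            printf "%s, %s - Settled: %d\n", p[1], p[2], nset[k]
            printf "%s, %s - Total files: %d\n", p[1], p[2], total[k]
        }
    }' | LC_ALL=C sort
}
```

Run it as `count_receipts /path/to/receipts`. Resetting `dev`, `mon` and `settled` for each file means a file with no datestamp line is skipped rather than picking up values left over from the previous file.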
 
Old 02-27-2015, 07:54 AM   #9
cave_dweller
LQ Newbie
 
Registered: Feb 2015
Location: South Wales
Distribution: Various
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00
Got bored while watching the soccer - how about something like this
Code:
awk -F "[[:space:]/]+" '/:/ && $5 > 0 && $5 < 7 {sv1=$5 ; sv2 = substr($7,4,5) ; a[sv1][sv2]++ } ; /Settled/ {set[sv1][sv2]++} END{for (i in a) {for (j in a[i]) print i,j,"Total files: ",a[i][j]"\n"i,j,"Settled: "(set[i][j]+0)}}' files.*
Wow. It's gonna take a while to grok that, but many thanks! (I'm gonna assume the soccer wasn't great!).

I'll give it a whirl on some sample data and report back.
 
  

