LinuxQuestions.org
Old 03-23-2010, 12:15 PM   #1
elfoozo
Member
 
Registered: Feb 2004
Location: Washington, USA
Distribution: Debian
Posts: 265

Rep: Reputation: 32
Parsing a file?


I'm thinking I want to write something that makes a single pass over /var/log/mail.log, which is usually under 150 MB, does pattern matching, and appends the output to separate new *.txt files.

To me, the "best" way is the fastest. Right now I'm thinking grep, but I'm looking for suggestions on better ways to tackle this. For example, is a single programmatic pass over the file better than making multiple passes, one for each discrete element?
 
Old 03-23-2010, 12:35 PM   #2
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 167
Grep is fine if you're pulling a couple of distinct elements and want to discard the rest; otherwise I'd say you're better off moving to Perl.
 
1 member found this post helpful.
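For example, the grep route for a couple of distinct elements might look like this. The Postfix-style sample lines and the "status=" patterns are invented stand-ins for whatever your real mail.log contains:

```shell
# Invented sample in mail.log's general shape (not your real format):
cat > mail.log <<'EOF'
Mar 23 12:00:01 mx postfix/smtp[101]: to=<alice@example.com>, status=sent (250 ok)
Mar 23 12:00:02 mx postfix/smtp[102]: to=<bob@example.org>, status=bounced (550 unknown user)
Mar 23 12:00:03 mx postfix/smtp[103]: to=<carol@example.com>, status=sent (250 ok)
EOF

# One grep per distinct element, each writing its own *.txt file:
grep 'status=sent'    mail.log > sent.txt
grep 'status=bounced' mail.log > bounced.txt
```

Note that each grep is a separate full scan of the file.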
Old 03-23-2010, 12:40 PM   #3
winni
LQ Newbie
 
Registered: Jul 2005
Location: Olching, Germany (near Munich)
Distribution: Novell/SuSE 10.2
Posts: 3

Rep: Reputation: 1
parsing log files

Quote:
Originally Posted by elfoozo
I'm thinking I want to write something that makes a single pass over /var/log/mail.log, which is usually under 150 MB, does pattern matching, and appends the output to separate new *.txt files.

To me, the "best" way is the fastest. Right now I'm thinking grep, but I'm looking for suggestions on better ways to tackle this. For example, is a single programmatic pass over the file better than making multiple passes, one for each discrete element?
Hi,

I would suggest using Perl. Perl is very fast and powerful at parsing lines, e.g. to split out and reformat lines to make them more readable for humans.

Hope this helps,
Winfried
 
1 member found this post helpful.
Old 03-23-2010, 01:01 PM   #4
elfoozo
Member
 
Registered: Feb 2004
Location: Washington, USA
Distribution: Debian
Posts: 265

Original Poster
Rep: Reputation: 32
OK, two recommendations for Perl so far; I can live with that. Is doing a single pass best practice? Or should I be sweeping the file multiple times to glean the criteria I am after? Or should I be creating TMP files with subsets of data and assembling those into a finished file? I know each can be done; I'm just wondering what is proper etiquette for future changes and maintainability.
 
Old 03-23-2010, 01:01 PM   #5
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
grep should work fine; if you require something more complex you can go for Perl, but grep is probably faster. A single pass will be much faster.

Last edited by H_TeXMeX_H; 03-23-2010 at 01:02 PM.
 
1 member found this post helpful.
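As a sketch of the single-pass point: awk can route each line to the right output file as it is read, so the log is scanned only once. The sample lines and patterns below are invented:

```shell
# Invented Postfix-style sample data:
cat > mail.log <<'EOF'
Mar 23 12:00:01 mx postfix/smtp[101]: to=<alice@example.com>, status=sent (250 ok)
Mar 23 12:00:02 mx postfix/smtp[102]: to=<bob@example.org>, status=bounced (550 unknown user)
Mar 23 12:00:03 mx postfix/smtp[103]: to=<carol@example.com>, status=sent (250 ok)
EOF

# One read of the log; each line is appended to whichever file its pattern selects.
awk '/status=sent/    { print > "sent.txt" }
     /status=bounced/ { print > "bounced.txt" }' mail.log
```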
Old 03-23-2010, 02:06 PM   #6
SaintDanBert
Senior Member
 
Registered: Jan 2009
Location: "North Shore" Louisiana USA
Distribution: Mint-20.1 with Cinnamon
Posts: 1,771
Blog Entries: 3

Rep: Reputation: 108
Quote:
Originally Posted by elfoozo
I'm thinking I want to write something that makes a single pass over /var/log/mail.log, which is usually under 150 MB, does pattern matching, and appends the output to separate new *.txt files.
Are you wanting to grab specific lines or fragments associated with an individual message: start...this ... that ... end ... and so on?

Are you wanting to see what exists in your file right now or do you want a report on whatever might be in the entire file?

grep might work well for a one-time grab of lines matching some pattern.

If you want to scan this log file routinely and slice and dice its content into a periodic report, I'd suggest sed and awk for the totally-geek effort. Perl might offer a more satisfying implementation for a manage-my-workstation application.

Cheers,
~~~ 8d;-Dan
 
1 member found this post helpful.
Old 03-23-2010, 02:19 PM   #7
elfoozo
Member
 
Registered: Feb 2004
Location: Washington, USA
Distribution: Debian
Posts: 265

Original Poster
Rep: Reputation: 32
I haven't fully defined in my head all the criteria I'm going to parse yet, but the idea is to produce a collection of files that are less overwhelming to consume than the raw mail log. One file might contain all items blocked from a domain. Another might contain all items accepted for one domain or a user. They will also have totals... stuff like that.
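One sketch of how per-category files plus totals could fall out of a single awk pass; the "status=" patterns are invented stand-ins for whatever the MTA actually logs as blocked/accepted:

```shell
# Invented Postfix-style sample data:
cat > mail.log <<'EOF'
Mar 23 12:00:01 mx postfix/smtp[101]: to=<alice@example.com>, status=sent (250 ok)
Mar 23 12:00:02 mx postfix/smtp[102]: to=<bob@example.org>, status=bounced (550 unknown user)
Mar 23 12:00:03 mx postfix/smtp[103]: to=<carol@example.com>, status=sent (250 ok)
EOF

# Split into per-category files and keep running totals, printed once at the end.
awk '/status=sent/    { sent++;    print > "accepted.txt" }
     /status=bounced/ { bounced++; print > "blocked.txt" }
     END { printf "accepted=%d blocked=%d\n", sent, bounced }' mail.log
# prints: accepted=2 blocked=1
```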
 
Old 03-23-2010, 02:21 PM   #8
SaintDanBert
Senior Member
 
Registered: Jan 2009
Location: "North Shore" Louisiana USA
Distribution: Mint-20.1 with Cinnamon
Posts: 1,771
Blog Entries: 3

Rep: Reputation: 108
Quote:
Originally Posted by elfoozo
...
should I be sweeping the file multiple times to glean the criteria I am after? Or should I be creating TMP files with subsets of data and assembling those ...
We might be able to make more effective recommendations if we had an overview of what you want to accomplish. There might be an existing application [email has been around so long that it is hard to imagine something doesn't already do almost everything] that someone can tell you about.
  • Humans want and pay for benefits
  • Features deliver benefits
  • Components implement features
  • Applications are collections of required and optional components

What is the primary benefit of what you are wanting to accomplish?
I want to create a raw text (*.txt) file that contains XXXX and deliver that file to each of my end-users.

Which application features are needed to deliver that benefit?
I need a per-end-user file of raw text (*.txt) of their mail.log entries.

Which implementation details make each application feature possible?
I need to scan /var/log/mail.log and select XXX for each end-user.

How do I best implement those details?
(You might need to define "best" first, but ...) Perl [my opinion] will make it straightforward to
  • open a data file /var/log/mail.log
  • setup for each end-user
  • open a results file, mumble.txt
  • gather details for each end-user
  • close a results file
  • wrap-up this end-user and prepare for the next
  • reset data file for next end user
  • close the data file
 
1 member found this post helpful.
Old 03-23-2010, 02:39 PM   #9
elfoozo
Member
 
Registered: Feb 2004
Location: Washington, USA
Distribution: Debian
Posts: 265

Original Poster
Rep: Reputation: 32
SaintDanBert, I hear what you're saying, all excellent points. Problem is, I don't know what I don't know (yet).

From my perspective, the recommendations that have come so far have been helpful because without much scope I didn't "hear use C++", etc. Granted that might be what I need to do later on but I feel I have more than what I came to the forum with; a path to explore. From here I can begin to formulate better questions as I uncover what I could want to deliver.
 
Old 03-23-2010, 07:48 PM   #10
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by elfoozo
SaintDanBert, I hear what you're saying, all excellent points. Problem is, I don't know what I don't know (yet).

From my perspective, the recommendations that have come so far have been helpful because without much scope I didn't "hear use C++", etc. Granted that might be what I need to do later on but I feel I have more than what I came to the forum with; a path to explore. From here I can begin to formulate better questions as I uncover what I could want to deliver.
If you have a big file, use grep+awk: grep for its fast pattern-searching algorithm, awk for processing/manipulating the text. Otherwise, awk alone is enough for file processing. If you want to search a big file from the end, use tail.
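A sketch of that grep+awk split (grep prunes the big file quickly, awk reworks only the survivors), plus tail for looking only at the end. The sample lines and the field layout are invented:

```shell
# Invented Postfix-style sample data:
cat > mail.log <<'EOF'
Mar 23 12:00:01 mx postfix/smtp[101]: to=<alice@example.com>, status=sent (250 ok)
Mar 23 12:00:02 mx postfix/smtp[102]: to=<bob@example.org>, status=bounced (550 unknown user)
Mar 23 12:00:03 mx postfix/smtp[103]: to=<carol@example.com>, status=sent (250 ok)
EOF

# grep narrows the file fast; awk splits each surviving line on the assumed
# "to=<" and ">," delimiters to extract just the recipient address.
grep 'status=bounced' mail.log | awk -F'to=<|>,' '{ print $2 }' > bounced-addrs.txt

# To inspect only recent activity, seed the pipe with tail instead:
tail -n 100 mail.log | grep -c 'status=bounced'
# prints: 1
```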
 
Old 03-23-2010, 07:49 PM   #11
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by winni
Hi,

I would suggest using Perl. Perl is very fast and powerful at parsing lines, e.g. to split out and reformat lines to make them more readable for humans.

Hope this helps,
Winfried
To add to this list: so are Python, awk, grep, etc.
 
Old 03-24-2010, 01:10 PM   #12
SaintDanBert
Senior Member
 
Registered: Jan 2009
Location: "North Shore" Louisiana USA
Distribution: Mint-20.1 with Cinnamon
Posts: 1,771
Blog Entries: 3

Rep: Reputation: 108
"... apps always have a database whether they planned for it or not ..."

Quote:
Originally Posted by elfoozo
...
I feel I have more than what I came to the forum with; a path to explore. From here I can begin to formulate better questions as I uncover what I could want to deliver.
If you want to create reports or similar from the raw log file, you might load the raw data into a "database" of some sort. This could work really well if you want to look at the details historically: "Show me all rejects from May of ought-four". I put the word "database" in quotes because you don't need the heavy weight of a MySQL or Postgres to gain the benefits that database management software might offer to a data-mining problem.

** read your log
** post to your "database"
** update your data index files
** search your "database" for details that match your current desire
** format a text file "report" based on your search results

(grinning) We'll always have more suggestions than you want to use,
~~~ 0;-Dan
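As one hypothetical example of such a lightweight "database", those steps could use sqlite3 (a serverless, single-file database; a reasonably recent CLI is assumed). The sample lines are invented, and the month is taken from syslog's first field:

```shell
# Invented sample spanning two months:
cat > mail.log <<'EOF'
Mar 23 12:00:01 mx postfix/smtp[101]: to=<alice@example.com>, status=sent (250 ok)
May 04 09:10:11 mx postfix/smtp[102]: to=<bob@example.org>, status=bounced (550 unknown user)
EOF

# Read the log and post it to the "database": one row per line, keyed by month.
awk -v OFS='\t' '{ print $1, $0 }' mail.log > load.tsv
sqlite3 mail.db <<'EOF'
CREATE TABLE IF NOT EXISTS log(month TEXT, line TEXT);
.mode tabs
.import load.tsv log
EOF

# Search the "database" and format a text-file report: all rejects from May.
sqlite3 mail.db \
  "SELECT line FROM log WHERE month='May' AND line LIKE '%bounced%';" > may-rejects.txt
```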
 
Old 03-24-2010, 09:17 PM   #13
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Definitely sounds like a job for Perl; pattern matching and text mangling are what it's really good at. You can probably do it in one pass as well, since you can open and close files on demand for each type of output.
Really, whether one pass or more is required depends on what kind of output you decide you want, e.g. overlapping record sets, i.e. records that satisfy more than one criterion.
Just FYI, Perl is 'compiled on the fly', so it's pretty swift:
http://www.perl.com/doc/FMTEYEWTK/comp-vs-interp.html
 
Old 03-26-2010, 03:25 PM   #14
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 167
Based on what you're saying, you're basically looking for informational reports generated from the mail logs on your system. I definitely suggest Perl for this. If not Perl, then awk... but really, Perl is the easiest route here and the most expandable for future use with the least difficulty, unless you're already an awk expert.
 
  

