How to parse text file to a set text column width and output to new text file?

jsstevenson · 04-05-2008, 08:29 AM

Hello...

I have a text file split into paragraphs. This is fine when viewing the text file in a program that has word wrap capabilities; however, I am using a program called QCAD to produce architectural drawings and find that its text formatting (simply in terms of text column width) is very primitive.

What I would like to do is take a text file containing paragraphs, parse through the text and introduce a carriage return every x (say 80) characters, then output the result to a new text file formatted as required. The operation would be delimited by paragraphs and would be applied to each paragraph in turn.

Unlike various scripts I have 'messed' about with, I want the text to remain left-aligned.

Any help, comments and/or pointers would be very much appreciated.

Thanks
Jamie.

raskin · 04-05-2008, 09:04 AM

So, what exactly does fmt do wrong for you?

jsstevenson · 04-05-2008, 09:59 AM

Thank you.

fmt was the hint I needed - it does indeed do what I asked for, and on looking at the documentation for it I found myself pointed to another utility (fold) that produces the exact output I was looking for.

So easy when someone knows how!!!

Thanks again.
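
For anyone landing here later, a minimal sketch of the fold approach mentioned above, assuming a POSIX shell with fold available. The function name wrap_paragraphs and the sample text are made up for illustration; -w sets the column width and -s breaks at spaces so words stay whole.

```shell
# wrap_paragraphs is a hypothetical helper name, not something from the thread.
# Reads text on stdin and breaks every line longer than WIDTH columns
# (default 80). -s makes fold break at the last space that fits, so words
# are not cut in half; blank lines between paragraphs pass through as-is.
wrap_paragraphs() {
    fold -s -w "${1:-80}"
}

# Demo on made-up sample text, using a narrow width so the effect is visible.
printf 'The quick brown fox jumps over the lazy dog\n\nSecond paragraph here\n' |
    wrap_paragraphs 20
```

On a real file that would be something like wrap_paragraphs 80 < input.txt > output.txt (both file names are placeholders). fmt -w 80 is the alternative: unlike fold, it also joins short lines within a paragraph before re-wrapping, which may or may not be what you want.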

KentMan · 04-21-2008, 02:36 AM

Hi Folks,

I got really excited when I saw the title of this topic ... and then realised that although it sounded like what I was looking for, sadly it was not. However ... it's just possible you guys can save my sanity !!

In my team we receive dozens of files every week. Generally speaking they are comma delimited files (sometimes with a double inverted comma text qualifier). The application that we map them in to isn't very smart ... so ... What I'm looking for (and can't seem to find anywhere) is a utility that will

a) let me give it the filename and what the delimiter is, and text qualifier if there is one.
b) the utility will then scan through the file using the parameters I gave it.
c) finally, the utility will output the number of columns found and (here's the real target) the maximum width of each column found.

This would make my life sooooo much easier!!

This seems to be such a basic task, and I'm sure I'm not the first person to want a tool like this. I spent literally 2 hours on Google yesterday looking for a tool or command line app (Windows, DOS or Unix). All I could find was a program called UltraEdit that hinted that maybe it could do this. Alas, even if it can, my employer is not going to spend $$ on an editor simply to find column widths.

So here I am, rocking backwards and forwards in the corner, mumbling to myself " ..why is this so hard ..?? .."

please help.

matthewg42 · 04-21-2008, 11:40 AM

KentMan, that's the sort of thing you probably won't find a standalone program for, because it's easy to cobble together with the standard unix command line tools.

Awk would be a good tool for this sort of thing, and is available on almost any modern Linux distro you care to name, not to mention lots of other unix-like OSes.

Some questions:

When you say text qualifier, what do you mean?
What should the output look like? (If you can provide a short example input and the expected output that would be perfect).

KentMan · 04-21-2008, 12:21 PM

Thanks for your quick reply. Here are the answers to your questions.

# When you say text qualifier, what do you mean?

A column that contains text values will often have its value surrounded by inverted commas.

4 row Example

"CODE1","KentMan",200
"CODE2","KentMan hates text files",2000
"CODE3","KentMan likes bacon",201
"CODE4","KentMan wants beer",45000

The entry in the last column does not have "" around it because it is a numeric value. The others are text values and so have text qualifiers around them.

A text qualifier is often used where the value can also contain the delimiter (real pain when this happens). So for example, if the value in column 2 of my example was actually "Kent,Man", the text qualifier tells the import routine to ignore any delimiters found between the quotes. If the person putting the file together had any sense, however, the file would be tab or pipe delimited to avoid this issue (ideal world scenario I know).
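
To make the problem concrete, here is a small sketch (assuming any POSIX awk; the function name count_naive_fields is invented for illustration) of what happens when a line is split on the delimiter alone, ignoring the text qualifier:

```shell
# count_naive_fields is a hypothetical name used only for this illustration.
# It splits each input line on every comma, ignoring text qualifiers,
# and prints how many fields it found.
count_naive_fields() {
    awk -F',' '{ print NF }'
}

# The row has 3 real columns, but the comma inside "Kent,Man" is counted too.
printf '%s\n' '"CODE1","Kent,Man",200' | count_naive_fields    # prints 4
```

So a tool for this job has to track whether it is currently inside a pair of qualifiers before treating a delimiter as a column break.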

# What should the output look like? (If you can provide a short example input and the expected output that would be perfect).

Ideal output would look like this (using the example above as the input)

--------------
1,5
2,24
3,5

3 columns found
4 rows tested
--------------

The first number is the column position, and the second number is the maximum width found. Then come two lines confirming how many columns it found and how many rows it tested (these are for reassurance). The process should be able to handle a few thousand rows. If the file was bigger than, say, 5k rows I would use something like the Unix head command to extract a sizeable sample from the top of the file for testing.

I hope I explained ok, please get back to me if you have more questions.

KM

osor · 04-21-2008, 04:28 PM

Quote:

Originally Posted by KentMan

Ideal output would look like this (using the example above as the input)

--------------
1,5
2,24
3,5

3 columns found
4 rows tested
--------------

I think I get where the 24 comes from, but I don’t understand why the other two lines have 5. Shouldn’t it be 7 for row 1 and 19 for row 3? Or did I miss something?

raskin · 04-21-2008, 04:31 PM

Osor: yes. He is talking about columns, not lines.

matthewg42 · 04-21-2008, 04:58 PM

Try this:

Code:

#!/usr/bin/awk -f

BEGIN {
        DELIM=",";
        TXTQUAL="\"";
        MAXCOL=0;
}

{
        # split fields, removing text qualifier and handling delimiter characters within
        # text qualifiers properly
        split("", result);      # clear the array result;
        in_qual = 0;
        field = 0;
        for(i=1; i<=length($0); i++) {
                c = substr($0, i, 1);
                if (c==TXTQUAL) {
                        in_qual = !in_qual;
                }
                else {
                        if (c==DELIM && ! in_qual) { field++; }
                        else { result[field] = result[field] c; }
                }
        }

        if (MAXCOL < (field+1)) {
                MAXCOL = field+1;
        }

        for(i=0; i<=field; i++) {
                if (FLDMAX[i] < length(result[i])) { FLDMAX[i] = length(result[i]); }
        }
}

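END {
        for(i=0; i<MAXCOL; i++)
        {
                # +0 prints 0 (not an empty string) for a column that is empty in every row
                print (i+1) "," (FLDMAX[i]+0);
        }
        print "";
        print MAXCOL " columns found";
        # NR counts rows across all input files; FNR would count only the last file
        print NR " rows tested";
}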

There may well be a lot more elegant ways to handle this in awk. I'm rusty.

Save it to a file and set the permissions to 755. For example, if the file is called mytest.awk and is in the same directory as the data file, called test_data, you would do this ($ is the prompt):

Code:

$ cd /where/your/files/are/located
$ chmod 755 mytest.awk
$ ./mytest.awk test_data
1,5
2,24
3,5

3 columns found
4 rows tested

If you intend to use the script a lot, for lots of different files where the delimiter and text qualifier might change, you should probably implement some sort of command line option handling to allow these to be set from the command line rather than editing the script all the time. This should be enough to get you started anyhow. You won't get blistering performance out of it, but it should work OK.
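
One possible way to do the option handling without editing the script each time is awk's -v option, which assigns a variable before the program starts. The sketch below is an assumption about how that could look, not a drop-in change to the script above: colwidths is a made-up wrapper name, it reads from stdin, and the parsing loop is a condensed copy of the same logic. Note that the BEGIN block above assigns DELIM and TXTQUAL unconditionally, which would overwrite anything passed with -v, so the sketch leaves that block out.

```shell
# colwidths is a hypothetical wrapper; usage: colwidths [DELIM] [TXTQUAL] < file
# DELIM defaults to a comma and TXTQUAL to a double quote.
colwidths() {
    delim=','
    qual='"'
    if [ "$#" -ge 1 ]; then delim=$1; fi
    if [ "$#" -ge 2 ]; then qual=$2; fi

    awk -v DELIM="$delim" -v TXTQUAL="$qual" '
    {
        # Walk the line one character at a time, as in the script above.
        split("", result)          # clear the per-line field array
        in_qual = 0
        field = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == TXTQUAL) {
                in_qual = !in_qual
            } else if (c == DELIM && !in_qual) {
                field++
            } else {
                result[field] = result[field] c
            }
        }
        if (field + 1 > MAXCOL) { MAXCOL = field + 1 }
        for (i = 0; i <= field; i++) {
            if (length(result[i]) > FLDMAX[i]) { FLDMAX[i] = length(result[i]) }
        }
    }
    END {
        for (i = 0; i < MAXCOL; i++) { printf "%d,%d\n", i + 1, FLDMAX[i] }
        print ""
        printf "%d columns found\n", MAXCOL
        printf "%d rows tested\n", NR
    }'
}

# Demo with the four sample rows from the thread.
printf '%s\n' \
    '"CODE1","KentMan",200' \
    '"CODE2","KentMan hates text files",2000' \
    '"CODE3","KentMan likes bacon",201' \
    '"CODE4","KentMan wants beer",45000' |
    colwidths ',' '"'
```

The demo should reproduce the 1,5 / 2,24 / 3,5 figures from the sample run. For a pipe-delimited file with single-quote qualifiers the call would be colwidths '|' "'" < test_data, and a tab delimiter can be passed as '\t' since awk expands escape sequences in -v values.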

osor · 04-21-2008, 09:23 PM

Doing CSV processing manually in awk is pretty hard (especially if you want to have commas inside the fields or use escape sequences). It is easier to use something prewritten. For example, if you are fortunate enough to have access to Perl on your system, you may use the Text::CSV module. Here’s an example:

Code:

#!/usr/bin/perl -l

use strict;
use Text::CSV;

my @max;
my $csv = Text::CSV->new();

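while(<>) {
	chomp;                              # -l only strips newlines together with -n or -p
	next unless $csv->parse($_);        # skip lines that fail to parse
	my @width = map { length } $csv->fields();
	# keep the larger of the stored maximum and this row's width, per column
	$max[$_] > $width[$_] or $max[$_] = $width[$_] for 0..$#width;
}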

print "$_,$max[$_-1]" for 1..@max;
print "\n", scalar(@max), " columns found";
print "$. rows tested";

Simply save as a file with executable bit set, and then pass the file you desire as a command-line parameter or as stdin (same as the example above this).

matthewg42 · 04-22-2008, 02:38 AM

Quote:

Originally Posted by osor

Doing CSV processing manually in awk is pretty hard (especially if you want to have commas inside the fields or use escape sequences).

It's not hard at all if you don't worry about text qualifiers.

Quote:

It is easier to use something prewritten. For example, if you are fortunate enough to have access to Perl on your system, you may use the Text::CSV module.

When I came to implement the example above, I wished that I had mentioned perl in my earlier response, but it's really not that hard in awk. It's true that expanding it so that it handles escaped quotes within quotes is again awkward (haha, awkward - get it?)...

However, even though perl itself is pretty much ubiquitous on modern Linux distros, the standard set of modules is a little limited. If one has to rely on software which is already installed on the system, awk has the advantage of being more uniformly available, although the GNU version is not as common on commercial unixes as one might hope.

Having said all that, I choose perl nine times out of ten over awk, because it's generally faster, and I am better practised with it.
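
On the escaped-quotes point: one common convention (the one RFC 4180 describes) is to write a literal quote inside a quoted field as two quotes in a row. Here is a minimal sketch of how the character loop could handle that, assuming this doubling convention rather than backslash escapes. csv_fields is a made-up name, and it simply prints each field on its own numbered line.

```shell
# csv_fields is a hypothetical helper: reads CSV lines on stdin and prints
# every field as "N: value", treating a doubled qualifier inside a quoted
# field as one literal quote character.
csv_fields() {
    awk -v DELIM=',' -v TXTQUAL='"' '
    {
        n = 1
        in_qual = 0
        field = ""
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == TXTQUAL) {
                if (in_qual && substr($0, i + 1, 1) == TXTQUAL) {
                    # doubled qualifier inside a quoted field: keep one, skip one
                    field = field TXTQUAL
                    i++
                } else {
                    in_qual = !in_qual
                }
            } else if (c == DELIM && !in_qual) {
                print n ": " field
                n++
                field = ""
            } else {
                field = field c
            }
        }
        print n ": " field      # last field on the line
    }'
}

# Demo row: column 2 contains both escaped quotes and a comma.
printf '%s\n' '"CODE1","He said ""hi"", ok",200' | csv_fields
```

For the demo row this gives three fields, the second being He said "hi", ok. It only looks one character ahead within the current line, so quoted fields that span several lines are still not handled.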

syg00 · 04-22-2008, 02:45 AM

I continue to be amazed with what some people do with (and to) the *nix toolset - especially perl.

KentMan · 04-23-2008, 02:36 PM

Well .. what can I say ... incredible.

Unfortunately I don't have access to Perl, but I do have gawk.

Matthewg42, your script is exactly the tool I have been looking for. Not only that, it rattles through 5000 rows in a blink, and I have yet to test it with a file it gets wrong. A method of passing the delimiters would be the next logical step, but I'm not gonna push my luck and ask you to code it. I just wanted to drop you a note to let you know how much I appreciate your work. Yours too osor, but unfortunately perl is not an option for me.

The rubber room will have to wait for me, my sanity is recovered !!

Many many thanks !!

KM