LinuxQuestions.org
Old 04-05-2008, 08:29 AM   #1
jsstevenson
LQ Newbie
 
Registered: Jul 2006
Location: Scotland
Distribution: Gentoo
Posts: 21

Rep: Reputation: 0
How to parse text file to a set text column width and output to new text file?


Hello...

I have a text file split into paragraphs. This is fine when viewing the text file in a program that has word wrap capabilities. However, I am using a program called QCAD to produce architectural drawings, and its text formatting (simply in terms of text column width) is very primitive.

What I would like to be able to do is take a text file containing paragraphs and to parse through the text and introduce a carriage return every x (say 80) characters which would then be output to a new text file formatted as required. The operation would be delimited by paragraphs and obviously would be applied to each paragraph one after the other.

Unlike various scripts I have 'messed' about with, I want the text file to remain left aligned.

Any help, comments and/or pointers would be very much appreciated.

Thanks
Jamie.

Last edited by jsstevenson; 04-05-2008 at 08:50 AM.
 
Old 04-05-2008, 09:04 AM   #2
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
So, what exactly does fmt do wrong for you?
 
Old 04-05-2008, 09:59 AM   #3
jsstevenson
LQ Newbie
 
Registered: Jul 2006
Location: Scotland
Distribution: Gentoo
Posts: 21

Original Poster
Rep: Reputation: 0
Solved

Thank you.

fmt was the hint I needed. It does indeed do what I asked for, and on looking at the documentation related to it I found myself pointed to another utility (fold) that produces the exact output I was looking for.

So easy when someone knows how!!!

Thanks again.
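
For anyone landing here with the same question, here is a minimal sketch of the two tools mentioned above. The sample text and the function names wrap_hard and wrap_fill are made up for illustration. fold is specified by POSIX; fmt is a common utility (GNU coreutils, BSD) but may not be present everywhere.

```shell
#!/bin/sh
# wrap_hard: break each input line so no output line exceeds the given width.
# -s breaks at the last blank that fits instead of cutting words in half.
# Lines are handled independently, so blank lines between paragraphs survive
# and the text stays left aligned.
wrap_hard() {
    fold -s -w "${1:-80}"
}

# wrap_fill: re-flow each paragraph to the given width (also joins short lines).
wrap_fill() {
    fmt -w "${1:-80}"
}

# Hypothetical sample: two paragraphs, one long line each.
sample='The quick brown fox jumps over the lazy dog and keeps on running.

A second paragraph that is also longer than the chosen width.'

printf '%s\n' "$sample" | wrap_hard 30
```

Typical use would be something like `wrap_hard 80 < notes.txt > notes_wrapped.txt`, where both file names are placeholders.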
 
Old 04-21-2008, 02:36 AM   #4
KentMan
LQ Newbie
 
Registered: Apr 2008
Posts: 3

Rep: Reputation: 0
light at the end of the tunnel?

Hi Folks,

I got really excited when I saw the title of this topic ... and then realised that although it sounded like what I was looking for, sadly it was not. However ... it's just possible you guys can save my sanity !!

In my team we receive dozens of files every week. Generally speaking they are comma delimited files (sometimes with a double inverted comma text qualifier). The application that we map them in to isn't very smart ... so ... What I'm looking for (and can't seem to find anywhere) is a utility that will:

a) let me give it the filename and what the delimiter is, and text qualifier if there is one.
b) the utility will then scan through the file using the parameters I gave it.
c) finally, the utility will output the number of columns found and (here's the real target) the maximum width of each column found.

This would make my life sooooo much easier!!

This seems to be such a basic task, and I'm sure I'm not the first person to want a tool like this. I spent literally 2 hours on Google yesterday looking for a tool or command line app (Windows, DOS or Unix). All I could find was a program called UltraEdit that hinted that maybe it could do this. Alas, even if it can, my employer is not going to spend $$ on an editor simply to find column widths.

So here I am, rocking backwards and forwards in the corner, mumbling to myself " ..why is this so hard ..?? .."

please help.
 
Old 04-21-2008, 11:40 AM   #5
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
KentMan, that's the sort of thing you probably won't find a stand alone program for because it's easy to cobble together with the standard unix command line tools.

Awk would be a good tool for this sort of thing, and is available on almost any modern Linux distro you care to name, not to mention lots of other unix-like OSes.

Some questions:
  1. When you say text qualifier, what do you mean?
  2. What should the output look like? (If you can provide a short example input and the expected output that would be perfect).

Last edited by matthewg42; 04-21-2008 at 03:29 PM.
 
Old 04-21-2008, 12:21 PM   #6
KentMan
LQ Newbie
 
Registered: Apr 2008
Posts: 3

Rep: Reputation: 0
Thanks for your quick reply. Here are the answers to your questions.

# When you say text qualifier,what do you mean?

A column that contains text values will often have its value surrounded by inverted commas.

4 row Example

"CODE1","KentMan",200
"CODE2","KentMan hates text files",2000
"CODE3","KentMan likes bacon",201
"CODE4","KentMan wants beer",45000

the entry in the last column does not have "" around it because it is a numeric value. The others are text values and so have text qualifiers around them.

A text qualifier is often used where the value can also contain the delimiter (a real pain when this happens). For example, if the value in column 2 of my example was actually "Kent,Man", the text qualifier tells the import routine to ignore any delimiters found between the quotes. If the person putting the file together had any sense, however, the file would be tab or pipe delimited to avoid this issue (ideal world scenario, I know).
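
To see why the qualifier matters, here is a quick sketch showing that splitting on commas alone miscounts the columns once a quoted value contains a comma. The sample row is the "Kent,Man" variant described above; the variable names are only for illustration.

```shell
#!/bin/sh
# The example row, with a delimiter inside the quoted second column.
line='"CODE1","Kent,Man",200'

# Naive split: awk -F, treats every comma as a field separator,
# including the one inside the quotes, and NF is the resulting field count.
naive_count=$(printf '%s\n' "$line" | awk -F, '{ print NF }')

echo "naive field count: $naive_count (the row really has 3 columns)"
```

Any tool that measures column widths therefore has to track whether it is currently inside a pair of qualifiers before treating a comma as a column break.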

# What should the output look like? (If you can provide a short example input and the expected output that would be perfect).

Ideal output would look like this (using the example above as the input)

--------------
1,5
2,24
3,5

3 columns found
4 rows tested
--------------

The first number is the column position, and the second number is the maximum width found. Then come two lines confirming how many columns it found and how many rows it tested (these are for reassurance). The process should be able to handle a few thousand rows. If the file was bigger than, say, 5k rows I would use something like the Unix head command to extract a sizeable sample from the top of the file for testing.

I hope I explained ok, please get back to me if you have more questions.

KM

Last edited by KentMan; 04-21-2008 at 12:27 PM.
 
Old 04-21-2008, 04:28 PM   #7
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 69
Quote:
Originally Posted by KentMan
Ideal output would look like this (using the example above as the input)

--------------
1,5
2,24
3,5

3 columns found
4 rows tested
--------------
I think I get where the 24 comes from, but I don’t understand why the other two lines have 5. Shouldn’t it be 7 for row 1 and 19 for row 3? Or did I miss something?
 
Old 04-21-2008, 04:31 PM   #8
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
Osor: yes. He is talking about columns, not lines.
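
To make the per-column reading concrete, the widths can be checked by hand with the shell's `${#var}` length expansion. The values are copied from the four-row example; the helper name width is made up for this sketch.

```shell
#!/bin/sh
# Print the length of a string using the POSIX ${#var} expansion.
width() {
    s=$1
    echo "${#s}"
}

# Column 2 values from the four-row example, qualifiers removed.
width 'KentMan'                    # row 1
width 'KentMan hates text files'   # row 2, the widest, hence "2,24"
width 'KentMan likes bacon'        # row 3
width 'KentMan wants beer'         # row 4

# Column 3: the widest value is 45000, hence "3,5".
width '45000'
```

The 7 and 19 come out when the second column is read row by row; the requested output keeps only the largest width seen in each column.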
 
Old 04-21-2008, 04:58 PM   #9
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
Try this:
Code:
#!/usr/bin/awk -f

BEGIN {
        DELIM=",";
        TXTQUAL="\"";
        MAXCOL=0;
}

{
        # split fields, removing text qualifier and handling delimiter characters within
        # text qualifiers properly
        split("", result);      # clear the array result;
        in_qual = 0;
        field = 0;
        for(i=1; i<=length($0); i++) {
                c = substr($0, i, 1);
                if (c==TXTQUAL) {
                        in_qual = !in_qual;
                }
                else {
                        if (c==DELIM && ! in_qual) { field++; }
                        else { result[field] = result[field] c; }
                }
        }

        if (MAXCOL < (field+1)) {
                MAXCOL = field+1;
        }

        for(i=0; i<=field; i++) {
                if (FLDMAX[i] < length(result[i])) { FLDMAX[i] = length(result[i]); }
        }
}

END {
        for(i=0; i<MAXCOL; i++)
        {
                print (i+1) "," FLDMAX[i];
        }
        print "";
        print MAXCOL " columns found";
        print NR " rows tested";        # NR rather than FNR, so the total is right with several input files
}
There may well be a lot more elegant ways to handle this in awk. I'm rusty.

Save it to a file and set the permission to 755, e.g. if the file is called mytest.awk, and is in the same directory as the data file, called test_data, you would do this ($ is the prompt):
Code:
$ cd /where/your/files/are/located
$ chmod 755 mytest.awk
$ ./mytest.awk test_data
1,5
2,24
3,5

3 columns found
4 rows tested
If you intend to use the script a lot, for lots of different files where the delimiter and text qualifier might change, you should probably implement some sort of command line option handling to allow these to be set from the command line rather than editing the script all the time. This should be enough to get you started anyhow. You won't get blistering performance out of it, but it should work OK.
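
One possible way to do the command line handling is a small shell wrapper that passes the two characters to awk with -v. This is only a sketch: the function name colwidths and its argument order are made up, and the awk body is a lightly reworked version of the splitting logic in the script above rather than a tested replacement for it.

```shell
#!/bin/sh
# colwidths DELIM QUAL [FILE...]
# Hypothetical wrapper: the delimiter and text qualifier come from the
# command line instead of being edited into the script.
# Pass an empty QUAL ('') for files that have no text qualifier.
# awk -v expands escapes, so '\t' can be given for a tab delimiter.
colwidths() {
    delim=$1
    qual=$2
    shift 2
    awk -v delim="$delim" -v qual="$qual" '
    {
        field = 0
        in_qual = 0
        cur = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (qual != "" && c == qual) {
                in_qual = !in_qual          # toggle on every qualifier character
            } else if (c == delim && !in_qual) {
                if (length(cur) > max[field]) max[field] = length(cur)
                field++
                cur = ""
            } else {
                cur = cur c
            }
        }
        # record the last field on the line
        if (length(cur) > max[field]) max[field] = length(cur)
        if (field + 1 > maxcol) maxcol = field + 1
    }
    END {
        for (i = 0; i < maxcol; i++) print (i + 1) "," (max[i] + 0)
        print ""
        print (maxcol + 0) " columns found"
        print NR " rows tested"
    }' "$@"
}

# Demo with the four-row example from earlier in the thread.
printf '%s\n' \
    '"CODE1","KentMan",200' \
    '"CODE2","KentMan hates text files",2000' \
    '"CODE3","KentMan likes bacon",201' \
    '"CODE4","KentMan wants beer",45000' |
    colwidths ',' '"'
```

With no file arguments the function reads standard input, so it can sit at the end of a pipeline; a pipe delimited file without qualifiers would be `colwidths '|' '' somefile`, where somefile is a placeholder.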

Last edited by matthewg42; 04-21-2008 at 05:10 PM. Reason: beauty
 
Old 04-21-2008, 09:23 PM   #10
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 69
Doing CSV processing manually in awk is pretty hard (especially if you want to have commas inside the fields or use escape sequences). It is easier to use something prewritten. For example, if you are fortunate enough to have access to Perl on your system, you may use the Text::CSV module. Here’s an example:
Code:
#!/usr/bin/perl -l

use strict;
use Text::CSV;

my @max;
my $csv = Text::CSV->new();

while(<>) {
	next unless $csv->parse($_);   # skip lines the parser rejects
	my @width = map { length } $csv->fields();
	$max[$_] > $width[$_] or $max[$_] = $width[$_] for 0..$#width;
}

print "$_,$max[$_-1]" for 1..@max;
print "\n", scalar(@max), " columns found";
print "$. rows tested";
Simply save as a file with executable bit set, and then pass the file you desire as a command-line parameter or as stdin (same as the example above this).
 
Old 04-22-2008, 02:38 AM   #11
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
Quote:
Originally Posted by osor
Doing CSV processing manually in awk is pretty hard (especially if you want to have commas inside the fields or use escape sequences).
It's not hard at all if you don't worry about text qualifiers.
Quote:
It is easier to use something prewritten. For example, if you are fortunate enough to have access to Perl on your system, you may use the Text::CSV module.
When I came to implement the example above, I wished that I had mentioned perl in my earlier response, but it's really not that hard in awk. It's true that expanding it so that it handles escaped quotes within quotes is again awkward (haha, awkward - get it?)...
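
For what it's worth, here is a rough sketch of what the escaped-quote handling could look like. It assumes the common CSV convention that a quote inside a quoted field is written as two quotes (""); the thread does not say which escape style the real files use, and the function name csv_fields is made up.

```shell
#!/bin/sh
# csv_fields: print each field of every input line on its own line.
# Assumption (not from the thread): a qualifier inside a quoted field
# is doubled, e.g. "say ""hi""" stands for: say "hi"
csv_fields() {
    awk -v delim=',' -v qual='"' '
    {
        cur = ""
        in_qual = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == qual) {
                if (in_qual && substr($0, i + 1, 1) == qual) {
                    cur = cur qual      # doubled qualifier: keep one, skip the other
                    i++
                } else {
                    in_qual = !in_qual  # opening or closing qualifier
                }
            } else if (c == delim && !in_qual) {
                print cur
                cur = ""
            } else {
                cur = cur c
            }
        }
        print cur                       # last field on the line
    }'
}

printf '%s\n' '"CODE1","KentMan says ""hi"", twice",200' | csv_fields
```

The extra work compared with the simple toggle is the one-character lookahead; backslash-style escapes would need a different branch.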

However, even though perl itself is pretty much ubiquitous on modern Linux distros, the standard set of modules is a little limited. If one has to rely on software which is already installed on the system, awk has the advantage of being more uniformly available, although the GNU version is not as common on commercial unixes as one might hope.

Having said all that, I choose perl nine times out of ten over awk, because it's generally faster, and I am better practised with it.
 
Old 04-22-2008, 02:45 AM   #12
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 11,806

Rep: Reputation: 923
I continue to be amazed with what some people do with (and to) the *nix toolset - especially perl.
 
Old 04-23-2008, 02:36 PM   #13
KentMan
LQ Newbie
 
Registered: Apr 2008
Posts: 3

Rep: Reputation: 0
Well .. what can I say ... incredible.

Unfortunately I don't have access to Perl, but I do have gawk.

Matthewg42, your script is exactly the tool I have been looking for. Not only that, it rattles through 5000 rows in a blink, and I have yet to test it with a file it gets wrong. A method of passing the delimiters would be the next logical step, but I'm not gonna push my luck and ask you to code it. I just wanted to drop you a note to let you know how much I appreciate your work. Yours too osor, but unfortunately perl is not an option for me.

The rubber room will have to wait for me, my sanity is recovered !!

Many many thanks !!

KM

Last edited by KentMan; 04-24-2008 at 12:49 AM.
 
  

