How can I strip white space from the start and end of fields using awk?

jonnymorris · 10-01-2008, 10:46 AM

Hi,

I'm hoping some of you know the answer to this, what is probably quite a trivial thing but seems to be eluding me.

I am developing an awk script to pull specific fields from a csv file, but it also has to remove preceeding and trailing white space from each field.

Here is an example of one record:

Quote:

12345 , , , , , 12, Data Street, Command Deck, Enterprise, Space, 17094

(there are actually many spaces in this, but the forum seems to remove them automatically)

I would like it to be like this:

Quote:

12345,12 Data Street,Command Deck,Enterprise,Space,17094

So far my code strips off ALL of the white space so I end up with 12DataStreet etc. I'm OK with replacing , with a space in the output, that's just basic print stuff.

Code:

BEGIN {FS = ",";
		print "Finding good records..." 
	}

$36 ~ /N/ { 
			gsub(/ +/, "");
			gsub(/"/, "");
			print $1 "," $38 " " $39 "," $40 "," $41 "," $42 "," $43 "," $44 "," $59 "," $55 " " $56 " " $57 > FILENAME "_good.csv"; }

END { print "Finished!  Output sent to '"FILENAME"_good.csv'"}

Something else I seem to be unable to get right is the filename bit - I cannot strip off the .csv part of the original and replace it with my string (because each filename should retain it's uniqueness) without something weird going on - I seem to end up with two output files, one with 0 in front of the filename, the other with 1, 0 contains just one line (what was the heading line in the spreadsheet), 1 contains the data. What would be the correctway to chop up a FILENAME and append my own bit to the end? There are parts of the filename at the beginning that I do not want either, they vary too which is the annoying part.

Any help would be much appreciated!

jan61 · 10-01-2008, 01:15 PM

Moin,

the space problem: Your gsub does not differ between trailing, embedded and leading spaces. There are different ways to do it:

Code:

# first example: replace ", " and " ," with ","
gsub(/, /, ",");
gsub(/ ,/, ",");
# second example: replace leading and trailing space field by field
for (i = 1; i<= NF; i++) {
  gsub(/^ /, "", $i);
  gsub(/ $/, "", $i);
}

To your filename problem: I would create the new filename using a function in awk. You have to use gensub instead of gsub, because gsub modifies the original string and returns only the number of substitutions:

Code:

NR == 1 { new_fname = gensub(/\.csv$/, "", 1, FILENAME) "_good.csv"; }
{ ...
print ... >new_fname;
}

Jan

Mr. C. · 10-02-2008, 12:43 AM

Let awk's regular expression field separator work for you:

Code:

$ cat data    
12345 , , , , , 12 , Data Street,   Command Deck   , Enterprise, Space, 17094 

$ cat scr.awk 
BEGIN {
    FS="[ \t]*,[ \t]*";
    OFS=":";
    i=1;
}

{
  for (i=1; i <= NF; i++) {
      printf "%2d '%s'\n", i, $i;
  }
}

$ awk -f scr.awk data    
 1 '12345'
 2 ''
 3 ''
 4 ''
 5 ''
 6 '12'
 7 'Data Street'
 8 'Command Deck'
 9 'Enterprise'
10 'Space'
11 '17094'

jonnymorris · 10-02-2008, 03:49 AM

Thank you both for your excellent suggestions and help. I used a combination of your ideas:

Code:

BEGIN {FS="[ \t]*,[ \t]*";
		print "Finding good records..." 
	}
NR == 1 { new_fname = gensub(/\.csv$/, "", 1, FILENAME) "_good.csv"; }

$36 ~ /N/ { 
			
		for (i = 1; i<= NF; i++) {
			gsub(/^ /, "", $i);
			gsub(/ $/, "", $i);
			gsub(/"/, "", $i);
			}
			
		print $1 "," $38 " " $39 "," $40 "," $41 "," $42 "," $43 "," $44 "," $59 "," $55 " " $56 " " $57 > new_fname; }

END { print "Finished!  Output sent to '"new_fname"'"}

The FS="[ \t]*,[ \t]*"; part removes the long blank spaces after the first field (a reference number), the for loop does the spaces at start and end of each field (and also removes all occurances of ").

Thanks also for the filename fix! That works rather well.

A few more tweaks and that should be this script finished. I need to catch some records that have extra fields in the middle and adjust the print statement accordingly, nothing too difficult (one hopes). If I get stuck I'll be back!

Thanks again.

jlinkels · 10-02-2008, 07:27 PM

awk is not good for CSV strings. Sooner or later you'll find a string complying with the CSV format which crashes the AWK script.

I have succesfully been using this little c program.

jlinkels

chrism01 · 10-02-2008, 07:44 PM

Or Perl with Text::CSV module

sundialsvcs · 10-02-2008, 10:10 PM

When you use regular-expressions, there are just a couple of features that you need to keep in mind:

The "^" and "$" symbols refer to the start and the end of the string, respectively. So, if you want to remove only leading-blanks, you could look for "start-of-string followed by one-or-more whitespace characters." And so on.
Regular-expression substitution can be specified to occur "globally" (replace all occurrences) or "do it only once."

ghostdog74 · 10-02-2008, 10:52 PM

if you have PHP and your data don't have quotes

Code:

<?php
$file = "file.csv";
$handle = fopen("file","r");
while ( ($data = fgetcsv($handle,4096,",")) !== FALSE){
    $data=array_filter($data);
    $data = array_values($data);
    $data=implode(",",$data);
    echo "data: $data";
}
fclose($handle);

output

Code:

# more file.csv
12345 , , ,, , 12, Data Street, Command Deck  , Enterprise, Space, 17094
# php5 test.php
data: 12345 ,12,Data Street,Command Deck  ,Enterprise,Space,17094