LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 10-01-2008, 10:46 AM   #1
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Rep: Reputation: 0
How can I strip white space from the start and end of fields using awk?


Hi,

I'm hoping some of you know the answer to this, what is probably quite a trivial thing but seems to be eluding me.

I am developing an awk script to pull specific fields from a csv file, but it also has to remove preceeding and trailing white space from each field.

Here is an example of one record:

Quote:
12345 , , , , , 12, Data Street, Command Deck, Enterprise, Space, 17094
(there are actually many spaces in this, but the forum seems to remove them automatically)

I would like it to be like this:

Quote:
12345,12 Data Street,Command Deck,Enterprise,Space,17094

So far my code strips off ALL of the white space so I end up with 12DataStreet etc. I'm OK with replacing , with a space in the output, that's just basic print stuff.

Code:
BEGIN {FS = ",";
		print "Finding good records..." 
	}

$36 ~ /N/ { 
			gsub(/ +/, "");
			gsub(/"/, "");
			print $1 "," $38 " " $39 "," $40 "," $41 "," $42 "," $43 "," $44 "," $59 "," $55 " " $56 " " $57 > FILENAME "_good.csv"; }

END { print "Finished!  Output sent to '"FILENAME"_good.csv'"}
Something else I seem to be unable to get right is the filename bit - I cannot strip off the .csv part of the original and replace it with my string (because each filename should retain it's uniqueness) without something weird going on - I seem to end up with two output files, one with 0 in front of the filename, the other with 1, 0 contains just one line (what was the heading line in the spreadsheet), 1 contains the data. What would be the correctway to chop up a FILENAME and append my own bit to the end? There are parts of the filename at the beginning that I do not want either, they vary too which is the annoying part.

Any help would be much appreciated!

Last edited by jonnymorris; 10-01-2008 at 10:48 AM.
 
Old 10-01-2008, 01:15 PM   #2
jan61
Member
 
Registered: Jun 2008
Posts: 235

Rep: Reputation: 47
Moin,

the space problem: Your gsub does not differ between trailing, embedded and leading spaces. There are different ways to do it:
Code:
# first example: replace ", " and " ," with ","
gsub(/, /, ",");
gsub(/ ,/, ",");
# second example: replace leading and trailing space field by field
for (i = 1; i<= NF; i++) {
  gsub(/^ /, "", $i);
  gsub(/ $/, "", $i);
}
To your filename problem: I would create the new filename using a function in awk. You have to use gensub instead of gsub, because gsub modifies the original string and returns only the number of substitutions:
Code:
NR == 1 { new_fname = gensub(/\.csv$/, "", 1, FILENAME) "_good.csv"; }
{ ...
print ... >new_fname;
}
Jan
 
Old 10-02-2008, 12:43 AM   #3
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 63
Let awk's regular expression field separator work for you:

Code:
$ cat data    
12345 , , , , , 12 , Data Street,   Command Deck   , Enterprise, Space, 17094 

$ cat scr.awk 
BEGIN {
    FS="[ \t]*,[ \t]*";
    OFS=":";
    i=1;
}

{
  for (i=1; i <= NF; i++) {
      printf "%2d '%s'\n", i, $i;
  }
}

$ awk -f scr.awk data    
 1 '12345'
 2 ''
 3 ''
 4 ''
 5 ''
 6 '12'
 7 'Data Street'
 8 'Command Deck'
 9 'Enterprise'
10 'Space'
11 '17094'

Last edited by Mr. C.; 10-02-2008 at 12:44 AM.
 
Old 10-02-2008, 03:49 AM   #4
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Thanks!

Thank you both for your excellent suggestions and help. I used a combination of your ideas:

Code:
BEGIN {FS="[ \t]*,[ \t]*";
		print "Finding good records..." 
	}
NR == 1 { new_fname = gensub(/\.csv$/, "", 1, FILENAME) "_good.csv"; }

$36 ~ /N/ { 
			
		for (i = 1; i<= NF; i++) {
			gsub(/^ /, "", $i);
			gsub(/ $/, "", $i);
			gsub(/"/, "", $i);
			}
			
		print $1 "," $38 " " $39 "," $40 "," $41 "," $42 "," $43 "," $44 "," $59 "," $55 " " $56 " " $57 > new_fname; }

END { print "Finished!  Output sent to '"new_fname"'"}
The FS="[ \t]*,[ \t]*"; part removes the long blank spaces after the first field (a reference number), the for loop does the spaces at start and end of each field (and also removes all occurances of ").

Thanks also for the filename fix! That works rather well.
A few more tweaks and that should be this script finished. I need to catch some records that have extra fields in the middle and adjust the print statement accordingly, nothing too difficult (one hopes). If I get stuck I'll be back!


Thanks again.

Last edited by jonnymorris; 10-02-2008 at 03:51 AM.
 
Old 10-02-2008, 07:27 PM   #5
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1043Reputation: 1043Reputation: 1043Reputation: 1043Reputation: 1043Reputation: 1043Reputation: 1043Reputation: 1043
awk is not good for CSV strings. Sooner or later you'll find a string complying with the CSV format which crashes the AWK script.

I have succesfully been using this little c program.

jlinkels
 
Old 10-02-2008, 07:44 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,362

Rep: Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751
Or Perl with Text::CSV module
 
Old 10-02-2008, 10:10 PM   #7
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,671
Blog Entries: 4

Rep: Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945Reputation: 3945
When you use regular-expressions, there are just a couple of features that you need to keep in mind:
  1. The "^" and "$" symbols refer to the start and the end of the string, respectively. So, if you want to remove only leading-blanks, you could look for "start-of-string followed by one-or-more whitespace characters." And so on.
  2. Regular-expression substitution can be specified to occur "globally" (replace all occurrences) or "do it only once."
 
Old 10-02-2008, 10:52 PM   #8
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244Reputation: 244Reputation: 244
if you have PHP and your data don't have quotes
Code:
<?php
$file = "file.csv";
$handle = fopen("file","r");
while ( ($data = fgetcsv($handle,4096,",")) !== FALSE){
    $data=array_filter($data);
    $data = array_values($data);
    $data=implode(",",$data);
    echo "data: $data";
}
fclose($handle);
output
Code:
# more file.csv
12345 , , ,, , 12, Data Street, Command Deck  , Enterprise, Space, 17094
# php5 test.php
data: 12345 ,12,Data Street,Command Deck  ,Enterprise,Space,17094
 
  


Reply

Tags
awk



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
awk question on handling *.CSV "text fields" in awk jschiwal Programming 8 05-27-2010 06:23 AM
As I right this, I have a large white strip on my screen from top to bottom Virtual Circuit Linux - Hardware 4 08-23-2008 02:27 AM
modify all fields in awk tostay2003 Programming 16 08-09-2008 01:41 AM
shell command using awk fields inside awk one71 Programming 6 06-26-2008 04:11 PM
Supressing Fields w/ AWK Rv5 Programming 3 10-19-2004 11:06 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 11:54 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration