LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 11-03-2008, 03:26 PM   #1
sharky
Member
 
Registered: Oct 2002
Posts: 384

Rep: Reputation: 37
Parsing a pseudo CSV file.


I have a raw data file that has double quotes around text fields and int and float fields without quotes.

an extremely simplified example:

"Job" 1 2 "some text" "" "" "some text"
"Job" 1 2 "some" "" "some text" ""

The problem is that the file is space, rather than comma, delimited but some of the text fields also have a space. There are over 30 fields and corresponding 'columns' can have text with or without spaces. Empty fields are donated by a pair of double quotes whether that particular field would otherwise be text or not.

In perl I tried using a perl module, Text::CSV, but it was unable to parse some lines. I've also tried various field separators in awk and various split options in perl but can't get anything to work.

I think my only alternative is brute force parsing of each piece of text individually and building each field.
 
Old 11-03-2008, 04:12 PM   #2
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941
Just a thought: what about converting the proper spaces into commas, using sed? For example you may want every double quote followed by a space, transformed into a double quote followed by a comma. Or every digit followed by a space into that digit followed by comma, e.g.
Code:
sed -i 's/" /",/g' file
sed -i 's/ "/,"/g' file
sed -i 's/\([0-9]\) /\1,/g' file
sed -i 's/ \([0-9]\)/,\1/g' file
then you can parse the file as a CSV. Anyway, this method fails if some text field contains a digit surrounded by spaces or if a floating point number begin or ends with dot (for example .9 in place of 0.9 or 7. in place of 7.0). You can try to find out more specific regexp to avoid this problem.

Please note: as you probably already know, the -i option of sed edits the file in place. Do a backup copy of the original file before testing the above commands (if you want to give them a try).
 
Old 11-03-2008, 04:21 PM   #3
TB0ne
Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 14,192

Rep: Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469Reputation: 2469
Quote:
Originally Posted by sharky View Post
I have a raw data file that has double quotes around text fields and int and float fields without quotes.

an extremely simplified example:

"Job" 1 2 "some text" "" "" "some text"
"Job" 1 2 "some" "" "some text" ""

The problem is that the file is space, rather than comma, delimited but some of the text fields also have a space. There are over 30 fields and corresponding 'columns' can have text with or without spaces. Empty fields are donated by a pair of double quotes whether that particular field would otherwise be text or not.

In perl I tried using a perl module, Text::CSV, but it was unable to parse some lines. I've also tried various field separators in awk and various split options in perl but can't get anything to work.

I think my only alternative is brute force parsing of each piece of text individually and building each field.
Been down that road before...ugly, too, and it doesn't lead to a nice neighborhood.


How about a low-tech solution?

1. Remove any existing commas (sed 's/,/ /g')
2. Put a comma in place of any empty fields (sed 's/\"\"/,/g')
3. Replace any double-quote/space combos with a comma (sed 's/\" /,/g')
4. Replace any space/double-quote combos with a comma (sed 's/ \"/,/g')
5. Strip out any remaining double-quotes (sed 's/\"//g')

That may leave you with something resembling a CSV that you could work with. And I'm positive there are better ways to SED that up, but I'm too lazy to look them up right now.
 
Old 11-03-2008, 04:32 PM   #4
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453Reputation: 453Reputation: 453Reputation: 453Reputation: 453
No, just write a straightforward parser - not a big deal.
 
Old 11-03-2008, 04:57 PM   #5
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora, Lubuntu, FreeBSD
Posts: 3,930
Blog Entries: 5

Rep: Reputation: Disabled
Lightbulb

Quote:
Originally Posted by sharky
I have a raw data file that has double quotes around text fields and int and float fields without quotes.
...
The problem is that the file is space, rather than comma, delimited but some of the text fields also have a space. There are over 30 fields and corresponding 'columns' can have text with or without spaces. Empty fields are donated by a pair of double quotes whether that particular field would otherwise be text or not.
[ Caveat: I'm a C novice (just learning in the last few days), so take this with a grain of salt. ]

I couldn't think of a way to solve this problem using my usual favorites (awk / sed / et al.), so this gave me a chance to hack together a little C program to accomplish the task.

Here it is, blank2comma.c:
Code:
#include <stdio.h>

/* This program will convert blank-delimited data to comma-delimited 
   data. Example: 
   ---
   "Line1" 90 "Data data" "More data here" 5
   --- converts to
   "Line1",90,"Data data","More data here",5                 */

#define IN   1
#define OUT  0
#define SIZE 2000

main() {

  int maxsize = SIZE;  
  int state = OUT;  
  int c; 
  int ctr = 0;

  while ((c=getchar()) != EOF) {

    ++ctr; 

    if (ctr > maxsize) {
      c = '\n';          /* Line wraps if we hit maxsize... */ 
      ctr = 0;
    }
    else if (state == OUT && c == ' ')
      c = ',';
    else if (state == IN && c == '"')
      state = OUT;
    else if (state == OUT && c == '"')
      state = IN;

    putchar(c);

  }

}
Example usage (after compiling...):
Code:
%cat data 
"Line1" "This is an example" 5 10
"Line2" "More data" 9 3242
"Line3" "OK" 0 0
"Line4" "" 100 99
"Line5" "Another example and then we're done" 50 2343

%gcc blank2comma.c

%cat data | ./a.out
"Line1","This is an example",5,10
"Line2","More data",9,3242
"Line3","OK",0,0
"Line4","",100,99
"Line5","Another example and then we're done",50,2343
Be sure to set SIZE to a large enough value, or you're going to get ugly results (wrapped lines). After you've run it through blank2comma, comma-separated data is easy to operate on. You could modify this to use pipes, or any other delimiter.

This is a hastily tested, crappy little thing I threw together, so don't bet the farm on it. I hope at least it gives you some different ideas.

Last edited by anomie; 11-03-2008 at 05:03 PM. Reason: couple clarifications
 
Old 11-03-2008, 05:16 PM   #6
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941
Good shot, anomie! It would be easy to translate your code in awk, but I believe the C code is faster.
 
Old 11-03-2008, 05:19 PM   #7
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora, Lubuntu, FreeBSD
Posts: 3,930
Blog Entries: 5

Rep: Reputation: Disabled
Quote:
Originally Posted by colucix
Good shot, anomie!
Well, IIRC you are a distro programmer (opensuse?), so I'll take it as a success. I hope it works out for OP (after adequate testing and data validation).
 
Old 11-03-2008, 05:46 PM   #8
sharky
Member
 
Registered: Oct 2002
Posts: 384

Original Poster
Rep: Reputation: 37
I almost got something to work.

cat nasty.file | sed 's/\" /\",/g' | sed 's/\([0-9]\) /\1,/g'

There were several snafus. One of my text field was like this;

"text && text11 "

This particular 'field' has a space before the double quote in every record. However, the 'text' in this particular case looks to be the same and it's not a critical field. I tried 's/text[0-9]+/text/g' but that didn't work.

I'm almost there.
 
Old 11-03-2008, 10:47 PM   #9
gjagadish
Member
 
Registered: Oct 2005
Location: India
Distribution: Fedora Core
Posts: 88

Rep: Reputation: 15
Hope this helps you !!!

echo 'Hello "How are you" "123123" 34234234 34 "text && text11 " 12345' | perl -ne '@fields = (/(?:^| )(".*?"|.*?)(?= |$)/g); foreach (@fields) { print "$_ | "; } print "\n"'

Hello | "How are you" | "123123" | 34234234 | 34 | "text && text11 " | 12345 |
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
ldapsearch to csv file ralinux Linux - General 1 08-18-2008 12:34 PM
Comparing two csv files and write different record in third CSV file irfanb146 Linux - Newbie 3 06-30-2008 09:15 PM
Parsing XLS or CSV in Perl - what and what not me_the_apprentice Programming 13 02-28-2006 03:44 AM
Regarding Pseudo tty, Pseudo terminals ? mqureshi Programming 0 07-30-2005 10:51 AM
CSV File AMMullan Programming 2 11-10-2003 12:49 AM


All times are GMT -5. The time now is 04:59 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration