LinuxQuestions.org


sharky 11-03-2008 03:26 PM

Parsing a pseudo CSV file.
 
I have a raw data file that has double quotes around text fields and int and float fields without quotes.

an extremely simplified example:

"Job" 1 2 "some text" "" "" "some text"
"Job" 1 2 "some" "" "some text" ""

The problem is that the file is space-delimited rather than comma-delimited, but some of the text fields also contain spaces. There are over 30 fields, and corresponding 'columns' can have text with or without spaces. Empty fields are denoted by a pair of double quotes, whether that particular field would otherwise be text or not.

In Perl I tried the Text::CSV module, but it was unable to parse some lines. I've also tried various field separators in awk and various split options in Perl, but I can't get anything to work.
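
For example, a plain split on single spaces just chops the quoted fields apart:
Code:

perl -le 'print join(" | ", split / /, q{"Job" 1 2 "some text" "" "" "some text"})'
"Job" | 1 | 2 | "some | text" | "" | "" | "some | text"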

I think my only alternative is brute force parsing of each piece of text individually and building each field.

colucix 11-03-2008 04:12 PM

Just a thought: what about converting the proper spaces into commas, using sed? For example, you may want every double quote followed by a space transformed into a double quote followed by a comma, and every digit followed by a space into that digit followed by a comma, e.g.
Code:

sed -i 's/" /",/g' file
sed -i 's/ "/,"/g' file
sed -i 's/\([0-9]\) /\1,/g' file
sed -i 's/ \([0-9]\)/,\1/g' file

Then you can parse the file as CSV. Anyway, this method fails if some text field contains a digit next to a space, or if a floating-point number begins or ends with a bare dot (for example .9 in place of 0.9, or 7. in place of 7.0). You can try to work out more specific regexps to avoid this problem.

Please note: as you probably already know, the -i option of sed edits the file in place. Make a backup copy of the original file before testing the above commands (if you want to give them a try).
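
For instance, a dry run on the first sample line (dropping -i and piping through echo instead, so nothing gets overwritten) should give something like:
Code:

echo '"Job" 1 2 "some text" "" "" "some text"' | sed -e 's/" /",/g' -e 's/ "/,"/g' -e 's/\([0-9]\) /\1,/g' -e 's/ \([0-9]\)/,\1/g'
"Job",1,2,"some text","","","some text"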

TB0ne 11-03-2008 04:21 PM

Quote:

Originally Posted by sharky (Post 3330541)
...
I think my only alternative is brute force parsing of each piece of text individually and building each field.

Been down that road before...ugly, too, and it doesn't lead to a nice neighborhood. ;)


How about a low-tech solution?

1. Remove any existing commas (sed 's/,/ /g')
2. Put a comma in place of any empty fields (sed 's/\"\"/,/g')
3. Replace any double-quote/space combos with a comma (sed 's/\" /,/g')
4. Replace any space/double-quote combos with a comma (sed 's/ \"/,/g')
5. Strip out any remaining double-quotes (sed 's/\"//g')

That may leave you with something resembling a CSV that you could work with. And I'm positive there are better ways to SED that up, but I'm too lazy to look them up right now. ;)
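
Chained together, with "file" standing in for the real data file, those five steps come out to something like:
Code:

sed -e 's/,/ /g' -e 's/\"\"/,/g' -e 's/\" /,/g' -e 's/ \"/,/g' -e 's/\"//g' file

Numbers separated only by spaces (the 1 2 in the sample lines) come through still space-separated, so the digit rules from the earlier post may still be needed on top of this.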

Sergei Steshenko 11-03-2008 04:32 PM

No, just write a straightforward parser - not a big deal.

anomie 11-03-2008 04:57 PM

Quote:

Originally Posted by sharky
I have a raw data file that has double quotes around text fields and int and float fields without quotes.
...
The problem is that the file is space-delimited rather than comma-delimited, but some of the text fields also contain spaces. There are over 30 fields, and corresponding 'columns' can have text with or without spaces. Empty fields are denoted by a pair of double quotes, whether that particular field would otherwise be text or not.

[ Caveat: I'm a C novice (just learning in the last few days), so take this with a grain of salt. ]

I couldn't think of a way to solve this problem using my usual favorites (awk / sed / et al.), so this gave me a chance to hack together a little C program to accomplish the task.

Here it is, blank2comma.c:
Code:

#include <stdio.h>

/* This program will convert blank-delimited data to comma-delimited
  data. Example:
  ---
  "Line1" 90 "Data data" "More data here" 5
  --- converts to
  "Line1",90,"Data data","More data here",5                */

#define IN  1
#define OUT  0
#define SIZE 2000

int main(void) {

  int maxsize = SIZE; 
  int state = OUT; 
  int c;
  int ctr = 0;

  while ((c=getchar()) != EOF) {

    ++ctr;

    if (c == '\n') {
      ctr = 0;           /* New input line: restart the count */
    }
    else if (ctr > maxsize) {
      c = '\n';          /* Line wraps if we hit maxsize... */
      ctr = 0;
    }
    else if (state == OUT && c == ' ')
      c = ',';
    else if (state == IN && c == '"')
      state = OUT;
    else if (state == OUT && c == '"')
      state = IN;

    putchar(c);

  }

  return 0;
}

Example usage (after compiling...):
Code:

%cat data
"Line1" "This is an example" 5 10
"Line2" "More data" 9 3242
"Line3" "OK" 0 0
"Line4" "" 100 99
"Line5" "Another example and then we're done" 50 2343

%gcc blank2comma.c

%cat data | ./a.out
"Line1","This is an example",5,10
"Line2","More data",9,3242
"Line3","OK",0,0
"Line4","",100,99
"Line5","Another example and then we're done",50,2343

Be sure to set SIZE to a large enough value, or you're going to get ugly results (wrapped lines). After you've run it through blank2comma, comma-separated data is easy to operate on. You could modify this to use pipes, or any other delimiter.

This is a hastily tested, crappy little thing I threw together, so don't bet the farm on it. ;) I hope at least it gives you some different ideas.

colucix 11-03-2008 05:16 PM

Good shot, anomie! It would be easy to translate your code into awk, but I believe the C code is faster.
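
Something along these lines, perhaps (untested, and it skips the SIZE line-wrap guard, since awk already works a line at a time):
Code:

awk '{
  out = ""
  inq = 0
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "\"") {
      inq = !inq                       # toggle in/out of a quoted field
    } else if (c == " " && !inq) {
      c = ","                          # unquoted space becomes a comma
    }
    out = out c
  }
  print out
}' data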

anomie 11-03-2008 05:19 PM

Quote:

Originally Posted by colucix
Good shot, anomie!

Well, IIRC you are a distro programmer (opensuse?), so I'll take it as a success. :) I hope it works out for OP (after adequate testing and data validation).

sharky 11-03-2008 05:46 PM

I almost got something to work.

cat nasty.file | sed 's/\" /\",/g' | sed 's/\([0-9]\) /\1,/g'

There were several snafus. One of my text fields was like this:

"text && text11 "

This particular 'field' has a space before the closing double quote in every record. However, the 'text' in this particular case looks to be the same each time, and it's not a critical field. I tried 's/text[0-9]+/text/g' but that didn't work.
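
Looks like the + might be the problem: sed's basic regular expressions treat + as a literal plus sign, so text[0-9]+ never matches text11. Spelling out the repeat (or switching to extended regexps with GNU sed's -r) should at least make it match:
Code:

sed 's/text[0-9][0-9]*/text/g' nasty.file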

I'm almost there.

gjagadish 11-03-2008 10:47 PM

Hope this helps you!!!
Code:

echo 'Hello "How are you" "123123" 34234234 34 "text && text11 " 12345' | perl -ne '@fields = (/(?:^| )(".*?"|.*?)(?= |$)/g); foreach (@fields) { print "$_ | "; } print "\n"'

Hello | "How are you" | "123123" | 34234234 | 34 | "text && text11 " | 12345 |

