LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (http://www.linuxquestions.org/questions/linux-general-1/)
-   -   Parsing a comma separated CSV file where fields have commas in to (http://www.linuxquestions.org/questions/linux-general-1/parsing-a-comma-separated-csv-file-where-fields-have-commas-in-to-714332/)

trickyflash 03-25-2009 09:34 AM

Parsing a comma separated CSV file where fields have commas in to
 
I have a (comma separated) CSV file which has fields with text. Some of the text fields have commas in, and the CSV file puts (only) these fields in inverted commas.

From a command (bash) prompt, I wish to write a bash script to (for each line) separate values $columnA, $columnB etc. so that I can 'play' with them.
If I use comma as a delimiter, it doesn't work. If I use " as a delimiter, same thing.
I'm sure I did this before, and probably did it the long way, but really cannot for the life of me think of the best way to do this.

The end outcome is simple...
For each line (using for line in bla bla bla), I wish to capture variables reflective of the full value of each 'column' (CSV field).
Is there a simple way to do this? Am I being thick?!!!

I've just got a netbook, with Linux, and want to make my life simpler using shell scripts.

Help gratefully appreciated.

Richard

sundialsvcs 03-25-2009 10:26 AM

Easy answer: use Perl.

For instance, have a look at this page from Perl's CPAN library: http://search.cpan.org/~adamk/Parse-...b/Parse/CSV.pm.

What you see here is a very well-thought-out solution to this general problem, which you can avail yourself of very simply.

Or... maybe you don't want to learn anything about installing Perl stuff (a reasonable conclusion... you just want to get the job done). So how about PHP? You'll see a six-line example here: http://www.homeandlearn.co.uk/php/php10p6.html

The key point that I am making here is... you've got lots of high-level languages, probably already-installed on your Linux box. You can use any of them in the shell through the magic of the #! "shebang" mechanism. You are not confined to "BASH shell-scripting," and in this case you'd be using that facility for something other than its intended purpose ... never a good idea.

So, since you do have so many choices available to you, don't :tisk: waste your own time trying to figure out how to get a job done using an inappropriate tool. "Pointless academic exercises" are best left behind you, in your good-riddance memories of bygone college days. ;)

crabboy 03-25-2009 10:37 AM

I'd agree, bash is probably not the best tool to parse csv files. I saw this script, but it may need some help:

http://fixunix.com/unix/83523-how-do...sh-script.html

When I get to the point of having to use arrays in bash, I usually switch to another language. Here is a quick hacked up example of parsing csv in Java. Much cleaner and robust than the bash example above. This example uses the opencsv library for the parsing.

Code:

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import au.com.bytecode.opencsv.CSVReader;


public class ParseCSV
{
      public void loadCSV(String strFileName, char delimiter )
      {
        try
        {
            CSVReader reader = new CSVReader(new FileReader(strFileName), delimiter );
            List<String[]> csvEntries = reader.readAll();
            Iterator<String[]> Iter = csvEntries.iterator();

            while ( Iter.hasNext())
            {
              String [] strArray = (String[])Iter.next();
              for ( int i = 0; i < strArray.length; i++ )
              {
                  System.out.print("[" + strArray[i] + "]");
              }
              System.out.println("");
            }
        }
        catch ( FileNotFoundException fnfe )
        {
        }
        catch ( IOException ioe )
        {
        }

      }
     
  public static void main(String[] args)
  {
      ParseCSV p = new ParseCSV();
      p.loadCSV("file.csv", ',');
  }

}

File:
Code:

row1col1,"row1,col2","row1,col3",,row1col4,,,,"r,o,w,1,c,o,l,8"
row2col1,"row2,col2","row2,col3",,row2col4,,,,"r,o,w,2,c,o,l,8"

output:
Code:

[row1col1][row1,col2][row1,col3][][row1col4][][][][r,o,w,1,c,o,l,8]
[row2col1][row2,col2][row2,col3][][row2col4][][][][r,o,w,2,c,o,l,8]


jiml8 03-25-2009 10:54 AM

In PHP you have the explode command which handles all the parsing for you, splitting a string on the character you define and returning an array.

In C++, here is an implementation of explode:

Code:

std::vector<std::string> explode(std::string const& str, std::string const& delim)
{
  using std::string;
  string::size_type start(0);
  string::size_type end(string::npos);
  std::vector<std::string> sub;

  while( (end = str.find(delim,start)) != string::npos)
  {
      sub.push_back(str.substr(start, end - start));
      start = end + delim.size();
  }
  if(start < str.size() )sub.push_back(str.substr(start, str.size() - start));
  return sub;
}


theNbomr 03-25-2009 11:03 AM

Because of the possibility of fields containing embedded commas as you point out, this problem is elevated from the domain of that which is best done with a shell script. Specialized modules/packages for various language bindings exist, which are well crafted to do the job. I would be inclined to use Perl, and the Parse::CSV module or the Text::CSV::Simple module.
--- rod.

trickyflash 03-26-2009 05:42 AM

Firstly, thank-you all for your input.

I'm not familiar with (how to write in) Perl, and fairly poor on php, and am conscious of introducing extra complexity.
My background dates to DOS, hence the 'love' of 'bash' shell scripts to do my 'routine' stuff.

Sounds like I might have to get to learn another language in order to do something mundane!!
Surely there must be another way! (I'm sure I'll get to learn some php/perl at some point, but not sure now is the best time to begin - with 2 small daughters, time is not on my side !!)

So options:
1. PHP script (scary due to lack of knowledge, and also unknown how can make 'part php/part bash'
2. Perl script (scary for same reasons as 1.)
3. Think up a clever way with bash (and be able to re-use it elsewhere in other bash scripts)
4. Give up on scripting it, and do it manually (tedious)!

Another reason behind bash the rest of the script was in bash, and the csv bit was to capture the variables from each column for each line.

Any more thoughts/options to make my life easier ??

R.

On a separate note.. Not sure what the requirements are for php/perl etc. and think I could spend hours and hours setting it up for something I'm unlikely to use for anything else. (the netbook eee1000 does what I need for most things, but isn't the smartest netbook, as the build with Xandros is quite restricted, and missing lots of dependencies, has a dodgy sources.list file and makes adding apps quite painful)

ghostdog74 03-26-2009 06:54 AM

Quote:

Originally Posted by trickyflash (Post 3487262)
I have a (comma separated) CSV file which has fields with text. Some of the text fields have commas in, and the CSV file puts (only) these fields in inverted commas.

From a command (bash) prompt, I wish to write a bash script to (for each line) separate values $columnA, $columnB etc. so that I can 'play' with them.
If I use comma as a delimiter, it doesn't work. If I use " as a delimiter, same thing.
I'm sure I did this before, and probably did it the long way, but really cannot for the life of me think of the best way to do this.

a similar thread. you can get the awk script and play with it.
Code:

# awk -f awkcsv file
First Last,"Las Vegas, NV",80 -> 3|First Last|Las Vegas, NV|80|


trickyflash 03-26-2009 04:30 PM

Cool. I shall give it a whirl.

Something said on this forum jogged my memory, and I discovered my old script. Not as clean cut as the awk example above. A few grep's and sed's and cut's to read and change each line, and it had the same effect. Think I'll have a play with awk and figure out how to make best use of it. (and perhaps save learning php/perl for another year!!)

Thanks to all of you.

R.


All times are GMT -5. The time now is 05:56 PM.