
tronayne 10-15-2013 11:55 AM

CSV Files with Mixed Text and Numeric Data - Easy Way to Edit?
I have a bunch of CSV files that are a mixture of text (and thus enclosed by double quotes) and numeric (which aren't). I need to combine two numeric columns into one (they're height and width, in inches, typically 10.5 and 7.5 respectively). Normally, I'd do that with AWK or maybe sed, but I'm kind of at a loss and would appreciate any advice on how to do it. The files are all identical in format: same columns, same title row, same data types (they're for batch loading data into a data base). The files are literally comma-separated, there are commas and vertical bars embedded in the data, and the "text" columns are all enclosed in double quotes.

I'm thinking that I might be able to do this with OpenOffice Calc, but I'm not sure if it can be done in batch (I'm looking, haven't found) which would be the berries, but, so far, no luck there (and I probably missed something simple).

Is there anything anybody knows of that does a good job of after-the-fact editing of CSV files?

perbh 10-15-2013 01:01 PM

According to wikipedia (.csv):

The CSV file format is very simple and supported by almost all spreadsheets and database management systems. Many programming languages have libraries available that support CSV files. Many implementations support changing the field-separator character and some quoting conventions, although it is safest to use the simplest conventions, to maximize the recipients' chances of handling the data.

Microsoft Excel will open .csv files, but depending on the system's regional settings, it may expect a semicolon as a separator instead of a comma, since in some languages the comma is used as the decimal separator. Also, many regional versions of Excel will not be able to deal with Unicode in CSV. One simple solution when encountering such difficulties is to change the filename extension from .csv to .txt; then opening the file from an already running Excel with the "Open" command.

When pasting text data into Excel, normally the tab character is used as a separator: If you copy "hello<tab>goodbye" into the clipboard and paste it into Excel, it goes into two cells. "hello,goodbye" pasted into Excel goes into one cell, including the comma. If you use the functionality "text to columns" within Excel and change the settings, it changes the settings for pasting in text data as well. Calc and LibreOffice Calc handle CSV files and pasted text with a Text Import dialog asking the user to manually specify the delimiters, encoding, format of columns, etc.

There are many utility programs on Unix-style systems that can deal with at least some CSV files. Many such utilities have a way to change the delimiter character, but lack support for any other variations (or for Unicode). Some of the useful programs are:

uniq (-f to skip comparing the first N fields)

rg3 10-15-2013 01:38 PM

In cases like these, when things start to get a bit complex, it's better to go for Python or Perl, IMHO. If I understood you properly, this Python program prints the list of fields in each line, and can be extended from there:

Edit: discarded the old Python version, as there's a native csv module that handles these just fine.


#!/usr/bin/env python
import sys
import csv

# Print each row of the CSV as a list of field strings.
with open(sys.argv[1], 'r') as stream:
    reader = csv.reader(stream, delimiter=',', quotechar='"')
    for row in reader:
        print(row)

The "row" variable will hold a list of strings that can be converted to numbers or not. This sample data:


"one, two, three",10.5,7.5
"four, five, six",11.5,6.5

Produces the following output (note: Python is printing the lists raw here with brackets, and surrounding each string with single quotes):


$ ./ data.csv
['one, two, three', '10.5', '7.5']
['four, five, six', '11.5', '6.5']
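Extending that sketch to the actual task of joining two columns: the snippet below is one guess at the general shape (the function name, the default column indices, and the " x " separator are all made up for illustration; the thread later establishes the columns to combine as numbers 6 and 7). Numeric-looking fields are converted to floats so the writer leaves them unquoted, matching the "text columns are quoted" convention, though this does normalize their formatting (e.g. "11" becomes "11.0").

```python
#!/usr/bin/env python
import csv
import sys

def numify(field):
    """Return a float for numeric-looking fields so the csv writer
    leaves them unquoted; everything else stays a (quoted) string."""
    try:
        return float(field)
    except ValueError:
        return field

def combine_columns(infile, outfile, first=5, second=6, sep=" x "):
    """Join columns `first` and `second` (0-based) into one text field."""
    with open(infile, newline="") as src, open(outfile, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=",", quotechar='"')
        # QUOTE_NONNUMERIC quotes every string field and leaves
        # numbers bare, matching the bulk loader's convention.
        writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC)
        for row in reader:
            row[first] = row[first] + sep + row[second]
            del row[second]
            writer.writerow([numify(f) for f in row])

if __name__ == "__main__":
    combine_columns(sys.argv[1], sys.argv[2])
```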

55020 10-15-2013 01:38 PM

The answer depends on the tools you are comfortable with, how many files you have, and whether you will be doing this again in the future.

Plan A: Spreadsheets will happily import quoted numeric fields, even if they don't export them. So start by using sed to wrap all the numeric fields in quotes as if they were text fields. Then you can use awk, or cut/join, using "," (quote-comma-quote) as the consistent field delimiter, just as you suggested. If there are no embedded doubled-up quotes within your text fields (e.g. "Title: ""Hamlet""",1600,...) then this will be quite easy, but a truly general and repeatable solution would need really painful regular expressions.
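For what it's worth, the "really painful regular expression" that splits on commas only when they sit outside double quotes can be sketched like this (shown in Python for consistency with the other examples in the thread; as noted above, doubled-up "" escapes inside fields defeat it):

```python
import re

# Split only on commas that are followed by an even number of double
# quotes before end-of-line, i.e. commas outside any quoted field.
# Doubled-up "" escapes inside a field will break this heuristic.
SPLIT_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def naive_split(line):
    """Split one CSV line on commas outside double quotes."""
    return SPLIT_OUTSIDE_QUOTES.split(line)
```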

Plan B: Openoffice, as you suggest. If you work with people who do heavy Excel work, they will probably be comfortable with this solution, and may even want to help you with macros and templates. This would be a general solution that you could hand over to other people, if it's a long term requirement.

Plan C: Perl. There is a CPAN module to read CSV files; it handles quoted fields just fine. Search the web for "perl csv" and you will be able to find lots of examples of how it is used. Because you are comfortable with awk, you will probably be able to pick up 'just enough' Perl to adapt those examples for what you need to do, using Google and the Camel Book. This is a good plan if you always wanted to learn some Perl ;-) I hate Perl, but this is what I do sometimes :O

Plan D: Read it into an sql database, munge it with sql, export it back to csv. Even if you know lots of sql already, this probably won't work (sqlite, for example, doesn't support quoted text fields in csv import).
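That said, Plan D can be made to work if a csv library does the parsing and the database only ever sees already-split rows. A rough sketch using Python's built-in csv and sqlite3 modules (the table and column names here are invented for illustration):

```python
import csv
import io
import sqlite3

def munge_via_sql(csv_text):
    """Load CSV rows into an in-memory SQLite table, combine the two
    numeric columns with SQL, and return the munged rows."""
    # Let the csv module handle quoting; SQLite never sees raw CSV.
    rows = list(csv.reader(io.StringIO(csv_text)))
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE books (title TEXT, height REAL, width REAL)")
    db.executemany("INSERT INTO books VALUES (?, ?, ?)", rows)
    # The "munge it with sql" step: join height and width into one field.
    return db.execute(
        "SELECT title, height || ' x ' || width FROM books").fetchall()
```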

I've had this problem myself, more than once, and have tried everything except Plan D :-(
Am I the only one who sees a great need for a "text only" pseudospreadsheet app, scriptable, with csv as its native data format?

perbh 10-15-2013 03:02 PM

Can you give us a guesstimate of the number of fields and the max length of a field?

btw - if you google for 'awk +csv' there are a couple of csv-decoders there - using awk ...

tronayne 10-16-2013 09:03 AM

Wow. Who knew there were so many ways to skin a cat... well, I kind of did 'cause I've been doing Unix/Linux for a long, long time and skinning cats is our business, eh?

As it happens, I've been dealing with CSV files for decades for import into a data base -- those have (almost all) been tab-delimited. I simply change the tabs to vertical bars, do a quick "filter" with C or AWK and load what I need with the vertical bar delimiter (that works with Informix, which defaults to vertical bar delimiters, MySQL and PostgreSQL) lickety-split. The problem here is that the application bulk loader utility will only process comma delimiters and quoted text fields and there's no way to change that (and I don't want to write my own loader because the application data base is spread out all over the place and it's just not practical).

These CSV files have been created transcribing paper records into a spread sheet; folks doing the transcribing have a template, Calc or Excel, that they use to do the entries, those files get sent with e-mail or arrive on a flash drive and get massaged a little to prepare them for bulk loading (that's a little sed operation).

Here I have to combine two columns into one. No-brainer until I tried it. Arrgghh! Sure, I can do it in Calc but I want to avoid doing it by hand. The only thing I know about perl is that it's a pain in the butt.

@55020, your suggestion about importing into a data base is feasible (no problem with SQL here). Might give that a shot and see what's what. Perl is probably out because I know absolutely nothing about programming in Perl (or Python for that matter).

@perbh, your suggestion about "awk + csv" seems the most feasible, I know AWK, it's easy, and I'm pretty sure that will do the trick... well, maybe, eh?

The CSV files aren't terribly big (less than 1,000 rows, some as few as 10 rows). Lots of text with embedded new lines (the Ctrl-Enter trick), lots of embedded || to import, in this case, more than one author/illustrator of a book (it's all books). The data do not have to be in any particular order (the title row determines where it goes), thus I can simply take one column, take the other column, make it pretty, tack it to the end and remove the two columns (or, with AWK, simply combine the two in place or something like that). Whatever I do, the existing file will be written to a new file and that one will get loaded.

One lesson learned during a wasted youth is that CSV does not have to be quoted if the column delimiter does not appear in the data (that's why vertical bars are so handy and commas are not). The annoying part here is the insistence by the application bulk loader that the CSV file be comma delimited and text columns be quoted.
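As a side note, the loader's convention (comma-delimited, text quoted, numerics bare) is exactly what Python's csv writer produces in QUOTE_NONNUMERIC mode, provided the numeric fields are handed over as actual numbers rather than strings; a minimal sketch (the function name is made up for illustration):

```python
import csv
import io

def loader_line(fields):
    """Format one row in the bulk loader's convention: comma-delimited,
    text fields double-quoted, numeric fields left bare."""
    buf = io.StringIO()
    # QUOTE_NONNUMERIC quotes strings and leaves int/float fields bare.
    csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(fields)
    return buf.getvalue().rstrip("\r\n")
```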

Thanks to all for the help, I do appreciate it (and I get to learn something every day).

catkin 10-16-2013 09:32 AM

Going from the OP information, these CSVs are well formatted in the sense that any commas in the data are within double quotes, so commas outside double quotes can reliably be treated as column separators.

That being the case, you may like to look at the awk script attached to my blog post. It parses such a regular CSV (an Outlook contacts format CSV, including line ends within the double-quoted data) and re-formats it into a CSV for import to OTRS.

While it is very specific to those input and output formats, it could be adapted to your requirement.
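For comparison, Python's csv module also copes with line ends inside double-quoted fields natively; a minimal sketch:

```python
import csv
import io

def parse_with_newlines(text):
    """Parse CSV text where quoted fields may contain embedded line ends."""
    return list(csv.reader(io.StringIO(text)))

# A quoted field with an embedded line feed stays one field:
sample = '"line one\nline two",10.5,7.5\n'
```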

gnashley 10-16-2013 10:39 AM

"skinning cats is our business, eh?" -well yes, but only after they have been herded to the 'skinnery'.

tronayne 10-16-2013 12:56 PM


Originally Posted by catkin (Post 5046878)
Going from the OP information, these CSVs are well formatted in the sense that any commas in the data are within double quotes, so commas outside double quotes can reliably be treated as column separators.

Nicely done and I appreciate the sample.

Thing is, I think I've finally figured out what to do, so I don't have to deal with comma-delimited fields with and without double-quoted text fields (they're not all text; some are integer, some are floating-point, some are ISO dates, etc.).

I have 26 fields. Numbers 6 and 7 are the ones I need to combine.

The spread sheet is either OpenOffice Calc (an ods file) or Microsoft Excel; the Calc file is from a template that is saved as a spread sheet which is later saved as a CSV (and similarly with Excel). There are embedded line feeds in some text fields. I could skip the save-as ods and just save-as CSV, but there's some stuff that gets added to the latter (like changing the headers from something like "author" to "") which sort of confuses the non-techie folks who are doing the transcribing from paper to confuser.

Essentially, saving these as CSV with tab delimiters makes it a read-write job where, when I see the 5th tab, I do special stuff with that field and the next field, then continue reading and writing, perhaps changing the tabs to commas. No parsing with AWK, just a simple little C program to handle the combined fields and the embedded line feeds. It's a one-off that will get this job done (can't use fgets because it stops scanning at a line feed; a raw read doesn't). Won't be terribly efficient, but what the heck: read a byte, write a byte, fiddle at the right place, read a byte, write a byte.

Thanks again for the input.

artificiosus 10-16-2013 04:14 PM

It seems you have your solution, but I tend to have good luck manipulating CSV files in R.

datain.csv contains:
"some stuff",10.7,7.5
"orange red",8.3,6.3
"blue green",6.5,2.5

Then in R:
> data <- read.csv("datain.csv", header=FALSE, col.names=c("Item","Height","Width"))
> colnames(data)[2]<-"Dimensions"
> data$Dimensions<-paste(data$Dimensions,data$Width,sep=' by ')
> data<- subset(data, select = -Width)
> write.csv(data, "dataout.csv", row.names=F)

dataout.csv contains:
"Item","Dimensions"
"some stuff","10.7 by 7.5"
"orange red","8.3 by 6.3"
"blue green","6.5 by 2.5"

tronayne 10-17-2013 08:11 AM


Originally Posted by artificiosus (Post 5047114)
It seems you have your solution, but I tend to have good luck manipulating CSV files in R.

Now, that's one I never would have thought of -- pretty darn slick. Gonna keep it around for some dang thing that I just know will pop up sooner or later, thank you.

What I actually did was grab my standard C template (bare code with all the command line arguments, file declarations, etc., that doesn't actually do anything) and added this:

                /*      read one byte at a time from the file   */
                c = 0;
                width = TRUE;
                while ((n = fread (buf, sizeof (char), 1, in))) {
                        if (buf[0] == '\t') {
                                c++;
                                /*      are we at the width field?      */
                                if (c == 6 && width) {
                                        /*      join height and width with " x " instead of a tab       */
                                        (void) fprintf (stdout, " x ");
                                        width = FALSE;
                                        continue;
                                }
                                if (c >= 25) {
                                        /*      last field of the record: reset for the next row        */
                                        c = 0;
                                        width = TRUE;
                                }
                        }
                        (void) fprintf (stdout, "%c", buf[0]);
                }
                /*      close the input file            */
                if (fclose (in))
                        (void) fprintf (stderr,
                            "%s:\tcan't close %s\n",
                            argv [0], argv [optind]);

Got the job done; maybe not the most efficient way, but it got the job done.

Do like the look of that R thing, though, and I'm going to play with it a little and see.

Got 'er done, thanks to all for the helpful suggestions (and ways and means).
