Old 10-15-2013, 11:55 AM   #1
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 2,858

Rep: Reputation: 697
CSV Files with Mixed Text and Numeric Data - Easy Way to Edit?


I have a bunch of CSV files that are a mixture of text columns (and thus enclosed by double quotes) and numeric columns (which aren't). I need to combine two numeric columns into one (they're height and width, in inches, typically 10.5 and 7.5 respectively). Normally, I'd do that with AWK or maybe sed, but I'm kind of at a loss and would appreciate any advice on how to do it. The files are all identical in format: same columns, same title row, same data types (they're for batch-loading data into a database). The files are literally comma-separated, there are commas and vertical bars embedded in the data, and the "text" columns are all enclosed in double quotes.
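For illustration, here's a made-up sample row (not real data) showing the kind of change I'm after, with the two numeric columns merged into one quoted text column:

Code:
"The Title, Subtitle",10.5,7.5
becomes

Code:
"The Title, Subtitle","10.5 x 7.5"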

I'm thinking that I might be able to do this with OpenOffice Calc, but I'm not sure it can be done in batch, which would be the berries; I'm looking, but so far no luck there (and I probably missed something simple).

Is there anything anybody knows of that does a good job of after-the-fact editing of CSV files?
 
Old 10-15-2013, 01:01 PM   #2
perbh
Member
 
Registered: May 2008
Location: Republic of Texas
Posts: 204

Rep: Reputation: 32
According to Wikipedia (on .csv):
Quote:
The CSV file format is very simple and supported by almost all spreadsheets and database management systems. Many programming languages have libraries available that support CSV files. Many implementations support changing the field-separator character and some quoting conventions, although it is safest to use the simplest conventions, to maximize the recipients' chances of handling the data.

Microsoft Excel will open .csv files, but depending on the system's regional settings, it may expect a semicolon as a separator instead of a comma, since in some languages the comma is used as the decimal separator. Also, many regional versions of Excel will not be able to deal with Unicode in CSV. One simple solution when encountering such difficulties is to change the filename extension from .csv to .txt; then opening the file from an already running Excel with the "Open" command.

When pasting text data into Excel, normally the tab character is used as a separator: If you copy "hello<tab>goodbye" into the clipboard and paste it into Excel, it goes into two cells. "hello,goodbye" pasted into Excel goes into one cell, including the comma. If you use the functionality "text to columns" within Excel and change the settings, it changes the settings for pasting in text data as well.

OpenOffice.org Calc and LibreOffice Calc handle CSV files and pasted text with a Text Import dialog asking the user to manually specify the delimiters, encoding, format of columns, etc.

There are many utility programs on Unix-style systems that can deal with at least some CSV files. Many such utilities have a way to change the delimiter character, but lack support for any other variations (or for Unicode). Some of the useful programs are:

column
cut
paste
join
sort
uniq (-f to skip comparing the first N fields)

Last edited by perbh; 10-15-2013 at 02:25 PM.
 
Old 10-15-2013, 01:38 PM   #3
rg3
Member
 
Registered: Jul 2007
Distribution: Slackware Linux
Posts: 498

Rep: Reputation: Disabled
In cases like this, when things start to get a bit complex, it's better to go for Python or Perl, IMHO. If I understood you properly, this Python program prints the list of fields in each line and can be extended from there:

Edit: discarded the old Python version, as there's a native csv module that handles these just fine.

Code:
#!/usr/bin/env python
import sys
import csv

# Print each row of the CSV named on the command line as a list of fields.
with open(sys.argv[1], 'r') as stream:
    reader = csv.reader(stream, delimiter=',', quotechar='"')
    for row in reader:
        print(row)
The "row" variable will hold a list of strings that can be converted to numbers or not. This sample data:

Code:
"one, two, three",10.5,7.5
"four, five, six",11.5,6.5
Produces the following output (note: Python is printing the lists raw here with brackets, and surrounding each string with single quotes):

Code:
$ ./sample.py data.csv 
['one, two, three', '10.5', '7.5']
['four, five, six', '11.5', '6.5']
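
From there, a minimal sketch of the actual merge for your case (assuming height and width are the 6th and 7th of the 26 columns, counting from 1, and that quoting every output field is acceptable to your loader):

Code:
#!/usr/bin/env python
import sys
import csv

# Sketch: merge columns 6 and 7 (1-based) into one "10.5 x 7.5"-style
# text field and write a new CSV.  newline='' lets the csv module handle
# line ends embedded inside quoted fields itself.  Note this also merges
# the two names in the title row, and QUOTE_ALL quotes numeric fields too.
with open(sys.argv[1], 'r', newline='') as src, \
        open(sys.argv[2], 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        row[5] = '%s x %s' % (row[5], row[6])  # combine height and width
        del row[6]                             # drop the old width column
        writer.writerow(row)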

Last edited by rg3; 10-15-2013 at 01:46 PM.
 
1 member found this post helpful.
Old 10-15-2013, 01:38 PM   #4
55020
Member
 
Registered: Sep 2009
Location: Yorks. W.R. 167397
Distribution: Slackware
Posts: 271
Blog Entries: 4

Rep: Reputation: 255
The answer depends on the tools you are comfortable with, how many files you have, and whether you will be doing this again in the future.

Plan A: Spreadsheets will happily import quoted numeric fields, even if they don't export them. So start by using sed to wrap all the numeric fields in quotes as if they were text fields. Then you can use awk, or cut/join, using "," (quote-comma-quote) as the consistent field delimiter, just as you suggested. If there are no embedded doubled-up quotes within your text fields (e.g. "Title: ""Hamlet""",1600,...) then this will be quite easy, but a truly general and repeatable solution would need really painful regular expressions.
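
If the sed regexes get painful, that first normalization step can also be sketched with Python's csv module (hypothetical datain.csv/dataout.csv names; note it re-quotes every field, numbers included):

Code:
#!/usr/bin/env python
# Plan A's first step sketched in Python instead of sed: re-emit every
# field quoted, so "," (quote-comma-quote) becomes a reliable delimiter
# for the later awk/cut/join stage.  Embedded doubled-up "" quotes are
# handled by the csv module for free.
import csv

with open('datain.csv', 'r', newline='') as src, \
        open('dataout.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        writer.writerow(row)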

Plan B: OpenOffice, as you suggest. If you work with people who do heavy Excel work, they will probably be comfortable with this solution, and may even want to help you with macros and templates. This would be a general solution that you could hand over to other people, if it's a long-term requirement.

Plan C: Perl. There is a CPAN module to read CSV files; it handles quoted fields just fine. Search the web for "perl csv" and you will be able to find lots of examples of how it is used. Because you are comfortable with awk, you will probably be able to pick up 'just enough' Perl to adapt those examples for what you need to do, using Google and the Camel Book. This is a good plan if you always wanted to learn some Perl ;-) I hate Perl, but this is what I do sometimes :O

Plan D: Read it into an sql database, munge it with sql, export it back to csv. Even if you know lots of sql already, this probably won't work (sqlite, for example, doesn't support quoted text fields in csv import).
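
That said, the import limitation can be sidestepped by doing the CSV parsing outside the database. A hypothetical Python sketch (made-up table and column names; a real file would have all 26 columns):

Code:
#!/usr/bin/env python
# Plan D sketch: parse the CSV with Python's csv module (avoiding
# sqlite's own .import and its quoting limits), munge with SQL, and
# export back to CSV.  'books', 'title', 'height', 'width' are made up.
import csv
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE books (title TEXT, height TEXT, width TEXT)')

with open('datain.csv', 'r', newline='') as src:
    rows = list(csv.reader(src))
db.executemany('INSERT INTO books VALUES (?, ?, ?)', rows[1:])  # skip header row

merged = db.execute(
    "SELECT title, height || ' x ' || width AS dimensions FROM books")

with open('dataout.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    writer.writerow(['title', 'dimensions'])
    writer.writerows(merged)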

I've had this problem myself, more than once, and have tried everything except Plan D :-(
Am I the only one who sees a great need for a "text only" pseudospreadsheet app, scriptable, with csv as its native data format?
 
2 members found this post helpful.
Old 10-15-2013, 03:02 PM   #5
perbh
Member
 
Registered: May 2008
Location: Republic of Texas
Posts: 204

Rep: Reputation: 32
@tronayne,
Can you give us a guesstimate of the number of fields and the max length of a field?

BTW, if you google for 'awk +csv' you'll find a couple of CSV decoders written in awk ...

Last edited by perbh; 10-15-2013 at 03:03 PM.
 
1 member found this post helpful.
Old 10-16-2013, 09:03 AM   #6
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 2,858

Original Poster
Rep: Reputation: 697
Wow. Who knew there were so many ways to skin a cat... well, I kind of did 'cause I've been doing Unix/Linux for a long, long time and skinning cats is our business, eh?

As it happens, I've been dealing with CSV files for decades for import into a database -- those have (almost all) been tab-delimited. I simply change the tabs to vertical bars, do a quick "filter" with C or AWK, and load what I need with the vertical-bar delimiter (that works lickety-split with Informix, which defaults to vertical-bar delimiters, as well as MySQL and PostgreSQL). The problem here is that the application's bulk-loader utility will only process comma delimiters and quoted text fields and there's no way to change that (and I don't want to write my own loader because the application database is spread out all over the place and it's just not practical).

These CSV files have been created by transcribing paper records into a spreadsheet; the folks doing the transcribing have a template, Calc or Excel, that they use for the entries. Those files get sent by e-mail or arrive on a flash drive and get massaged a little to prepare them for bulk loading (that's a little sed operation).

Here I have to combine two columns into one. No-brainer until I tried it. Arrgghh! Sure, I can do it in Calc, but I want to avoid doing it by hand. The only thing I know about Perl is that it's a pain in the butt.

@55020, your suggestion about importing into a database is feasible (no problem with SQL here). Might give that a shot and see what's what. Perl is probably out because I know absolutely nothing about programming in Perl (or Python, for that matter).

@perbh, your suggestion about "awk + csv" seems the most feasible; I know AWK, it's easy, and I'm pretty sure that will do the trick... well, maybe, eh?

The CSV files aren't terribly big (less than 1,000 rows, some as few as 10 rows). Lots of text with embedded newlines (the Ctrl-Enter trick), lots of embedded || to import, in this case, more than one author/illustrator of a book (it's all books). The data do not have to be in any particular order (the title row determines where a column goes), so I can simply take one column, take the other column, make it pretty, tack it to the end, and remove the two originals (or, with AWK, simply combine the two in place, or something like that). Whatever I do, the existing file will be written to a new file and that one will get loaded.

One lesson learned during a wasted youth is that CSV does not have to be quoted if the column delimiter does not appear in the data (that's why vertical bars are so handy and commas are not). The annoying part here is the insistence by the application bulk loader that the CSV file be comma-delimited and text columns be quoted.

Thanks to all for the help, I do appreciate it (and I get to learn something every day).
 
Old 10-16-2013, 09:32 AM   #7
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,512
Blog Entries: 27

Rep: Reputation: 1174
Going from the OP information, these CSVs are well formatted in the sense that any commas in the data are within double quotes, so commas outside double quotes can reliably be treated as column separators.

That being the case, you may like to look at the awk script attached to my blog post http://www.linuxquestions.org/questi...ess-book-3917/. It parses such a regular CSV (an Outlook contacts-format CSV, including line ends within the double-quoted data) and re-formats it into a CSV for import into OTRS.

While it is very specific to its own input and output formats, it could be adapted to your requirement.
 
Old 10-16-2013, 10:39 AM   #8
gnashley
Amigo developer
 
Registered: Dec 2003
Location: Germany
Distribution: Slackware
Posts: 4,724

Rep: Reputation: 449
"skinning cats is our business, eh?" -well yes, but only after they have been herded to the 'skinnery'.
 
1 member found this post helpful.
Old 10-16-2013, 12:56 PM   #9
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 2,858

Original Poster
Rep: Reputation: 697
Quote:
Originally Posted by catkin View Post
Going from the OP information, these CSVs are well formatted in the sense that any commas in the data are within double quotes, so commas outside double quotes can reliably be treated as column separators.
Nicely done and I appreciate the sample.

Thing is, I think I've finally figured out what to do, so I don't have to deal with comma-delimited fields with and without double quotes (they're not all text; some are integer, some are floating-point, some are ISO dates, etc.).

I have 26 fields. Numbers 6 and 7 are the ones I need to combine.

The spreadsheet is either an OpenOffice Calc (ods) file or Microsoft Excel; the Calc file comes from a template that is saved as a spreadsheet and later saved as a CSV (and similarly with Excel). There are embedded line feeds in some text fields. I could skip the save-as-ods step and just save as CSV, but there's some stuff that gets added to the latter (like changing the headers from something like "author" to "dc.contributor.author"), which sort of confuses the non-techie folks doing the transcribing from paper to confuser.

Essentially, saving these as CSV with tab delimiters makes it a read-write job where, when I see the 5th tab, I do special stuff with that field and the next field, then continue reading and writing, perhaps changing the tabs to commas. No parsing with AWK, just a simple little C program to handle the combined fields and the embedded line feeds. It's a one-off that will get this job done (can't use fgets because it stops scanning at a line feed; fread doesn't). Won't be terribly efficient, but what the heck: read a byte, write a byte, fiddle at the right place, read a byte, write a byte.

Thanks again for the input.
 
Old 10-16-2013, 04:14 PM   #10
artificiosus
LQ Newbie
 
Registered: Mar 2010
Posts: 3

Rep: Reputation: 3
It seems you have your solution, but I tend to have good luck manipulating CSV files in R.

datain.csv contains:
Code:
"Text","Height","Width"
"some stuff",10.7,7.5
"orange red",8.3,6.3
"blue green",6.5,2.5

Then in R:
Code:
> data <- read.csv("datain.csv")                # read the CSV into a data frame
> colnames(data)[2] <- "Dimensions"             # rename the Height column
> data$Dimensions <- paste(data$Dimensions, data$Width, sep=' by ')
> data <- subset(data, select = -Width)         # drop the now-redundant Width
> write.csv(data, "dataout.csv", row.names=F)   # character columns get quoted

dataout.csv contains:
Code:
"Text","Dimensions"
"some stuff","10.7 by 7.5"
"orange red","8.3 by 6.3"
"blue green","6.5 by 2.5"
 
2 members found this post helpful.
Old 10-17-2013, 08:11 AM   #11
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 2,858

Original Poster
Rep: Reputation: 697
Quote:
Originally Posted by artificiosus View Post
It seems you have your solution, but I tend to have good luck manipulating CSV files in R.
Now, that's one I never would have thought of -- pretty darn slick. Gonna keep it around for some dang thing that I just know will pop up sooner or later, thank you.

What I actually did was grab my standard C template (bare code with all the command-line argument handling, file declarations, etc., that doesn't actually do anything) and added this:
Code:
                /*      read one byte at a time from the file   */
                c = 0;
                while ((n = fread (buf, sizeof (char), 1, in))) {
                        /*      count tabs to track the current field   */
                        if (buf[0] == '\t') {
                                c++;
                        }
                        /*      at the tab between the height and width
                                fields, emit " x " instead of the tab so
                                the two fields become one               */
                        if (c == 6) {
                                if (width) {
                                        (void) fprintf (stdout, " x ");
                                        width = FALSE;
                                        continue;
                                }
                        }
                        (void) fputc (buf[0], stdout);  /* was "%s" on a 1-byte buffer */
                        /*      25 tabs means the 26-field record is
                                complete; reset for the next row        */
                        if (c >= 25) {
                                c = 0;
                                width = TRUE;
                        }
                }
                (void) fprintf (stdout, "c = %d\n", c);
                /*      close the input file            */
                if (fclose (in))
                        (void) fprintf (stderr,
                            "%s:\tcan't close %s\n",
                            argv [0], argv [optind]);
                optind++;
        }
Got the job done; maybe not the most efficient way, but it got the job done.

I do like the look of that R thing, though, and I'm going to play with it a little and see.

Got 'er done, thanks to all for the helpful suggestions (and ways and means).
 
1 member found this post helpful.
  

