LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Parsing XLS or CSV in Perl - what and what not (https://www.linuxquestions.org/questions/programming-9/parsing-xls-or-csv-in-perl-what-and-what-not-410339/)

me_the_apprentice 02-01-2006 06:32 AM

Parsing XLS or CSV in Perl - what and what not
 
Hi Everyone!!
You might be aware of "use SpreadSheet::ParseExcel" thing in Perl. I am not pro on that/Perl but want to understand this.. what we shall prefer -- a CSV file or a plain XLS file for reading the parsing the data. What are the pros and cons of using CSV's over XLS; where we shall prefer them over XLS and where not.

Thanks for your comments/thoughts in advance...

Michael_S 02-01-2006 01:37 PM

Can you be more specific as to what you're trying to do?

I haven't used any software that parses XLS files (except for Excel itself, naturally).

Comma-Separated Value files are easy to use and manipulate, but obviously they lack some things that are built into the XLS file format. The company I work for has databases for standalone applications at sites, and we extract data from each table as .csv files, zip the files, encrypt the zip archive, FTP it to the main office, and run an application to unencrypt, extract, and integrate the data into the main database.

Our application uses Java just because all of the other software we use is Java. But each step in the process would be easy to do with Perl.

chrism01 02-01-2006 10:23 PM

Concur with Michael_S; are these spreadsheets basically data or do you want to be able to handle spreadsheets with lots of formulae/macros etc?
Most tools eg DBs have an 'import as CSV' option for data, so I'd go with that if possible.
The other issue is that different versions of Excel may need to be parsed differently for complex spreradsheets, as MS changes things (ie similar to the .doc 'std' changes gradually)

me_the_apprentice 02-02-2006 10:10 PM

Ok!
So here's the deal; I am a Build and Release Admin, the developer put-into it varied information like CVS version of code file, location, domain it belongs to and DB schema the code needs to get released to etc. I dont need Macro's and all (I didnt faced any issues as of now with different version of MSExcel). But recently, an issue came up -- a developer typed in CVS version as 1.10 and it got translated to "1.1".

So wanted to understand whether these kind of "automatic" manipulations occur in CSV's as well. Quite obviously, I am thinking of three approaches:
#1. I shall make my code better to take care of such small issues; (though I am not very comfortable as an error can come my way then some other wil rise up on later date)
#2. Find out if there are some filters that can be put on format of information that gets fed into these spreadsheets; but again, how to ensure that it gets read as is and no translation happens on top of it :-)
#3. I switch to CVS from XLS -- but not sure about what will be the impact on my code -- will I still be able to use "use SpreadExcel::ParseExcel". Moreover, what will be the impact on overall performance of my code. If it doesnt matter much; then what are the things to watch out for when working with CSV's. If you have some URL or some doc on this; then plz plz share it.

Again, much thanks for your thoughts!!
- The Apprentice

bigearsbilly 02-03-2006 10:49 AM

Quote:

developer typed in CVS version as 1.10 and it got translated to "1.1".
surely a case of the wrong format for a column?

Quote:

whether these kind of "automatic" manipulations occur in CSV's as well
depends on the translator I'd say.
In my experience excel is absolutely *crap* with CSV.
Check the data carefully.
It's really brain dead with commas even inside quotes.

Michael_S 02-03-2006 03:09 PM

Do you mean CSV as in Comma Seperated Value, or CVS, as in Concurrent Versioning System? Build and release tracking sounds like CVS, not CSV.

If you are talking about CSV, then I'm not sure I can help. I think the way data is exported or imported is pretty dependent upon the program you use.

me_the_apprentice 02-04-2006 12:07 AM

Michael, you gave it a thought, that itself honoured my question. Sorry for the typo; I meant Comma Seperated Files only (also, we are using CVS tool just like any other development project for maintaining the document/code repositories).

Bigearsbilly, it would be really good for this community (and me) if you can list down things that one shall be careful about (from testing perspective) when migrating to CSV's from XLS.

Thanks!!
- The Apprentice

chrism01 02-07-2006 10:36 PM

If you'd like a DB style approach, try this: http://search.cpan.org/~jzucker/DBD-...dle/DBD/CSV.pm
or
http://search.cpan.org/~tmtm/Text-CS.../CSV/Simple.pm, & see the note about Text::CSV_XS module on this page.
If you are the Build & Release admin, I'd suggest storing the orig input plus actions taken & results in a simple MySQL DB, so you can track for your & your mgr's benefit,
unless you think you are dealing with a trivial amt of work.
You could use this to automate builds perhaps... & track dependencies...

bigearsbilly 02-09-2006 03:50 AM

Well, I find excel to be pathetic in how it translates CSVs. For a format it is supposed
to understand. In particular if you have a comma in a field, if it is in quotes it
should be ignored. But my excel is so thick you end up with a new column regardless.
V. irritating. So I end up stripping the commas out.

Michael_S 02-09-2006 08:06 AM

Quote:

Originally Posted by bigearsbilly
Well, I find excel to be pathetic in how it translates CSVs. For a format it is supposed
to understand. In particular if you have a comma in a field, if it is in quotes it
should be ignored. But my excel is so thick you end up with a new column regardless.
V. irritating. So I end up stripping the commas out.

Can you escape the commas with a backslash or some other character?

bigearsbilly 02-09-2006 08:32 AM

I dunno. Wasn't really a priority so I didn't go further.

I think Open Office was okay with it though if i remember.

bigearsbilly 02-24-2006 08:11 AM

Well. If you are still interested.
Have been parsing a spreadsheet today.
My excel also does XML output and something called IDF.

But, I found the best results where to save->as "tab delimited text" then
you almost certainly won't have a delimiter tab in the data itself.
(tab switches to next cell usually so you can't do it normally)
then you can easily split the data.

Of course for a simple sheet this works.

schneidz 02-24-2006 11:45 AM

my experience is with dumping data to a csv and parsing it with shell script or 'c'.

csv is better because it is portable on any os. (because it is straight ascii text as opposed to proprietary bin format)

i think i heard a cammand line xls2csv program but not sure how/ if it works.

there are probably ways to escape ',' and '/' in the script but i found it much easier never to use those charecters in excel.

sorry billy, i'm not a perl fan as you are -you seem to post a perl solution every post (j/k - just kidding) but at least i read them.

for csv awk will be your friend.

also excel tends to auto format cells with preceding 0's or numbers not stored in $$$.¢¢ format so i format everything as text.

good luck.

bigearsbilly 02-28-2006 03:44 AM

it all boils down to "know your data, love your data" (maybe not love)

perl moi?
I used to hate perl, now I love it. (Same with vi)
After a few months with it I absolutely adore it.

Try it, persevere, it's absolutely fab.


All times are GMT -5. The time now is 11:20 PM.