Perform Series of Edits in CSVs

mermelmadness · 04-01-2020, 08:57 AM

I have been tasked with making several edits to a few thousand CSVs. Individually I could do each one but I'm looking for a way to script it all together. Also, I don't have much experience working with CSVs so someone more experienced would help prevent me from making a catastrophic mistake. Also, there are two types of CSVs that I'll have to edit, Movie and Series. AFAIK, there will be no external indicator telling me which CSV is which spec. Only the column headers will be an indicator, so I'd need a way to discern from that. Please see requirements below:

IF ‘Series’ Spec is Present in the CSV

REMOVE the following columns

Column D (ep_tms_id)
Column G (ep_number)

RENAME the following columns in Row 1

Column A | FROM ‘ep_filename’ to ‘reference_filename’
Column B | FROM ‘ep_custom_id’ to ‘custom_id’

NEW Column D | FROM ‘ep_title’ to ‘title’
NEW Column E | FROM ‘ep_description’ to ‘description’

ADD the Following Column and Hardcoded Value

NEW Column G
Row 1 Value = reference_exclusions
Row 2 Hardcoded Value = 00:00-01:30

IF ‘Movie’ Spec is Present in the CSV

REMOVE the following columns

Column D (isan)
Column E (eidr)

RENAME the following columns in Row 1

Column A | FROM ‘movie_filename’ to ‘reference_filename’
Column C | FROM ‘add_movie_asset_labels’ to ‘add_asset_labels’

ADD the Following Column and Hardcoded Value

NEW Column G
Row 1 Value = reference_exclusions
Row 2 Hardcoded Value = 00:00-01:30

Any and all help would be greatly appreciated. I'm only a novice at Linux with some sed and awk experience but no where near advanced for this sort of thing. Unfortunately I'm the most experienced person in my group. TYIA.

TB0ne · 04-01-2020, 11:52 AM

Quote:

Originally Posted by mermelmadness

I have been tasked with making several edits to a few thousand CSVs. Individually I could do each one but I'm looking for a way to script it all together. Also, I don't have much experience working with CSVs so someone more experienced would help prevent me from making a catastrophic mistake. Also, there are two types of CSVs that I'll have to edit, Movie and Series. AFAIK, there will be no external indicator telling me which CSV is which spec. Only the column headers will be an indicator, so I'd need a way to discern from that. Please see requirements below:

IF ‘Series’ Spec is Present in the CSV

REMOVE the following columns

Column D (ep_tms_id)
Column G (ep_number)

RENAME the following columns in Row 1

Column A | FROM ‘ep_filename’ to ‘reference_filename’
Column B | FROM ‘ep_custom_id’ to ‘custom_id’

NEW Column D | FROM ‘ep_title’ to ‘title’
NEW Column E | FROM ‘ep_description’ to ‘description’

ADD the Following Column and Hardcoded Value

NEW Column G
Row 1 Value = reference_exclusions
Row 2 Hardcoded Value = 00:00-01:30

IF ‘Movie’ Spec is Present in the CSV

REMOVE the following columns

Column D (isan)
Column E (eidr)

RENAME the following columns in Row 1

Column A | FROM ‘movie_filename’ to ‘reference_filename’
Column C | FROM ‘add_movie_asset_labels’ to ‘add_asset_labels’

ADD the Following Column and Hardcoded Value

NEW Column G
Row 1 Value = reference_exclusions
Row 2 Hardcoded Value = 00:00-01:30

Any and all help would be greatly appreciated. I'm only a novice at Linux with some sed and awk experience but no where near advanced for this sort of thing. Unfortunately I'm the most experienced person in my group. TYIA.

We'll be glad to help; so post what you've written/done/tried on your own so far, and post samples of the input data, along with the output data, and read the "Question Guidelines" link in my posting signature.

While you most probably can do this in awk, for someone new in Linux/scripting, it may be easier to do this in bash (in my opinion), since you've got conditional logic involved. The bash scripting tutorials have plenty of examples on how to read a file, split lines on delimiters and assign them to variables. Once you've got those things done, you can loop through the file (or files) and output whatever you need.

agillator · 04-01-2020, 12:11 PM

First I will state the obvious that is often overlooked since you mentioned catastrophic mistakes. DO NOT CHANGE THE CSV FILES!!! Change copies of the files.

Having said that, awk, bash and lots of other things can probably help you do the job but it sounds as if you are going to have to learn some language/tool to do it, so as long as you are learning something new you might take a look at perl. This is the kind of thing it was designed for and you need not become particularly proficient for a task like this. You probably would need to open files, read them, and write to other files. Regular expressions would be handy, they are very powerful. Two other commands you might use would be 'split' and 'sprintf'. It may not be worth the effort, I don't know how you think and so on. But you might take a look at a beginning tutorial and see how difficult it would be for YOU. I suspect you could learn what you needed in a couple of hours and build your script as you do it. Although I use bash a good bit I find it ungainly for my purposes while basic perl seems to fit the way I think.

shruggy · 04-01-2020, 12:13 PM

There's a very nice tool that I'd recommend when manipulating CSV data from a shell script, miller (it's packaged as miller in most distros, but in Slackonly its package name is mlr).

Speaking of Perl, you should have a look at CPAN module Text::CSV. It's packaged as libtext-csv-perl in Debian-based distros, and as either perl-Text-CSV or perl-text-csv in most others.

agillator · 04-01-2020, 05:42 PM

In thinking about this today I wondered about the format of your files. Are they text files or can each field be treated as text? What is the separator character? Is there anything special or unusual about the format? If they are fairly straightforward what you want to do looks like it should not be difficult. Please don't go trying to kill a gnat with a cannon. It would be very easy to use a much bigger tool than necessary which could make your job far harder than need be if you don't already know how to use it and I suspect you don't know much about the tools yet.

Just thinking out loud. Let us know your thoughts so we can help. One caveat, though. You know this is just the first time. When you show you can do it and how easily you can do it, it will be your job whenever it needs to be done, anything from hourly to once a decade. Of course that may also make you indispensable - job security.

dugan · 04-01-2020, 06:37 PM

Scripting this in Python would be a good option. It's what I'd do, personally.

https://docs.python.org/3/library/csv.html

There are also command-line CSV parsers. I haven't used the following, but they came up in a search:

https://csvkit.readthedocs.io/en/latest/
https://github.com/BurntSushi/xsv

The second one is by the author of Ripgrep, and is interesting if only for that reason.

syg00 · 04-01-2020, 06:58 PM

Some thoughts from me as well:
- if you don't know perl, this is not the time to start; you seem to have enough to confuse you already.
- if you know a little awk use that - it is designed for columnar data.
- be careful of deleting and adding columns at the same time. Can be done, but carefully. If you can do (all) the deletions first "cut" might be an easier option.
- don't worry about running multiple passes - the file will likely stay resident in page cache and there will be no extra I/O load. Also easier to keep things straight in your head.
- work on a representative subset of the data - say the first 10 lines or similar, and (maybe) let it output to stdout so you can see it in action for mis-steps.

Yep, like probably most people been there, done it.

mermelmadness · 04-01-2020, 07:09 PM

Quote:

Originally Posted by agillator

In thinking about this today I wondered about the format of your files. Are they text files or can each field be treated as text? What is the separator character? Is there anything special or unusual about the format? If they are fairly straightforward what you want to do looks like it should not be difficult. Please don't go trying to kill a gnat with a cannon. It would be very easy to use a much bigger tool than necessary which could make your job far harder than need be if you don't already know how to use it and I suspect you don't know much about the tools yet.

Just thinking out loud. Let us know your thoughts so we can help. One caveat, though. You know this is just the first time. When you show you can do it and how easily you can do it, it will be your job whenever it needs to be done, anything from hourly to once a decade. Of course that may also make you indispensable - job security.

I am very close to asking my group leader to allow me to pursue further training on the groups expense if they continue asking me for these things. It's not the first time but it's by far the most complex task I've been given regarding scripting and Linux CLI. My main trades are video technician and broadcast engineer but I had enough job experience to perform the simpler tasks (or at least find the answer on Stack Overflow.

The CSVs are simple text, no non-alphanumeric characters, and the separators are commas.

allend · 04-07-2020, 08:55 AM

If you are considering using awk, this is a useful tip.

rtmistler · 04-07-2020, 10:09 AM

Quote:

Originally Posted by agillator

First I will state the obvious that is often overlooked since you mentioned catastrophic mistakes. DO NOT CHANGE THE CSV FILES!!! Change copies of the files.

This is probably the first applicable information to have considered.

In addition to this, whatever process you develop, I'd start with one file or one file of each type, since you cite there are files containing two types of data.

Make that copy, write a program or script, or command to do this, and also make a copy of the copy as part of that process, so that you still have that original file and a result.

Determine if you've got the concept working, probably involving some debug and retries, just so you know and are prepared.

Once you've refined that for each file. Then maybe work with a handful of each file and develop a loop or automatic-iterative process which will deal with the whole number of files handed to it. But again, do that with 5 or 10 of each file type, and try to mirror any directory structure you may have. I'm assuming that thousands of files are not all in one level of directory?

My points are:

Get "a" concept working.
Test it on some representative files.
Run a test of a structure of files to get loops working or file search and using the search results, plus processing them, working.
Double check that all, perhaps another sample of different files.
Run it all on your "copy" of your originals.
As part of all that, still recommend you do not obliterate your originals and instead make your copied output in a parallel directory.
Consider that with thousands of files, there's likely to be an outlier - that is a file which has a totally invalid format, or unexpected information which causes problems and/or totally derails your processing loop.
Figure out a way to have a log of your program/script which saves your progress, notes what was done, and whether or not there were any errors or warnings which it encountered. Said debug log can certainly be reduced greatly once you've proven your overall technique/algorithm fully.

Just a few tips. I hear text processing and thousands of files. It's not a problem to do the technical feats, the real concern here is doing it without losing your data or corrupting it. Considering that it's a LOT of data, just reiterating that you should approach it with appropriate caution.

Training is usually helpful, but I feel at the raw technical level where it will help you with script writing or programming. Training will not necessarily teach you a process to follow to tackle a project where your likelihood of a successful outcome is raised higher because you've followed a good process. A lot of that is common sense, attention to detail, and also recommended on the backs of some multitude of people who have gone before you and made some really bad clinkers of mistakes already.

If you get specific with a script, please post what you have and ask for some assistance.

shruggy · 04-29-2020, 11:01 AM

cat mmm.mlr

PHP Code:



if (is_present($ep_filename)) {
  $* = {
    "reference_filename": $ep_filename,
    "custom_id": $ep_custom_id,
    "colC": $colC,
    "title": $ep_title,
    "description": $ep_description,
    "colH": $colH,
    "reference_exclusions": "00:00-01:30",
    "colI": $colI,
  }
} elif (is_present($movie_filename)) {
  $* = {
    "reference_filename": $movie_filename,
    "colB": $colB,
    "add_asset_labels": $add_movie_asset_labels,
    "colF": $colF,
    "colG": $colG,
    "colH": $colH,
    "reference_exclusions": "00:00-01:30",
    "colI": $colI,
  }
}

cat 1mmm.csv

Code:

ep_filename,ep_custom_id,colC,ep_tms_id,ep_title,ep_description,ep_number,colH,colI
filename,my_id,valC,to_be_removed1,my_title,a_description,to_be_removed2,valH,valI

mlr --csv put -f mmm.mlr 1mmm.csv

Code:

reference_filename,custom_id,colC,title,description,colH,reference_exclusions,colI
filename,my_id,valC,my_title,a_description,valH,00:00-01:30,valI

cat 2mmm.csv

Code:

movie_filename,colB,add_movie_asset_labels,isan,eidr,colF,colG,colH,colI
my_movie_name,valB,my_movie_asset_labels,tbr1,tbr2,valF,valG,valH,valI

mlr --csv put -f mmm.mlr 2mmm.csv

Code:

reference_filename,colB,add_asset_labels,colF,colG,colH,reference_exclusions,colI
my_movie_name,valB,my_movie_asset_labels,valF,valG,valH,00:00-01:30,valI

Adding -I to the mlr command line will process files in-place like sed -i:

Code:

mlr -I ...

____
See http://johnkerl.org/miller/doc/reference-dsl.html