Old 04-01-2020, 08:57 AM   #1
mermelmadness
LQ Newbie
 
Registered: Jan 2017
Location: Brooklyn, NY
Posts: 3

Rep: Reputation: Disabled
Perform Series of Edits in CSVs


I have been tasked with making several edits to a few thousand CSVs. I could do each edit individually, but I'm looking for a way to script them all together. I also don't have much experience working with CSVs, so advice from someone more experienced would help prevent me from making a catastrophic mistake. There are two types of CSVs that I'll have to edit, Movie and Series. AFAIK, there will be no external indicator telling me which CSV follows which spec; only the column headers will indicate it, so I'd need a way to discern the type from those. Please see the requirements below:

IF ‘Series’ Spec is Present in the CSV

REMOVE the following columns

Column D (ep_tms_id)
Column G (ep_number)

RENAME the following columns in Row 1

Column A | FROM ‘ep_filename’ to ‘reference_filename’
Column B | FROM ‘ep_custom_id’ to ‘custom_id’

NEW Column D | FROM ‘ep_title’ to ‘title’
NEW Column E | FROM ‘ep_description’ to ‘description’

ADD the Following Column and Hardcoded Value

NEW Column G
Row 1 Value = reference_exclusions
Row 2 Hardcoded Value = 00:00-01:30

IF ‘Movie’ Spec is Present in the CSV

REMOVE the following columns

Column D (isan)
Column E (eidr)

RENAME the following columns in Row 1

Column A | FROM ‘movie_filename’ to ‘reference_filename’
Column C | FROM ‘add_movie_asset_labels’ to ‘add_asset_labels’

ADD the Following Column and Hardcoded Value

NEW Column G
Row 1 Value = reference_exclusions
Row 2 Hardcoded Value = 00:00-01:30

Any and all help would be greatly appreciated. I'm only a novice at Linux with some sed and awk experience, but nowhere near advanced enough for this sort of thing. Unfortunately, I'm the most experienced person in my group. TYIA.
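
A minimal sketch, in Python (suggested later in the thread), of the header-based detection the requirements imply: since only the column headers distinguish the two specs, a script can read just the first row and look for ep_filename versus movie_filename. The file name and helper name here are made up for illustration, not the OP's code.
Code:
import csv

def detect_spec(path):
    """Return 'series', 'movie', or None based on the header row alone."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    if "ep_filename" in header:
        return "series"
    if "movie_filename" in header:
        return "movie"
    return None

print(detect_spec("example.csv"))  # "example.csv" is a hypothetical file name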
 
Old 04-01-2020, 11:52 AM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by mermelmadness View Post
I have been tasked with making several edits to a few thousand CSVs. I could do each edit individually, but I'm looking for a way to script them all together. [...] Please see the requirements below: [...]
We'll be glad to help, so post what you've written/done/tried on your own so far, post samples of the input data along with the desired output data, and read the "Question Guidelines" link in my posting signature.

While you most probably can do this in awk, for someone new to Linux/scripting it may be easier (in my opinion) to do it in bash, since you've got conditional logic involved. The bash scripting tutorials have plenty of examples of how to read a file, split lines on delimiters, and assign the pieces to variables. Once you've got those things down, you can loop through the file (or files) and output whatever you need.
 
1 members found this post helpful.
Old 04-01-2020, 12:11 PM   #3
agillator
Member
 
Registered: Aug 2016
Distribution: Mint 19.1
Posts: 419

Rep: Reputation: Disabled
Since you mentioned catastrophic mistakes, I will first state the obvious thing that is often overlooked: DO NOT CHANGE THE CSV FILES!!! Change copies of the files.

Having said that, awk, bash, and lots of other things can probably help you do the job, but it sounds as if you are going to have to learn some language/tool to do it, so as long as you are learning something new you might take a look at Perl. This is the kind of thing it was designed for, and you need not become particularly proficient for a task like this. You would probably need to open files, read them, and write to other files. Regular expressions would be handy; they are very powerful. Two other functions you might use are 'split' and 'sprintf'. It may not be worth the effort - I don't know how you think, and so on - but you might take a look at a beginning tutorial and see how difficult it would be for YOU. I suspect you could learn what you need in a couple of hours and build your script as you go. Although I use bash a good bit, I find it ungainly for my purposes, while basic Perl seems to fit the way I think.
 
1 members found this post helpful.
Old 04-01-2020, 12:13 PM   #4
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
There's a very nice tool I'd recommend for manipulating CSV data from a shell script: Miller (it's packaged as miller in most distros, but in Slackonly the package name is mlr).

Speaking of Perl, you should have a look at the CPAN module Text::CSV. It's packaged as libtext-csv-perl in Debian-based distros, and as either perl-Text-CSV or perl-text-csv in most others.

Last edited by shruggy; 04-01-2020 at 12:24 PM.
 
1 members found this post helpful.
Old 04-01-2020, 05:42 PM   #5
agillator
Member
 
Registered: Aug 2016
Distribution: Mint 19.1
Posts: 419

Rep: Reputation: Disabled
In thinking about this today I wondered about the format of your files. Are they text files or can each field be treated as text? What is the separator character? Is there anything special or unusual about the format? If they are fairly straightforward, what you want to do should not be difficult. Please don't go trying to kill a gnat with a cannon. It would be very easy to use a much bigger tool than necessary, which could make your job far harder than it needs to be if you don't already know how to use it, and I suspect you don't know the tools well yet.

Just thinking out loud. Let us know your thoughts so we can help. One caveat, though. You know this is just the first time. When you show you can do it and how easily you can do it, it will be your job whenever it needs to be done, anything from hourly to once a decade. Of course that may also make you indispensable - job security.

Last edited by agillator; 04-01-2020 at 05:43 PM.
 
1 members found this post helpful.
Old 04-01-2020, 06:37 PM   #6
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320
Scripting this in Python would be a good option. It's what I'd do, personally.

https://docs.python.org/3/library/csv.html
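
For illustration only (this is not dugan's code), a minimal sketch of the Series-spec transformation from post #1 using the csv module: drop ep_tms_id/ep_number, rename the remaining columns, and add the hardcoded reference_exclusions value to each data row. The file names are made up, and it writes to a new file rather than editing the original, per the advice above.
Code:
import csv

# Renames and drops taken from the Series spec in post #1.
RENAME = {
    "ep_filename": "reference_filename",
    "ep_custom_id": "custom_id",
    "ep_title": "title",
    "ep_description": "description",
}
DROP = {"ep_tms_id", "ep_number"}

with open("series_in.csv", newline="") as src, \
     open("series_out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    # Output header: keep column order, skip dropped columns, apply renames.
    out_fields = [RENAME.get(c, c) for c in reader.fieldnames if c not in DROP]
    out_fields.append("reference_exclusions")
    writer = csv.DictWriter(dst, fieldnames=out_fields)
    writer.writeheader()
    for row in reader:
        new_row = {RENAME.get(c, c): v for c, v in row.items() if c not in DROP}
        new_row["reference_exclusions"] = "00:00-01:30"  # hardcoded value from the spec
        writer.writerow(new_row)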

There are also command-line CSV parsers. I haven't used the following, but they came up in a search:

https://csvkit.readthedocs.io/en/latest/
https://github.com/BurntSushi/xsv

The second one is by the author of ripgrep, and is interesting if only for that reason.
 
2 members found this post helpful.
Old 04-01-2020, 06:58 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4121
Some thoughts from me as well:
- if you don't know Perl, this is not the time to start; you seem to have enough to confuse you already.
- if you know a little awk, use that - it is designed for columnar data.
- be careful about deleting and adding columns at the same time. It can be done, but carefully. If you can do (all) the deletions first, "cut" might be an easier option.
- don't worry about running multiple passes - the file will likely stay resident in the page cache, so there will be no extra I/O load, and it is easier to keep things straight in your head.
- work on a representative subset of the data - say the first 10 lines or similar - and (maybe) let it output to stdout so you can watch for mis-steps.

Yep, like probably most people here, I've been there and done that.
 
1 members found this post helpful.
Old 04-01-2020, 07:09 PM   #8
mermelmadness
LQ Newbie
 
Registered: Jan 2017
Location: Brooklyn, NY
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by agillator View Post
In thinking about this today I wondered about the format of your files. Are they text files or can each field be treated as text? What is the separator character? Is there anything special or unusual about the format? [...]
I am very close to asking my group leader to let me pursue further training at the group's expense if they keep asking me for these things. It's not the first time, but it's by far the most complex task I've been given involving scripting and the Linux CLI. My main trades are video technician and broadcast engineer, but I've had enough job experience to handle the simpler tasks (or at least to find the answer on Stack Overflow).

The CSVs are simple text, no non-alphanumeric characters, and the separators are commas.
 
Old 04-07-2020, 08:55 AM   #9
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,371

Rep: Reputation: 2750
If you are considering using awk, this is a useful tip.
 
2 members found this post helpful.
Old 04-07-2020, 10:09 AM   #10
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by agillator View Post
Since you mentioned catastrophic mistakes, I will first state the obvious thing that is often overlooked: DO NOT CHANGE THE CSV FILES!!! Change copies of the files.
This is probably the first piece of advice to take to heart.

In addition to this, whatever process you develop, I'd start with one file - or one file of each type, since you say there are two types of data.

Make that copy, write a program, script, or command to do the edits, and have it write to yet another copy as part of that process, so that you still have the original file and a result.

Determine whether you've got the concept working; that will probably involve some debugging and retries, so be prepared for it.

Once you've refined that for each file type, work with a handful of each and develop a loop or iterative process that will deal with the whole set of files handed to it. But again, do that with 5 or 10 of each file type, and try to mirror any directory structure you may have. I'm assuming the thousands of files are not all in a single directory level?

My points are:
  1. Get "a" concept working.
  2. Test it on some representative files.
  3. Run a test against a directory structure of files, to get the looping, the file search, and the processing of the search results all working.
  4. Double-check it all, perhaps with another sample of different files.
  5. Run it all on your "copy" of your originals.
  6. As part of all that, I still recommend you do not obliterate your originals; instead, write your output to a parallel directory.
  7. Consider that with thousands of files there is likely to be an outlier - that is, a file with a totally invalid format or unexpected content that causes problems and/or totally derails your processing loop.
  8. Figure out a way to have your program/script keep a log that saves your progress, notes what was done, and records any errors or warnings it encountered (a rough sketch follows below). That debug log can certainly be reduced greatly once you've fully proven your overall technique/algorithm.
Just a few tips. I hear text processing and thousands of files. The technical feats are not the problem; the real concern here is doing it without losing or corrupting your data. Considering that it's a LOT of data, I'm just reiterating that you should approach it with appropriate caution.

Training is usually helpful, but mainly at the raw technical level, where it will help you with script writing or programming. Training will not necessarily teach you a process to follow when tackling a project, even though following a good process is what raises your likelihood of a successful outcome. A lot of that is common sense, attention to detail, and the hard-won lessons of the multitude of people who have gone before you and already made some really bad clinkers of mistakes.
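
A minimal sketch of points 7 and 8, assuming the Python route suggested in post #6 (the process_csv placeholder, the log file name, and the directory names are all hypothetical): each file is wrapped in try/except so one malformed file cannot derail the whole run, output goes to a parallel tree, and a log records what happened to every file.
Code:
import logging
from pathlib import Path

logging.basicConfig(filename="csv_edit.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def process_csv(src: Path, dst: Path) -> None:
    """Placeholder for the per-file transformation (see the earlier sketches)."""
    dst.write_text(src.read_text())  # stand-in: just copies the file for now

src_root = Path("originals")   # working copy of the originals, never the real ones
dst_root = Path("converted")   # parallel output tree

for src in src_root.rglob("*.csv"):
    dst = dst_root / src.relative_to(src_root)
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        process_csv(src, dst)
        logging.info("converted %s -> %s", src, dst)
    except Exception:
        logging.exception("failed on %s", src)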

If you get specific with a script, please post what you have and ask for some assistance.
 
1 members found this post helpful.
Old 04-29-2020, 11:01 AM   #11
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
cat mmm.mlr
PHP Code:
if (is_present($ep_filename)) {
  $* = {
    "reference_filename": $ep_filename,
    "custom_id": $ep_custom_id,
    "colC": $colC,
    "title": $ep_title,
    "description": $ep_description,
    "colH": $colH,
    "reference_exclusions": "00:00-01:30",
    "colI": $colI
  }
} elif (is_present($movie_filename)) {
  $* = {
    "reference_filename": $movie_filename,
    "colB": $colB,
    "add_asset_labels": $add_movie_asset_labels,
    "colF": $colF,
    "colG": $colG,
    "colH": $colH,
    "reference_exclusions": "00:00-01:30",
    "colI": $colI
  }
}

cat 1mmm.csv
Code:
ep_filename,ep_custom_id,colC,ep_tms_id,ep_title,ep_description,ep_number,colH,colI
filename,my_id,valC,to_be_removed1,my_title,a_description,to_be_removed2,valH,valI
mlr --csv put -f mmm.mlr 1mmm.csv
Code:
reference_filename,custom_id,colC,title,description,colH,reference_exclusions,colI
filename,my_id,valC,my_title,a_description,valH,00:00-01:30,valI
cat 2mmm.csv
Code:
movie_filename,colB,add_movie_asset_labels,isan,eidr,colF,colG,colH,colI
my_movie_name,valB,my_movie_asset_labels,tbr1,tbr2,valF,valG,valH,valI
mlr --csv put -f mmm.mlr 2mmm.csv
Code:
reference_filename,colB,add_asset_labels,colF,colG,colH,reference_exclusions,colI
my_movie_name,valB,my_movie_asset_labels,valF,valG,valH,00:00-01:30,valI
Adding -I to the mlr command line will process files in-place like sed -i:
Code:
mlr -I ...
____
See http://johnkerl.org/miller/doc/reference-dsl.html

Last edited by shruggy; 04-29-2020 at 11:19 AM.
 
  

