LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 06-17-2017, 10:02 AM   #1
Sivarajkhamithkar
LQ Newbie
 
Registered: Jun 2017
Posts: 5

Rep: Reputation: Disabled
Need to validate CSV file using shell script


Hi All,

Hope you are doing good. I'm new to shell scripting. So dont know much.
I need to learn that how unix script to validate csv files and print which file and row having issues in 10000 records.

Scenario:
Suppose many *.csv files are in a directory.
Each CSV file having multiple data types and has 10000 rows with comma separated semi structured data.
Script has to go through all the csv files and check if all the field with comma separated or not. Example , ,(NULL) OR (Without comma) OR | etc..
then script print the corresponding names on screen which csv files and row having invalid format.

Can anyone help me out this with script. its a bit difficult task for a newbie like me.

looking forward from you guys..

Thanks
Sivaraj
Email: sivaraj111raajjz@gmail.com
 
Old 06-17-2017, 10:20 AM   #2
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,171
Blog Entries: 3

Rep: Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061
I'd look at perl with Text::CSV from CPAN.

Are you trying to normalize the data format or just flag lines with incorrect syntax?
 
Old 06-17-2017, 10:23 AM   #3
Sivarajkhamithkar
LQ Newbie
 
Registered: Jun 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
Hi,

Thanks for responding. The agenda is to avoid processing incorrect syntax file and reason for incorrect syntax so that we can give justification to business why we are unable to process specific files.

Thanks
Sivaraj
 
Old 06-17-2017, 10:26 AM   #4
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,171
Blog Entries: 3

Rep: Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061
Can you post a few sanitized lines of typical errors you are looking to spot?

Please wrap them in [code] [/code] tags so that they are easier to read.
 
Old 06-17-2017, 10:44 AM   #5
Sivarajkhamithkar
LQ Newbie
 
Registered: Jun 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
Hi Turbocapitalist,

Thanks. Main agenda is to find incorrect file having incorrect syntax.

Possbile spots are below

1) Space/null ex: [], ,[/]
2) Pipe between string ex: []|[/]
3) New line ex: []
[/]

Thanks
Sivaraj Khamithkar

Last edited by Sivarajkhamithkar; 06-17-2017 at 10:47 AM.
 
Old 06-17-2017, 10:54 AM   #6
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,171
Blog Entries: 3

Rep: Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061Reputation: 2061
Quote:
Originally Posted by Sivarajkhamithkar View Post
Main agenda is to find incorrect file having incorrect syntax.
Right. Please show two or three examples, with any sensitive content replaced with xxxxxx or something.
 
Old 06-17-2017, 11:05 AM   #7
Sivarajkhamithkar
LQ Newbie
 
Registered: Jun 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
sorry for not providing proper information. please find the examples.

1) "EXCEPTION","XXXXX","08/18/2016 09:10:25", ,-1,"SYSTEM"|"Internal software event.","Get parameter PID contror for 'sdfds_GROUP_10' is set to open_loop:0 is set to closed_loop:140",0,"-","-","-","-","-","-","-","-","-","-","-","-","-","-",XXXXX,"XXXXX","VDN [937637728]","?.?",116," ","00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000","Autom atic","00000-00000000-m0000-00","END"

Null/Space --> "08/18/2016 09:10:25", ,-1,
Pipe symbol between --> "SYSTEM"|"Internal software event.",

2) "EXCEPTION","XXXX","08/18/2016 09:10:35",490588,-1,"USER","XXXX","Test run finished.","Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",0,"-",END

New line between comma's.
,"Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",


Thanks
 
Old 06-17-2017, 11:28 AM   #8
scasey
Senior Member
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.6
Posts: 3,799

Rep: Reputation: 1279Reputation: 1279Reputation: 1279Reputation: 1279Reputation: 1279Reputation: 1279Reputation: 1279Reputation: 1279Reputation: 1279
Quote:
Originally Posted by Sivarajkhamithkar View Post
sorry for not providing proper information. please find the examples.

Code:
1) "EXCEPTION","XXXXX","08/18/2016 09:10:25",   ,-1,"SYSTEM"|"Internal software event.","Get parameter PID contror for 'sdfds_GROUP_10' is set to open_loop:0 is set to closed_loop:140",0,"-","-","-","-","-","-","-","-","-","-","-","-","-","-",XXXXX,"XXXXX","VDN [937637728]","?.?",116," ","00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000","Automatic","00000-00000000-m0000-00","END"
Null/Space --> "08/18/2016 09:10:25", ,-1,
Pipe symbol between --> "SYSTEM"|"Internal software event.",

The "null space" is simply an empty field in the CSV.
The pipe is inside a field, too. The field contains: ["SYSTEM"|"Internal software event."]
Field by field processing shouldn't have a problem with either data.
What are you doing with the file that an empty field or an embedded pipe causes a problem? I'd think "fixing" those things would cause more problems, since the number of fields wouldn't match whatever your process is expecting.

Quote:
Originally Posted by Sivarajkhamithkar View Post
Code:
2) "EXCEPTION","XXXX","08/18/2016 09:10:35",490588,-1,"USER","XXXX","Test run finished.","Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
  Test Status:         Finished
  Measurement Quality: OK
  Result Validation:   In Limits
  Machine Constants:   N.A.",0,"-",END
New line between comma's.
,"Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",

Thanks
Ah yes, fields with embedded newlines...I remember those. If you were to open the file in your favorite spreadsheet program, that field would show up as a multi-line entry. The challenge (as you probably know) is that when reading the csv file, most processes would assume that the first newline is the end of the line of data.
The file needs to be pre-processed to replace line feeds with spaces only when found between commas (or quotes, since your alpha fields are all quoted). I'll need to search for the regular expression to do that (I've got it somewhere). Maybe someone else has one handy. I'm going to be out of pocket for awhile. Sorry.

Last edited by scasey; 06-17-2017 at 11:30 AM.
 
Old 06-17-2017, 11:42 AM   #9
Sivarajkhamithkar
LQ Newbie
 
Registered: Jun 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
Hi,

Ok thanks. Actually, I'm working in testing area. Our client ask is to validate csv file and provide incorrect syntax csv list to them so that they can avoid injecting data to database.

Please help with script. Thanks!

Thanks
Sivaraj
 
Old 06-17-2017, 09:03 PM   #10
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 5,278

Rep: Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921Reputation: 1921
Quote:
The file needs to be pre-processed to replace line feeds with spaces only when found between commas (or quotes, since your alpha fields are all quoted). I'll need to search for the regular expression to do that (I've got it somewhere). Maybe someone else has one handy.
Code:
awk 'FPAT="([^,]+)|(\"[^\"]+\")"{x=$0;while((gsub(/\"/,"\"",x)%2)!=0){getline;x=x " ";x=x $0};$0=x;print}' <input.csv>
For more information on awk and CSV data see https://www.gnu.org/software/gawk/ma...ing-By-Content
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Script to ingest a csv, validate data and insert into Oracle md888 Programming 4 10-01-2015 07:59 PM
In Shell script, csv file is not generating rks10a Programming 10 10-04-2012 12:30 AM
Need shell script help with a CSV file. lothario Linux - Software 2 08-31-2012 11:40 AM
Shell Script for CSV file comparision aravind_balan Programming 1 02-17-2009 03:33 AM
Shell script to read from csv file hendemeg Programming 1 05-11-2004 08:23 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 01:37 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration