Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place! |
Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
|
 |
06-17-2017, 10:02 AM
|
#1
|
LQ Newbie
Registered: Jun 2017
Posts: 5
Rep: 
|
Need to validate CSV file using shell script
Hi All,
Hope you are doing good. I'm new to shell scripting. So don’t know much.
I need to learn that how unix script to validate csv files and print which file and row having issues in 10000 records.
Scenario:
Suppose many *.csv files are in a directory.
Each CSV file having multiple data types and has 10000 rows with comma separated semi structured data.
Script has to go through all the csv files and check if all the field with comma separated or not. Example – “, ,“(NULL) OR “” (Without comma) OR “|” etc..
then script print the corresponding names on screen which csv files and row having invalid format.
Can anyone help me out this with script. it’s a bit difficult task for a newbie like me.
looking forward from you guys..
Thanks
Sivaraj
Email: sivaraj111raajjz@gmail.com
|
|
|
06-17-2017, 10:20 AM
|
#2
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,718
|
I'd look at perl with Text::CSV from CPAN.
Are you trying to normalize the data format or just flag lines with incorrect syntax?
|
|
|
06-17-2017, 10:23 AM
|
#3
|
LQ Newbie
Registered: Jun 2017
Posts: 5
Original Poster
Rep: 
|
Hi,
Thanks for responding. The agenda is to avoid processing incorrect syntax file and reason for incorrect syntax so that we can give justification to business why we are unable to process specific files.
Thanks
Sivaraj
|
|
|
06-17-2017, 10:26 AM
|
#4
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,718
|
Can you post a few sanitized lines of typical errors you are looking to spot?
Please wrap them in [code] [/code] tags so that they are easier to read.
|
|
|
06-17-2017, 10:44 AM
|
#5
|
LQ Newbie
Registered: Jun 2017
Posts: 5
Original Poster
Rep: 
|
Hi Turbocapitalist,
Thanks. Main agenda is to find incorrect file having incorrect syntax.
Possbile spots are below
1) Space/null ex: [], ,[/]
2) Pipe between string ex: []|[/]
3) New line ex: []
[/]
Thanks
Sivaraj Khamithkar
Last edited by Sivarajkhamithkar; 06-17-2017 at 10:47 AM.
|
|
|
06-17-2017, 10:54 AM
|
#6
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,718
|
Quote:
Originally Posted by Sivarajkhamithkar
Main agenda is to find incorrect file having incorrect syntax.
|
Right. Please show two or three examples, with any sensitive content replaced with xxxxxx or something.
|
|
|
06-17-2017, 11:05 AM
|
#7
|
LQ Newbie
Registered: Jun 2017
Posts: 5
Original Poster
Rep: 
|
sorry for not providing proper information. please find the examples.
1) "EXCEPTION","XXXXX","08/18/2016 09:10:25", ,-1,"SYSTEM"|"Internal software event.","Get parameter PID contror for 'sdfds_GROUP_10' is set to open_loop:0 is set to closed_loop:140",0,"-","-","-","-","-","-","-","-","-","-","-","-","-","-",XXXXX,"XXXXX","VDN [937637728]","?.?",116," ","00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000","Autom atic","00000-00000000-m0000-00","END"
Null/Space --> "08/18/2016 09:10:25", ,-1,
Pipe symbol between --> "SYSTEM"|"Internal software event.",
2) "EXCEPTION","XXXX","08/18/2016 09:10:35",490588,-1,"USER","XXXX","Test run finished.","Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",0,"-",END
New line between comma's.
,"Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",
Thanks
|
|
|
06-17-2017, 11:28 AM
|
#8
|
LQ Veteran
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: Rocky 9.5
Posts: 5,876
|
Quote:
Originally Posted by Sivarajkhamithkar
sorry for not providing proper information. please find the examples.
Code:
1) "EXCEPTION","XXXXX","08/18/2016 09:10:25", ,-1,"SYSTEM"|"Internal software event.","Get parameter PID contror for 'sdfds_GROUP_10' is set to open_loop:0 is set to closed_loop:140",0,"-","-","-","-","-","-","-","-","-","-","-","-","-","-",XXXXX,"XXXXX","VDN [937637728]","?.?",116," ","00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000","Automatic","00000-00000000-m0000-00","END"
|
Null/Space --> "08/18/2016 09:10:25", ,-1,
Pipe symbol between --> "SYSTEM"|"Internal software event.",
The "null space" is simply an empty field in the CSV.
The pipe is inside a field, too. The field contains: ["SYSTEM"|"Internal software event."]
Field by field processing shouldn't have a problem with either data.
What are you doing with the file that an empty field or an embedded pipe causes a problem? I'd think "fixing" those things would cause more problems, since the number of fields wouldn't match whatever your process is expecting.
Quote:
Originally Posted by Sivarajkhamithkar
Code:
2) "EXCEPTION","XXXX","08/18/2016 09:10:35",490588,-1,"USER","XXXX","Test run finished.","Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",0,"-",END
New line between comma's.
,"Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",
Thanks
|
Ah yes, fields with embedded newlines...I remember those. If you were to open the file in your favorite spreadsheet program, that field would show up as a multi-line entry. The challenge (as you probably know) is that when reading the csv file, most processes would assume that the first newline is the end of the line of data.
The file needs to be pre-processed to replace line feeds with spaces only when found between commas (or quotes, since your alpha fields are all quoted). I'll need to search for the regular expression to do that (I've got it somewhere). Maybe someone else has one handy. I'm going to be out of pocket for awhile. Sorry.
Last edited by scasey; 06-17-2017 at 11:30 AM.
|
|
|
06-17-2017, 11:42 AM
|
#9
|
LQ Newbie
Registered: Jun 2017
Posts: 5
Original Poster
Rep: 
|
Hi,
Ok thanks. Actually, I'm working in testing area. Our client ask is to validate csv file and provide incorrect syntax csv list to them so that they can avoid injecting data to database.
Please help with script. Thanks!
Thanks
Sivaraj
|
|
|
06-17-2017, 09:03 PM
|
#10
|
LQ 5k Club
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,551
|
Quote:
The file needs to be pre-processed to replace line feeds with spaces only when found between commas (or quotes, since your alpha fields are all quoted). I'll need to search for the regular expression to do that (I've got it somewhere). Maybe someone else has one handy.
|
Code:
awk 'FPAT="([^,]+)|(\"[^\"]+\")"{x=$0;while((gsub(/\"/,"\"",x)%2)!=0){getline;x=x " ";x=x $0};$0=x;print}' <input.csv>
For more information on awk and CSV data see https://www.gnu.org/software/gawk/ma...ing-By-Content
|
|
|
All times are GMT -5. The time now is 07:34 AM.
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|