Linux - Newbie
This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
Hope you are doing well. I'm new to shell scripting, so I don't know much.
I need to learn how to write a Unix script that validates CSV files and prints which files and rows have issues, with about 10000 records per file.
Scenario:
Suppose there are many *.csv files in a directory.
Each CSV file holds multiple data types and has 10000 rows of comma-separated, semi-structured data.
The script has to go through all the CSV files and check whether every field is properly comma separated. Examples of what should be flagged: ", ," (NULL), "" (no comma), "|", etc.
The script should then print on screen the names of the CSV files, and the rows in them, that have an invalid format.
Can anyone help me out with this script? It's a bit of a difficult task for a newbie like me.
Thanks for responding. The aim is to avoid processing files with incorrect syntax, and to report the reason for the incorrect syntax, so that we can justify to the business why we are unable to process specific files.
Sorry for not providing proper information. Please find the examples below.
1) "EXCEPTION","XXXXX","08/18/2016 09:10:25", ,-1,"SYSTEM"|"Internal software event.","Get parameter PID contror for 'sdfds_GROUP_10' is set to open_loop:0 is set to closed_loop:140",0,"-","-","-","-","-","-","-","-","-","-","-","-","-","-",XXXXX,"XXXXX","VDN [937637728]","?.?",116," ","00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000","Automatic","00000-00000000-m0000-00","END"
Null/Space --> "08/18/2016 09:10:25", ,-1,
Pipe symbol between --> "SYSTEM"|"Internal software event.",
2) "EXCEPTION","XXXX","08/18/2016 09:10:35",490588,-1,"USER","XXXX","Test run finished.","Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",0,"-",END
Newline between commas -->
,"Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",
Sorry for not providing proper information. Please find the examples below.
Code:
1) "EXCEPTION","XXXXX","08/18/2016 09:10:25", ,-1,"SYSTEM"|"Internal software event.","Get parameter PID contror for 'sdfds_GROUP_10' is set to open_loop:0 is set to closed_loop:140",0,"-","-","-","-","-","-","-","-","-","-","-","-","-","-",XXXXX,"XXXXX","VDN [937637728]","?.?",116," ","00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000;00000000","Automatic","00000-00000000-m0000-00","END"
Null/Space --> "08/18/2016 09:10:25", ,-1,
Pipe symbol between --> "SYSTEM"|"Internal software event.",
The "null space" is simply an empty field in the CSV.
The pipe is inside a field, too. The field contains: ["SYSTEM"|"Internal software event."]
Field-by-field processing shouldn't have a problem with either of them.
What are you doing with the file that an empty field or an embedded pipe causes a problem? I'd think "fixing" those things would cause more problems, since the number of fields wouldn't match whatever your process is expecting.
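To make the point about field-by-field processing concrete, here is a small sketch using Python's standard `csv` module (Python rather than shell is my choice here, not something the thread asked for). The record is a shortened, hypothetical version of example 1, and `parse` is just a helper name for illustration. A lenient parser accepts both the blank field and the pipe; only a strict parser complains about the text that follows the closing quote.

```python
import csv

# Shortened, hypothetical version of example 1 from the thread
line = ('"EXCEPTION","XXXXX","08/18/2016 09:10:25", ,-1,'
        '"SYSTEM"|"Internal software event.",0,"END"')


def parse(record, strict=False):
    """Parse one CSV record and return its list of fields."""
    return next(csv.reader([record], strict=strict))


fields = parse(line)
print(len(fields))      # 8 fields, the blank one included
print(repr(fields[3]))  # ' '  (the "null/space" field is just a field)
print(fields[5])        # SYSTEM|"Internal software event."

# With strict=True the csv module rejects text after a closing quote
try:
    parse(line, strict=True)
except csv.Error as err:
    print("strict parser rejects the record:", err)
```

So whether the pipe counts as "invalid" depends on how strict the consumer of the file is, which is why it helps to know what the downstream process expects.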
Quote:
Originally Posted by Sivarajkhamithkar
Code:
2) "EXCEPTION","XXXX","08/18/2016 09:10:35",490588,-1,"USER","XXXX","Test run finished.","Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",0,"-",END
Newline between commas -->
,"Test Run Results for sfadsfdsf (DGSS Droplet Stability Test)
Test Status: Finished
Measurement Quality: OK
Result Validation: In Limits
Machine Constants: N.A.",
Thanks
Ah yes, fields with embedded newlines...I remember those. If you were to open the file in your favorite spreadsheet program, that field would show up as a multi-line entry. The challenge (as you probably know) is that when reading the csv file, most processes would assume that the first newline is the end of the line of data.
The file needs to be pre-processed to replace line feeds with spaces only when they occur between commas (or between quotes, since your alpha fields are all quoted). I'll need to search for the regular expression to do that (I've got it somewhere). Maybe someone else has one handy. I'm going to be out of pocket for a while. Sorry.
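In case it helps in the meantime, the pre-processing step can be sketched without a regular expression. This is a minimal Python sketch, not the regex mentioned above: it walks the text character by character, tracks whether it is inside a double-quoted field, and swaps line breaks for spaces only there. The function name `flatten_quoted_newlines` is made up for illustration, and the approach assumes the quotes in the file are balanced.

```python
def flatten_quoted_newlines(text):
    """Replace line breaks inside double-quoted fields with a space."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state stays correct
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == "\n":
            out.append(" ")
        elif in_quotes and ch == "\r":
            continue  # drop carriage returns inside a field (e.g. from CRLF)
        else:
            out.append(ch)
    return "".join(out)


# Shortened version of example 2 from the thread
sample = (
    '"USER","Test run finished.","Test Run Results\n'
    'Test Status: Finished\n'
    'Machine Constants: N.A.",0,"-",END\n'
)
print(flatten_quoted_newlines(sample), end="")
```

When reading a real file, opening it with `open(path, newline="")` keeps the original line endings intact so the function sees them unchanged. Line breaks outside quotes are left alone, so each record ends up on one physical line.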
OK, thanks. Actually, I work in testing. Our client's ask is to validate the CSV files and give them a list of the files with incorrect syntax, so that they can avoid loading bad data into the database.
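For that kind of report, one possible starting point is sketched below. It uses Python's `csv` module instead of a pure shell script, because that module already understands quoted fields that span several lines. Everything named here (`check_csv`, `check_directory`, the `flag_blank` switch, the reason strings) is hypothetical, and the rules are only my reading of the three problems in the examples: blank fields, line breaks inside a field, and text such as a pipe after a closing quote. Treat it as a sketch under those assumptions, not a finished validator.

```python
import csv
import glob
import os


def check_csv(path, flag_blank=True):
    """Return a list of (line_number, reason) tuples for one CSV file."""
    problems = []
    # newline="" lets the csv module see embedded line breaks unchanged
    with open(path, newline="") as handle:
        reader = csv.reader(handle, strict=True)
        start = 1  # physical line on which the current record begins
        try:
            for row in reader:
                if any("\n" in field or "\r" in field for field in row):
                    problems.append((start, "line break inside a quoted field"))
                if flag_blank and any(field.strip() == "" for field in row):
                    problems.append((start, "empty or blank field"))
                start = reader.line_num + 1
        except csv.Error as err:
            # strict=True rejects text after a closing quote, e.g. "A"|"B".
            # Checking of this file stops at the first such error.
            problems.append((reader.line_num, f"syntax error: {err}"))
    return problems


def check_directory(directory):
    """Map each problematic *.csv file name to its list of problems."""
    report = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        problems = check_csv(path)
        if problems:
            report[os.path.basename(path)] = problems
    return report
```

Calling `check_directory("/path/to/csv/files")` and printing each file name with its `(line, reason)` pairs gives the list to hand to the client. Note that the blank-field rule also flags quoted blanks such as `" "`, which do appear in example 1, so `flag_blank` may need to be switched off or refined depending on what the business counts as invalid.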