LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 04-10-2021, 03:37 AM   #1
satyarankireddy
Member
 
Registered: Mar 2019
Posts: 48

Rep: Reputation: Disabled
Perl command performance issue in Linux


I've used the following Perl command to remove special characters from the number columns of a .CSV file. It works as expected, but the problem is performance. My CSV number columns contain data like the sample below; to remove the thousands-separator commas from the data I used the Perl command shown further down.
Quote:
payment
"4,326.34"
590.20
"12,499.40"
Note: the file delimiter is the comma (",").

Input file:
Organization,Amount,Revenue,Balance,Desc
Congos,"4,233.78","3,233.78","1,233.78",Payment
Toyoto,590.2,390.2,190.2,Payment
lenives,"5,234.89","2,234.89","1,234.89",Payment


Expected output:
Organization,Amount,Revenue,Balance,Desc
Congos,4233.78,3233.78,1233.78,Payment
Toyoto,590.2,390.2,190.2,Payment
lenives,5234.89,2234.89,1234.89,Payment


Command: cat | perl -p -e 's/,(?=[\d,.]\d")//g and s/"(\d[\d,.])"/\1/g' 'test.csv' >> newfile.csv

File size: roughly 11 million rows.

Issue: it takes almost 10 minutes to strip the thousands-separator commas from the data.

Any better solution to improve performance?

Last edited by satyarankireddy; 04-10-2021 at 05:08 AM.
 
Old 04-10-2021, 06:12 AM   #2
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,600

Rep: Reputation: 2546

Why are you trying to remove the thousand separator in the numbers?

Often people try to do that when they're trying to handle CSV formatted data without using a CSV parser - there are plenty of flexible CLI CSV parsing tools that will avoid needing to jump through hoops like this.


(Also, why are you piping "cat" into the perl command?)
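To make that suggestion concrete: below is a minimal sketch of what handling this with a real CSV parser looks like, here in Python with the stdlib csv module (the sample rows are taken from the first post; everything else is illustrative, not a tool anyone in the thread named):

```python
import csv
import io

# Sample input from the thread; columns 2-4 quote numbers that
# contain "," as a thousands separator.
sample = """Organization,Amount,Revenue,Balance,Desc
Congos,"4,233.78","3,233.78","1,233.78",Payment
Toyoto,590.2,390.2,190.2,Payment
lenives,"5,234.89","2,234.89","1,234.89",Payment
"""

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(sample)):
    # The parser has already resolved the quoting, so stripping the
    # separator is a plain string operation on the three number columns.
    row[1:4] = [field.replace(",", "") for field in row[1:4]]
    writer.writerow(row)

print(out.getvalue())
```

Because no field contains a comma afterwards, the writer emits every field unquoted, which matches the expected output in the first post; no regular expression has to guess where a quoted field starts or ends.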


Last edited by boughtonp; 04-10-2021 at 06:13 AM.
 
Old 04-10-2021, 06:21 AM   #3
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,600

Rep: Reputation: 2546

Also, five minutes before posting here, you posted the same question here: https://www.linux.org/threads/perl-performance-issue.33962

It's rude to cross-post without at the very least saying you've done so and providing the URL of the other thread.
(When you don't do that, people responding may waste time repeating what others have already said.)

 
Old 04-10-2021, 09:20 AM   #4
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
A good list of CLI CSV parsing tools can be found at
https://github.com/dbohdan/structured-text-tools

As boughtonp said, use one (or several) of the tools from the list to work directly with your data.

Another possibility is to convert the CSV to another format before processing the data. Many tools from the list above (and even more not listed there) can convert CSV to TSV (tab-separated values), which is easier to work with. But there are other options as well (e.g. csvquote).
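The CSV-to-TSV conversion itself can be sketched with any CSV parser; a small Python illustration (not one of the listed tools, just to show the idea) using the thread's sample row:

```python
import csv
import io

sample = 'Organization,Amount,Revenue,Balance,Desc\nCongos,"4,233.78","3,233.78","1,233.78",Payment\n'

tsv = io.StringIO()
writer = csv.writer(tsv, delimiter="\t", lineterminator="\n")
for row in csv.reader(io.StringIO(sample)):
    writer.writerow(row)

# The embedded commas no longer collide with the field delimiter, so the
# writer drops the quotes and downstream tools can split on tabs safely.
print(tsv.getvalue())
```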

That said, there are options specifically tailored to the task at hand. Like
Code:
csvfix number -smq -f 2,3,4 test.csv
Or
Code:
csvtk -l replace -f2-4 -p'[",]' -r '' <test.csv
Or (less reliable, but blazingly fast):
Code:
csvquote test.csv|tr -d '\37'|csvquote -u
It doesn't remove double quotes though. A better take on this would be preselecting relevant fields with teip before feeding them to tr:
Code:
csvquote test.csv|teip -d, -f2-4 -- tr -d '"\37'|csvquote -u
Or using sed:
Code:
csvquote test.csv|
  sed -E '/\c_/s/"(-?[0-9]+)((\c_)([0-9]+))+(.[0-9]{1,2})?"/\1\4\5/g'|
  csvquote -u
Depending on the data, it might be OK to run sed even without csvquote:
Code:
sed -E '/,"-?[0-9]+,/s/"(-?[0-9]+)((,)([0-9]+))+(.[0-9]{1,2})?"/\1\4\5/g'
If that is the case then this one could be slightly faster. Or not.
Code:
teip -og'"-?[0-9]+(,[0-9]+)+(.[0-9]{1,2})?"' -- tr -d ,\" <test.csv
But it would be less safe as well. The problem with the last two commands is that there are many CSV dialects in the wild, and not all of them strictly follow RFC 4180. In some CSV dialects, it's entirely possible to have organization names (the first field) like, say, Code "Omega"-1 and have them unquoted. Besides, the last field (Desc) could also include numbers and quotes and be unquoted. Consider records like
Code:
Code "Omega"-1,222,333,444,5.5"-high heels
The last command would mangle it to
Code:
Code "Omega"-12223334445.5"-high heels
In this particular case, the teip regular expression could be fixed using Oniguruma RE engine which is more powerful, but also slower than the GNU RE engine:
Code:
teip -oGg',\K"-?[0-9]+(,[0-9]+)+(.[0-9]{1,2})?"' -- tr -d ,\" <test.csv
but that's beside the point I'm trying to get across.

Actually, RFC 4180 states
Quote:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
[...]
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
So in an RFC 4180-conformant CSV that record might look like
Code:
"Code ""Omega""-1",222,333,444,"5.5""-high heels"
but you can't count on that.

I cannot think offhand of any example data that would trip the sed command above, but that doesn't mean it doesn't exist. All that prevented sed from tripping badly in this example was the first comma at the beginning of the expression. Hopefully, this shows why parsing CSV with regular expressions is dangerous.

Last edited by shruggy; 04-11-2021 at 01:28 PM.
 
1 member found this post helpful.
Old 04-10-2021, 12:30 PM   #5
satyarankireddy
Member
 
Registered: Mar 2019
Posts: 48

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by boughtonp View Post
Also, five minutes before posting here, you posted the same question here: https://www.linux.org/threads/perl-performance-issue.33962

It's rude to cross-post without at the very least saying you've done so and providing the URL of the other thread.
(When you don't do that, people responding may waste time repeating what others have already said.)

To get a quick answer, I posted the same question on multiple platforms. I'm going to delete my posts on the other platforms. Anyway, please help me with this here; I don't want to waste other folks' time.
 
Old 04-10-2021, 01:49 PM   #6
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
For completeness' sake, here are solutions in Miller
Code:
mlr --csv put '
  func n(n){return is_numeric(n)?fmtnum(float(n),"%.2lf"):gsub(n, ",", "")}
  for(k in mapexcept($*,$[[1]],"Desc")){$[k]=n($[k])}
  ' test.csv
gawk
Code:
gawk -vFPAT='[^,]*|"([^"]|"")+"' '
  $2$3$4~/,/{for(i=2;i<5;i++)gsub(/[",]/,"",$i)}1
  ' test.csv
and gawk-csv
Code:
gawk -icsv -vCSVMODE=1 -vCSVFS=, '
  $2$3$4~/,/{for(i=2;i<5;i++)gsub(/,/,"",$i)}1
  ' test.csv

Last edited by shruggy; 04-11-2021 at 03:02 PM.
 
1 member found this post helpful.
Old 04-10-2021, 01:59 PM   #7
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by satyarankireddy View Post
To get a quick answer, I posted the same question on multiple platforms. I'm going to delete my posts on the other platforms. Anyway, please help me with this here; I don't want to waste other folks' time.
Right...except you CAN'T delete your posts on other platforms, any more than you can delete them here. Funny, you RESPONDED in your other thread and asked follow-up questions.
https://unix.stackexchange.com/quest...formance-issue

And you've been asking about CSV files and parsing for TWO YEARS now. You are essentially running sed over 11 MILLION lines...why are you surprised it's taking a while? And since you've told us NOTHING about your system, the file, disk, CPU, or anything about the Linux you're running, why do you expect us to guess at why it's slow, or how to fix the problem?

Write an actual program, using actual parsing utilities, if you want better speed. After two years, you should at least be able to do that. And as boughtonp asked, why are you removing the thousands separator in the first place? Is it for database insertion? That's easily handled by the DB itself (if the field is defined correctly).
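Such a program doesn't have to be long. A hypothetical streaming sketch in Python (the function name and column choices are mine, not from the thread) that parses properly and never holds the 11 million rows in memory at once:

```python
import csv

def strip_thousands(infile, outfile, columns=(1, 2, 3)):
    """Copy CSV from infile to outfile, removing thousands-separator
    commas from the given zero-based columns, one row at a time."""
    writer = csv.writer(outfile, lineterminator="\n")
    for row in csv.reader(infile):
        for i in columns:
            row[i] = row[i].replace(",", "")
        writer.writerow(row)

# Hypothetical usage on the thread's file names:
# with open("test.csv", newline="") as src, \
#         open("newfile.csv", "w", newline="") as dst:
#     strip_thousands(src, dst)
```

Whether this beats the Perl one-liner on 11 million rows would have to be measured on the actual machine, but it is at least correct for quoted fields by construction.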

Last edited by TB0ne; 04-10-2021 at 02:05 PM.
 
Old 04-11-2021, 12:19 AM   #8
satyarankireddy
Member
 
Registered: Mar 2019
Posts: 48

Original Poster
Rep: Reputation: Disabled
Hello All,

Thanks for your time. I'm closing this thread why bcz working with other folks in another forum.

Thanks
 
Old 04-11-2021, 04:45 PM   #9
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by satyarankireddy View Post
Hello All,
Thanks for your time. I'm closing this thread why bcz working with other folks in another forum.

Thanks
Read the LQ Rules about not using text speak. And your 'working with other folks' in this case means, "Someone else wrote me a script in another forum", I presume?
 
  


Tags
csv, perlscript, shell script


