LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 04-10-2021, 03:37 AM   #1
satyarankireddy
Member
 
Registered: Mar 2019
Posts: 48

Rep: Reputation: Disabled
Perl command performance issue in Linux


I've used the following Perl command to remove special characters from the number columns of a .CSV file. It works as expected, but the problem is performance. My CSV number columns contain data like the sample below; to remove the thousands-separator commas from the data I used the Perl command shown further down.
Quote:
payment
"4,326.34"
590.20
"12,499.40"
Note: the file delimiter is the comma (",").

Input file:
Organization,Amount,Revenue,Balance,Desc
Congos,"4,233.78","3,233.78","1,233.78",Payment
Toyoto,590.2,390.2,190.2,Payment
lenives,"5,234.89","2,234.89","1,234.89",Payment


Expected output:
Organization,Amount,Revenue,Balance,Desc
Congos,4233.78,3233.78,1233.78,Payment
Toyoto,590.2,390.2,190.2,Payment
lenives,5234.89,2234.89,1234.89,Payment


Command: cat | perl -p -e 's/,(?=[\d,.]\d")//g and s/"(\d[\d,.])"/\1/g' 'test.csv' >> newfile.csv

File size: roughly 11 million rows.

Issue: it takes almost 10 minutes to strip the thousands-separator commas from the data.

Any better solution to improve performance?

Last edited by satyarankireddy; 04-10-2021 at 05:08 AM.
 
Old 04-10-2021, 06:12 AM   #2
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,600

Rep: Reputation: 2546

Why are you trying to remove the thousand separator in the numbers?

Often people try to do that when they're trying to handle CSV formatted data without using a CSV parser - there are plenty of flexible CLI CSV parsing tools that will avoid needing to jump through hoops like this.


(Also, why are you piping "cat" into the perl command?)
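To make that suggestion concrete: below is a minimal sketch of what handling this with a real CSV parser looks like, here in Python with the stdlib csv module (the sample rows are taken from the first post; everything else is illustrative, not a tool anyone in the thread named):

```python
import csv
import io

# Sample input from the thread; columns 2-4 quote numbers that
# contain "," as a thousands separator.
sample = """Organization,Amount,Revenue,Balance,Desc
Congos,"4,233.78","3,233.78","1,233.78",Payment
Toyoto,590.2,390.2,190.2,Payment
lenives,"5,234.89","2,234.89","1,234.89",Payment
"""

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(sample)):
    # The parser has already resolved the quoting, so stripping the
    # separator is a plain string operation on the three number columns.
    row[1:4] = [field.replace(",", "") for field in row[1:4]]
    writer.writerow(row)

print(out.getvalue())
```

Because no field contains a comma afterwards, the writer emits every field unquoted, which matches the expected output in the first post; no regular expression has to guess where a quoted field starts or ends.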


Last edited by boughtonp; 04-10-2021 at 06:13 AM.
 
Old 04-10-2021, 06:21 AM   #3
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,600

Rep: Reputation: 2546

Also, five minutes before posting here, you posted the same question here: https://www.linux.org/threads/perl-performance-issue.33962

It's rude to cross-post without at the very least saying you've done so and providing the URL of the other thread.
(When you don't do that, people responding may waste time repeating what others have already said.)

 
Old 04-10-2021, 09:20 AM   #4
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
A good list of CLI CSV parsing tools can be found at
https://github.com/dbohdan/structured-text-tools

As boughtonp said, use one (or several) of the tools from the list to work directly with your data.

Another possibility is to convert the CSV to another format before processing the data. Many tools from the list above (and even more not listed there) can convert CSV to TSV (tab-separated values), which is easier to work with. But there are other options as well (e.g. csvquote).
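The CSV-to-TSV conversion itself can be sketched with any CSV parser; a small Python illustration (not one of the listed tools, just to show the idea) using the thread's sample row:

```python
import csv
import io

sample = 'Organization,Amount,Revenue,Balance,Desc\nCongos,"4,233.78","3,233.78","1,233.78",Payment\n'

tsv = io.StringIO()
writer = csv.writer(tsv, delimiter="\t", lineterminator="\n")
for row in csv.reader(io.StringIO(sample)):
    writer.writerow(row)

# The embedded commas no longer collide with the field delimiter, so the
# writer drops the quotes and downstream tools can split on tabs safely.
print(tsv.getvalue())
```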

That said, there are options specifically tailored to the task at hand. Like
Code:
csvfix number -smq -f 2,3,4 test.csv
Or
Code:
csvtk -l replace -f2-4 -p'[",]' -r '' <test.csv
Or (less reliable, but blazingly fast):
Code:
csvquote test.csv|tr -d '\37'|csvquote -u
It doesn't remove double quotes though. A better take on this would be preselecting relevant fields with teip before feeding them to tr:
Code:
csvquote test.csv|teip -d, -f2-4 -- tr -d '"\37'|csvquote -u
Or using sed:
Code:
csvquote test.csv|
  sed -E '/\c_/s/"(-?[0-9]+)((\c_)([0-9]+))+(.[0-9]{1,2})?"/\1\4\5/g'|
  csvquote -u
Depending on the data, it might be OK to run sed even without csvquote:
Code:
sed -E '/,"-?[0-9]+,/s/"(-?[0-9]+)((,)([0-9]+))+(.[0-9]{1,2})?"/\1\4\5/g'
If that is the case then this one could be slightly faster. Or not.
Code:
teip -og'"-?[0-9]+(,[0-9]+)+(.[0-9]{1,2})?"' -- tr -d ,\" <test.csv
But it would be less safe as well. The problem with the last two commands is that there are many CSV dialects in the wild, and not all of them strictly follow RFC 4180. In some CSV dialects, it's entirely possible to have organization names (the first field) like, say, Code "Omega"-1 and have them unquoted. Besides, the last field (Desc) could also include numbers and quotes and be unquoted. Consider records like
Code:
Code "Omega"-1,222,333,444,5.5"-high heels
The last command would mangle it to
Code:
Code "Omega"-12223334445.5"-high heels
In this particular case, the teip regular expression could be fixed using Oniguruma RE engine which is more powerful, but also slower than the GNU RE engine:
Code:
teip -oGg',\K"-?[0-9]+(,[0-9]+)+(.[0-9]{1,2})?"' -- tr -d ,\" <test.csv
but that's beside the point I'm trying to get across.

Actually, RFC 4180 states
Quote:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
[...]
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
So in an RFC 4180-conformant CSV that record might look like
Code:
"Code ""Omega""-1",222,333,444,"5.5""-high heels"
but you can't count on that.

I cannot think offhand of any example data that would trip the sed command above, but that doesn't mean it doesn't exist. All that prevented sed from tripping badly in this example was the first comma at the beginning of the expression. Hopefully, this shows why parsing CSV with regular expressions is dangerous.

Last edited by shruggy; 04-11-2021 at 01:28 PM.
 
1 member found this post helpful.
Old 04-10-2021, 12:30 PM   #5
satyarankireddy
Member
 
Registered: Mar 2019
Posts: 48

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by boughtonp View Post
Also, five minutes before posting here, you posted the same question here: https://www.linux.org/threads/perl-performance-issue.33962

It's rude to cross-post without at the very least saying you've done so and providing the URL of the other thread.
(When you don't do that, people responding may waste time repeating what others have already said.)

To get a quick answer, I posted the same question on multiple platforms. I'm going to delete my posts on the other platforms. Anyway, please help me with this here; I don't want to waste other folks' time.
 
Old 04-10-2021, 01:49 PM   #6
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
For completeness' sake, here are solutions in Miller
Code:
mlr --csv put '
  func n(n){return is_numeric(n)?fmtnum(float(n),"%.2lf"):gsub(n, ",", "")}
  for(k in mapexcept($*,$[[1]],"Desc")){$[k]=n($[k])}
  ' test.csv
gawk
Code:
gawk -vFPAT='[^,]*|"([^"]|"")+"' '
  $2$3$4~/,/{for(i=2;i<5;i++)gsub(/[",]/,"",$i)}1
  ' test.csv
and gawk-csv
Code:
gawk -icsv -vCSVMODE=1 -vCSVFS=, '
  $2$3$4~/,/{for(i=2;i<5;i++)gsub(/,/,"",$i)}1
  ' test.csv

Last edited by shruggy; 04-11-2021 at 03:02 PM.
 
1 member found this post helpful.
Old 04-10-2021, 01:59 PM   #7
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by satyarankireddy View Post
To get a quick answer, I posted the same question on multiple platforms. I'm going to delete my posts on the other platforms. Anyway, please help me with this here; I don't want to waste other folks' time.
Right...except you CAN'T delete your posts on other platforms, any more than you can delete them here. Funny, you RESPONDED in your other thread and asked follow-up questions.
https://unix.stackexchange.com/quest...formance-issue

And you've been asking about CSV files and parsing for TWO YEARS now. You are essentially running sed over 11 MILLION lines...why are you surprised it's taking a while? And since you've told us NOTHING about your system, the file, disk, CPU, or anything about the Linux you're running, why do you expect us to guess at why it's slow, or how to fix the problem?

Write an actual program, using actual parsing utilities, if you want better speed. After two years, you should at least be able to do that. And as boughtonp asked, why are you removing the thousands separator in the first place? Is it for database insertion? That's easily handled by the DB itself (if the field is defined correctly).
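Such a program doesn't have to be long. A hypothetical streaming sketch in Python (the function name and column choices are mine, not from the thread) that parses properly and never holds the 11 million rows in memory at once:

```python
import csv

def strip_thousands(infile, outfile, columns=(1, 2, 3)):
    """Copy CSV from infile to outfile, removing thousands-separator
    commas from the given zero-based columns, one row at a time."""
    writer = csv.writer(outfile, lineterminator="\n")
    for row in csv.reader(infile):
        for i in columns:
            row[i] = row[i].replace(",", "")
        writer.writerow(row)

# Hypothetical usage on the thread's file names:
# with open("test.csv", newline="") as src, \
#         open("newfile.csv", "w", newline="") as dst:
#     strip_thousands(src, dst)
```

Whether this beats the Perl one-liner on 11 million rows would have to be measured on the actual machine, but it is at least correct for quoted fields by construction.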

Last edited by TB0ne; 04-10-2021 at 02:05 PM.
 
Old 04-11-2021, 12:19 AM   #8
satyarankireddy
Member
 
Registered: Mar 2019
Posts: 48

Original Poster
Rep: Reputation: Disabled
Hello All,

Thanks for your time. I'm closing this thread why bcz working with other folks in another forum.

Thanks
 
Old 04-11-2021, 04:45 PM   #9
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by satyarankireddy View Post
Hello All,
Thanks for your time. I'm closing this thread why bcz working with other folks in another forum.

Thanks
Read the LQ Rules about not using text speak. And your 'working with other folks' in this case means, "Someone else wrote me a script in another forum", I presume?
 
  


Tags
csv, perlscript, shell script


