I've used the following command to remove special characters from the number columns of a .CSV file, and it works fine as expected, but the issue here is performance. My CSV file's number columns contain data as shown below. To remove the thousands-separator commas from the data, I've used the following Perl command.
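Roughly this shape (an illustrative sketch only, not the exact one-liner; it deletes any comma that sits between two digits, and relies on the number fields being quoted so the field-separator commas survive):
Code:
# illustrative sketch -- delete any comma flanked by digits on both sides
cat test.csv | perl -pe 's/(?<=\d),(?=\d)//g' > out.csv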
Why are you trying to remove the thousand separator in the numbers?
Often people try to do that when they're trying to handle CSV-formatted data without using a CSV parser - there are plenty of flexible CLI CSV parsing tools that avoid the need to jump through hoops like this.
(Also, why are you piping "cat" into the perl command?)
It's rude to cross-post without at the very least saying you've done so and providing the URL of the other thread.
(When you don't do that, people responding may waste time repeating what others have already said.)
As boughtonp said, use one (or several) of the tools from the list to work directly with your data.
Another possibility is to convert the CSV to another format before processing the data. Many tools from the list above (and even more that aren't listed there) can convert CSV to TSV (tab-separated values), which is easier to work with. But there are other options as well (e.g. csvquote).
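For instance, a quick sketch with the Text::CSV module from CPAN (embedded tabs or newlines inside fields would still need extra handling):
Code:
# sketch: CSV -> TSV using a real parser instead of regexes
perl -MText::CSV -e '
  my $csv = Text::CSV->new({ binary => 1 });
  while (my $row = $csv->getline(\*STDIN)) {
      print join("\t", @$row), "\n";
  }' <test.csv >test.tsv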
That said, there are options specifically tailored to the task at hand. Like
Code:
csvfix number -smq -f 2,3,4 test.csv
Or
Code:
csvtk -l replace -f2-4 -p'[",]' -r '' <test.csv
Or (less reliable, but blazingly fast):
Code:
csvquote test.csv|tr -d '\37'|csvquote -u
It doesn't remove double quotes though. A better take on this would be preselecting the relevant parts with teip before feeding them to tr.
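Something like this, perhaps (a sketch; the regex is illustrative, and -o makes teip's -g select only the matched spans):
Code:
# sketch: teip hands only the comma-grouped digit runs (with optional quotes)
# to tr, which deletes the commas and quotes inside them
teip -og '"?[0-9]+(,[0-9]+)+"?' -- tr -d '",' <test.csv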
But it would be less safe as well. The problem with the last two commands is that there are many CSV dialects in the wild, and not all of them strictly follow RFC 4180. In some CSV dialects it's entirely possible to have organization names (the first field) like, say, Code "Omega"-1, and for them to be unquoted. Besides, the last field (Desc) could also include numbers and quotes while being unquoted. Consider records like
Code:
Code "Omega"-1,222,333,444,5.5"-high heels
The last command would mangle it to
Code:
Code "Omega"-12223334445.5"-high heels
In this particular case, the teip regular expression could be fixed by using the Oniguruma RE engine, which is more powerful, but also slower than teip's default engine.
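For instance (a sketch, assuming teip's -G switch selects the Oniguruma engine; the lookarounds pin the match to whole, quoted fields, so it can no longer run across field separators):
Code:
# sketch: (?<![^,]) / (?![^,]) anchor the match to field boundaries,
# and the surrounding quotes are now mandatory
teip -G -og '(?<![^,])"[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?"(?![^,])' -- tr -d '",' <test.csv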
But that's beside the point I'm trying to get across.
Actually, RFC 4180 states
Quote:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
[...]
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
So in an RFC 4180-conformant CSV, that record might look like
Code:
"Code ""Omega""-1",222,333,444,"5.5""-high heels"
but you can't count on that.
I cannot think offhand of any example data that would trip up the sed command above, but that doesn't mean none exist. All that prevented sed from tripping badly in this example was the first comma at the beginning of the expression. Hopefully, this shows why parsing CSV with regular expressions is dangerous.
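A proper parser copes with that escaping where it's present; for instance (a sketch with Text::CSV, fed the conformant record from above):
Code:
# sketch: Text::CSV undoes the doubled-quote escaping per RFC 4180
perl -MText::CSV -e '
  my $csv = Text::CSV->new({ binary => 1 });
  $csv->parse(q{"Code ""Omega""-1",222,333,444,"5.5""-high heels"});
  print join(" | ", $csv->fields), "\n";
'
which prints Code "Omega"-1 | 222 | 333 | 444 | 5.5"-high heels.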
To get a quick suggestion, I posted the same question on multiple platforms. I'm going to delete my posts on the other platforms. Anyway, you guys are helping me with this; I don't want to waste other folks' time.
Right...except you CAN'T delete your posts on other platforms, any more than you can delete them here. Funny, you RESPONDED in your other thread and asked follow-up questions: https://unix.stackexchange.com/quest...formance-issue
And you've been asking about CSV files and parsing for TWO YEARS now. You are essentially running sed over 11 MILLION lines...why are you surprised it's taking a while? And since you've told us NOTHING about your system, the file, the disk, the CPU, or anything about the Linux you're running, why do you expect us to guess why it's taking a while, or how to fix the problem???
Write an actual program, using actual parsing utilities, if you want better speed. After two years, you should at least be able to do that. And as boughtonp asked, why are you removing the thousands separator in the first place? It can't be for database insertion, since that's easily handled by the DB itself (if the field is defined correctly).
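Such a program doesn't have to be long, either. A sketch with Text::CSV (assuming, as in the commands above, that the number columns are 2-4; Text::CSV uses the fast Text::CSV_XS backend when it's installed):
Code:
#!/usr/bin/perl
# sketch: strip thousands separators from columns 2-4 with a real CSV parser;
# output fields are re-quoted only where actually needed
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die Text::CSV->error_diag;

while (my $row = $csv->getline(\*STDIN)) {
    tr/,//d for @{$row}[1 .. 3];    # columns 2-4, zero-indexed
    $csv->print(\*STDOUT, $row);
}
Run it as perl fix_numbers.pl <test.csv >out.csv (fix_numbers.pl is a placeholder name).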
Hello All,
Thanks for your time. I'm closing this thread bcz I'm working with other folks in another forum.
Thanks
Read the LQ Rules about not using text speak. And your 'working with other folks' in this case means, "Someone else wrote me a script in another forum", I presume?