LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 12-30-2017, 07:10 PM   #1
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Rep: Reputation: Disabled
Sorting a large CSV-file


I have a CSV list with 35.000.000 rows of data, of which over 10.000.000 are duplicates that I need to remove. I turned to Linux, where for the first time I could actually see the data with my own eyes... and then I found out that 'split' made it possible to edit the list in Excel (piece by piece). The simple commands 'split' and 'cat' have now helped me remove almost 10.000.000 duplicates from the list, though I know there must be more. All I have done is turn 35 files with duplicates into 25 files with unique rows... but put together into one file there will be duplicates again, when the row that was unique in file 1 meets the row that was unique in file 7, etc.
I'm very(!) new to Linux, but since I so quickly found very simple tools that could take me this far, I can't stop wondering whether there are also tools to get the list SORTED... so I can open my file pieces and find the remaining duplicates lying next to each other.

Anyone with an idea how to do this little sorting?
 
Old 12-30-2017, 07:20 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,161

Rep: Reputation: 4125
You need "sort" - it has an option to sort and only output the first (unique) entry of multiples.
Code:
sort -u input.file > output.file
Simple. See the manpage.
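As a small illustration of what `-u` does, here is a sketch on a made-up five-line file (the file contents are placeholders, not the poster's data). Setting `LC_ALL=C` is an optional addition, not part of the command above: it makes sort compare raw bytes, which gives the same order regardless of the system locale.

```shell
#!/bin/sh
# Build a tiny sample file with two duplicated rows (placeholder data).
printf '%s\n' \
  'b,2' \
  'a,1' \
  'b,2' \
  'c,3' \
  'a,1' > input.file

# Sort the rows and keep only one copy of each identical line.
# LC_ALL=C: byte-wise comparison, independent of the system locale.
LC_ALL=C sort -u input.file > output.file

# Show the result: three unique rows in sorted order.
cat output.file
```

Note that `-u` compares whole lines here, so two rows count as duplicates only if they are identical character for character.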
 
2 members found this post helpful.
Old 12-30-2017, 07:32 PM   #3
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,764

Rep: Reputation: 2225
Concatenate the files, then sort unique:
Code:
cat file1 file2 ... file35 | sort -u > output.file
That will get the duplicates between files.
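To see why the pieces have to be sorted together, here is a sketch with two made-up pieces (`part1` and `part2` are placeholder names, not the actual split output). Each piece has no duplicates on its own, but the two share one row. `sort` also accepts several file names directly, so the `cat` is optional.

```shell
#!/bin/sh
# Two placeholder pieces, each internally unique, sharing the row "b,2".
printf '%s\n' 'a,1' 'b,2' > part1
printf '%s\n' 'b,2' 'c,3' > part2

# Concatenate, then sort and drop repeated lines across both pieces.
cat part1 part2 | LC_ALL=C sort -u > merged.file

# Equivalent without cat: sort reads all named files as one input.
LC_ALL=C sort -u part1 part2 > merged2.file

# Show the merged result: the shared row appears only once.
cat merged.file
```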
 
1 member found this post helpful.
Old 12-30-2017, 07:49 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,161

Rep: Reputation: 4125
Just to expand - I was assuming you were working with your original file; @scasey was using your files after you had run "split". Either will work.
 
Old 12-30-2017, 07:54 PM   #5
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00 View Post
You need "sort" - it has an option to sort and only output the first (unique) entry of multiples.
Code:
sort -u input.file > output.file
Simple. See the manpage.
Scary! Can it really be that easy? I just tried it, and obviously it can. I can now see a sorted list loading in Gedit. I've been grinding my teeth for a week writing formulas and macros that have made Excel crash... and one simple command line (and a minute of waiting) fixed it all. Incredible. Why haven't I used this more?

Thank you VERY much for this!!! This help meant a lot to the Municipality of Söderhamn in Sweden.

"Manpage"... what is that? Now I'm getting more and more excited about this.
 
Old 12-30-2017, 08:00 PM   #6
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00 View Post
Just to expand - I was assuming you were working with your original file; @scasey was using your files after you had run "split". Either will work.
You were assuming right. I was aiming to sort the entire list. But if his way works, I'm going to try that too. I want to learn more now, so alternative ways are still appreciated, even though I only need one for the moment.
 
Old 12-30-2017, 08:22 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,161

Rep: Reputation: 4125
"man" pages are the help system - used from a terminal for any command installed. So a good place to start is "man man" - use "q" (no quotes) to quit and return to the terminal.
 
1 member found this post helpful.
Old 12-31-2017, 12:21 AM   #8
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,764

Rep: Reputation: 2225
syg00: My post was meant to be an expansion on your excellent post. Yes, man is your friend.
Runarsson: Welcome to LQ! There's lots to learn about Linux here... enjoy the trip!!
 
Old 12-31-2017, 12:51 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,161

Rep: Reputation: 4125
All good - hopefully we have helped a new convert appreciate the usefulness (and power) of the Linux command line.
 
Old 12-31-2017, 04:51 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,116

Rep: Reputation: 7369
Quote:
Originally Posted by syg00 View Post
"man" pages are the help system - used from a terminal for any command installed. So a good place to start is "man man" - use "q" (no quotes) to quit and return to the terminal.
If you wish, you can read the man pages online, see:
https://linux.die.net/man/1/man
https://linux.die.net/man/1/sort
but obviously you can also reach them from the command line, as already explained. You can also try info instead of man: "info sort" will work too.

And unfortunately you cannot always find the solution just by reading man pages; sometimes it is better to ask...
Welcome to LQ.

Last edited by pan64; 12-31-2017 at 05:03 AM.
 
1 member found this post helpful.
Old 12-31-2017, 06:10 AM   #11
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Original Poster
Rep: Reputation: Disabled
Thank you, all of you, both for helping me with my problem and for welcoming me here.

I have good experiences of the power of communities when you want to learn something new, and the good and QUICK help I got here is what makes a forum into one that I like. So I'm sure I will be a very frequent visitor, learning more both from asking questions and from reading other people's posts.
 
Old 12-31-2017, 06:17 AM   #12
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1043
Quote:
Originally Posted by Runarsson View Post
Scary! Can it really be that easy? I just tried it, and obviously it can. I can now see a sorted list loading in Gedit. I've been grinding my teeth for a week writing formulas and macros that have made Excel crash... and one simple command line (and a minute of waiting) fixed it all. Incredible. Why haven't I used this more?
Linux is focused on getting things done. Actually, it is a giant toolbox. It doesn't always look simple. But good tools are not always simple.

Windows, OTOH, is focused on making things look easy. And it fails when the scale increases by two orders of magnitude. Or when something has not been thought of before.

That is why many Windows users think Windows (and its applications) are easy and Linux is complicated. As you know now, it is different.

jlinkels
 
Old 12-31-2017, 06:53 AM   #13
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Runarsson View Post
Thank you, all of you, both for helping me with my problem and for welcoming me here.

I have good experiences of the power of communities when you want to learn something new, and the good and QUICK help I got here is what makes a forum into one that I like. So I'm sure I will be a very frequent visitor, learning more both from asking questions and from reading other people's posts.
The last time I joined an internet forum to ask questions and learn something new, the help I got led all the way to the top, where I was even recognized globally. Not exactly my intention here, but a good example of what I mean by "good experiences" (as well as a presentation of myself): www.swedneckflyfishing.com/aboutme.htm

Last edited by Runarsson; 12-31-2017 at 06:55 AM.
 
Old 12-31-2017, 08:21 AM   #14
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by jlinkels View Post
Linux is focused on getting things done. Actually, it is a giant toolbox. It doesn't always look simple. But good tools are not always simple.

Windows, OTOH, is focused on making things look easy. And it fails when the scale increases by two orders of magnitude. Or when something has not been thought of before.

That is why many Windows users think Windows (and its applications) are easy and Linux is complicated. As you know now, it is different.

jlinkels
It's really easy to see, actually. A fresh Windows installation is practically empty, and you have to find and add tools right from the start. A fresh Linux installation isn't empty but comes packed with tools from the start. Another thing is the "looks" of these tools. Everything has a "cheaper" look, both in design and often in a lower level of simplicity. But that's only logical when the creators have to focus on multiple things... like usability, lookability and (of course) sellability. I'm sure the Windows environment would look about the same if all the focus had been on pure power.

I have HAD Linux for a long time, but the only reason I haven't started to actually use it and look deeper is that I'm so deeply rooted in Windows. In my work I am the guy who helps people "get things done" in Windows, and since I'm used to finding solutions when I face Windows problems, I haven't seen a real need to look elsewhere. This, however, became a real eye opener... when my own Windows toolbox didn't get me anywhere, but very simple Linux tools took me straight through the wall I had run into.

I'm sure I will start to use Linux more now. It will not replace my Windows toolbox. But I'm sure that, as I learn more, it will offer a lot of shortcuts for getting my job done... and this particular job will be an excellent example of it. I was sitting with 25.000.000 rows of data left that I couldn't do more with using my Windows toolbox. The next thing the IT strategist and I planned was for me to write a new program and then get them to allocate more server power just for running it. 'sort -u' in Linux was all it took to avoid all that hassle. If this doesn't qualify as a definition of the word 'SHORTCUT', then nothing will.

Last edited by Runarsson; 12-31-2017 at 08:35 AM.
 
Old 12-31-2017, 11:36 AM   #15
Runarsson
Member
 
Registered: Dec 2017
Location: Soderhamn, Sweden
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35

Original Poster
Rep: Reputation: Disabled
Ok, it seems I got a little ahead of myself with this list. 'sort -u' turned out to be a tool for the second step. Step one was actually identifying these duplicates and listing the data and the time windows in which they occurred... and in the NEXT step removing the time column and getting rid of the data duplicates.

'sort -g' looked like it gave me a more sorted list than the original, with the same file size. But I still want to double-check with those who know more (i.e. you). Is 'sort -g' the one to use for ONLY sorting the list?

(As feedback, I can let you know that your help enabled me to remove another 2.500.000 duplicates... which is about 2.300.000 more than I had expected/hoped for. So believe me when I say that this help made a big difference and was greatly appreciated.)
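For what it's worth, here is a sketch of the two steps described above, under loud assumptions: the file has exactly two comma-separated columns, time first and data second, with no header line and no quoted commas. The real layout isn't given in this thread, so `log.csv`, its contents and the column numbers are hypothetical. Plain `sort` without `-u` only sorts and keeps every line, which is why the file size stays the same. `-g` compares by numeric value, so it is probably only the right choice when the sort key is a number.

```shell
#!/bin/sh
# Placeholder data: column 1 = time, column 2 = data (assumed layout).
printf '%s\n' \
  '10:05,pump-A' \
  '10:01,pump-B' \
  '10:07,pump-A' \
  '10:03,pump-C' \
  '10:09,pump-B' > log.csv

# Sort only (no rows removed): order by data, then by time,
# so rows with the same data end up next to each other.
LC_ALL=C sort -t, -k2,2 -k1,1 log.csv > sorted.csv

# Step 1: list the data values that occur more than once.
cut -d, -f2 log.csv | LC_ALL=C sort | uniq -d > duplicated.txt

# Step 2: drop the time column, then keep one copy of each data value.
cut -d, -f2 log.csv | LC_ALL=C sort -u > unique.txt

# Show the sorted list with duplicates lying next to each other.
cat sorted.csv
```

Here `-t,` sets the comma as the field separator and `-k2,2 -k1,1` sorts on column 2 first and column 1 second. With a different column layout, the numbers after `-k` and `-f` would need to change accordingly.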
 
  


All times are GMT -5. The time now is 12:31 AM.