Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Sorting a large CSV-file
I have a CSV list with 35.000.000 rows of data, of which over 10.000.000 are duplicates that I need to remove. I turned to Linux, where for the first time I could actually see the data with my own eyes... and then I found out that 'split' could make the list editable in Excel (piece by piece). The simple commands 'split' and 'cat' have now helped me remove almost 10.000.000 duplicates from the list, though I know there must be more. All I have done is turn 35 files with duplicates into 25 files with unique entries... but put together into one file there will be duplicates again, when the entry that was unique in file 1 meets the entry that was unique in file 7, etc.
I'm very(!) new to Linux, but since I so quickly found very simple tools that could take me this far, I can't stop wondering if there are also tools to get the list SORTED... so I can open my file pieces and find the remaining duplicates lying next to each other.
Anyone with an idea how to do this little sorting?
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Original Poster
Quote:
Originally Posted by syg00
You need "sort" - it has an option to sort and only output the first (unique) entry of multiples.
Code:
sort -u input.file > output.file
Simple. See the manpage.
Scary! Can it really be that easy? Just tried and obviously it could. I can now see a sorted list loading in Gedit. I've been grinding my teeth for a week doing formulas and macros that have made Excel crash... and one simple command line (and a minute of waiting) fixed it all. Incredible. Why haven't I used this more?
Thank you VERY much for this!!! It was help that meant much to the Municipality of Söderhamn in Sweden.
"Manpage"... what is that? Now I'm getting more and more excited about this.
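For anyone who wants to try the quoted command on something small first, here is a minimal sketch. The sample rows, the file names and the demo directory are made up for illustration. `LC_ALL=C` is an optional addition that is not part of the quoted command; it makes sort compare raw bytes, which is usually faster and gives the same order regardless of locale settings.

```shell
# Hypothetical demo directory and sample rows standing in for the real CSV.
demo_dir="${TMPDIR:-/tmp}/sort_u_demo"
rm -rf "$demo_dir"
mkdir -p "$demo_dir"
printf '%s\n' 'b,2' 'a,1' 'b,2' 'c,3' 'a,1' > "$demo_dir/input.csv"

# Sort the rows and keep only one copy of each identical line.
LC_ALL=C sort -u "$demo_dir/input.csv" > "$demo_dir/output.csv"

# Compare the row counts before and after, then look at the result.
wc -l < "$demo_dir/input.csv"
wc -l < "$demo_dir/output.csv"
cat "$demo_dir/output.csv"
```

On a file with tens of millions of rows, GNU sort writes temporary files when the data does not fit in memory. If /tmp is small, the `-T` option can point it at a directory with more free space.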
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Original Poster
Quote:
Originally Posted by syg00
Just to expand - I was assuming you were working with your original file, @scasey was using your files after you had run "split". Either will work.
You were assuming right. I was aiming for sorting the entire list. But if his way works I'm going to try that too. I want to learn more now, so alternate ways are still appreciated, even though I only need one for the moment.
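A small sketch of why either route can end at the same result. The exact command used on the split pieces is not shown in this thread, so handing all pieces to a single `sort -u` is an assumption here, and the sample rows, the `piece_` prefix and the demo directory are hypothetical.

```shell
# Hypothetical demo directory and sample rows.
demo_dir="${TMPDIR:-/tmp}/split_sort_demo"
rm -rf "$demo_dir"
mkdir -p "$demo_dir"
printf '%s\n' 'd,4' 'a,1' 'c,3' 'a,1' 'b,2' 'd,4' > "$demo_dir/input.csv"

# Cut the file into 2-line pieces named piece_aa, piece_ab, ...
split -l 2 "$demo_dir/input.csv" "$demo_dir/piece_"

# Route 1: sort the original file.
LC_ALL=C sort -u "$demo_dir/input.csv" > "$demo_dir/from_original.csv"

# Route 2: give sort all the pieces at once, so duplicates that sit in
# different pieces still meet each other.
LC_ALL=C sort -u "$demo_dir"/piece_* > "$demo_dir/from_pieces.csv"

# cmp prints nothing and exits 0 when the two files are identical.
cmp "$demo_dir/from_original.csv" "$demo_dir/from_pieces.csv" && echo "same result"
```

The important detail is that the pieces go through one sort together. Running `sort -u` on each piece separately would leave the cross-file duplicates described in the first post.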
"man" pages are the help system - used from a terminal for any command installed. So a good place to start is "man man" - use "q" (no quotes) to quit and return to the terminal.
syg00: My post was meant to be an expansion on your excellent post. Yes, man is your friend.
Runarsson: Welcome to LQ! There's lots to learn about linux here...enjoy the trip!!
If you wish you can reach the man pages online, see: https://linux.die.net/man/1/man https://linux.die.net/man/1/sort
but obviously you can also reach them from the command line, as was already explained. You can also try info instead of man: info sort will work too.
And unfortunately you cannot always find the solution just by reading man pages; sometimes it is better to ask...
Welcome here, at LQ.
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Original Poster
Thank you, all of you, both for helping me with my problem and for welcoming me here.
I have good experiences of the power of communities when you want to learn something new, and the good and QUICK help I got here is what makes a forum into one that I like. So I'm sure I will be a very frequent visitor, learning more both from asking questions and from reading other people's posts.
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195
Quote:
Originally Posted by Runarsson
Scary! Can it really be that easy? Just tried and obviously it could. I can now see a sorted list loading in Gedit. I've been grinding my teeth for a week doing formulas and macros that have made Excel crash... and one simple command line (and a minute of waiting) fixed it all. Incredible. Why haven't I used this more?
Linux is focused on getting things done. Actually it is a giant toolbox. It doesn't always look simple. But good tools are not always simple.
Windows OTOH is focused on making things look easy. And it fails when the scale increases by two orders of magnitude, or when something comes up that has not been thought of before.
That is why many Windows users think Windows (and its applications) is easy and Linux is complicated. As you know now, it is different.
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Original Poster
Quote:
Originally Posted by Runarsson
Thank you, all of you, both for helping me with my problem and for welcoming me here.
I have good experiences of the power of communities when you want to learn something new, and the good and QUICK help I got here is what makes a forum into one that I like. So I'm sure I will be a very frequent visitor, learning more both from asking questions and from reading other people's posts.
Last time I joined an internet forum for asking questions and learning something new, the help I got led all the way to the top, where I was even recognized globally. Not exactly my intention here, but a good example of what I mean by "good experiences" (as well as a presentation): www.swedneckflyfishing.com/aboutme.htm
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Original Poster
Quote:
Originally Posted by jlinkels
Linux is focused on getting things done. Actually it is a giant toolbox. It doesn't always look simple. But good tools are not always simple.
Windows OTOH is focused on making things look easy. And it fails when the scale increases by two orders of magnitude, or when something comes up that has not been thought of before.
That is why many Windows users think Windows (and its applications) is easy and Linux is complicated. As you know now, it is different.
jlinkels
It's really easy to see, actually. A freshly installed Windows is practically empty, and you have to find and add what you need right from the start. A fresh Linux installation isn't empty but comes packed with tools from the start. Another thing is the "looks" of these tools. Everything has a "cheaper" look to it, both in design and often in a lower level of simplicity. But that's only logical when development has to focus on multiple things... like usability, lookability and (of course) sellability. I'm sure the Windows environment would look about the same if all the focus had been on pure power.
I have HAD Linux for a long time, but the only reason I haven't started to actually use it and look deeper is that I'm so deeply rooted in Windows. In my work I am the guy who helps people "get things done" in Windows, and since I'm used to finding solutions when I face Windows problems I haven't seen the real need to look elsewhere. This, however, became a real eye opener... when my own Windows toolbox didn't get me anywhere, but very simple Linux tools took me straight through the wall I had run into.
I'm sure I will start to use Linux more now. It will not replace my Windows toolbox. But I'm sure that, as I learn more, it will offer a lot of shortcuts to get my job done... and this particular job will be an excellent example of it. I was sitting with 25.000.000 rows of data left that I couldn't do more with using my Windows toolbox. The next thing the IT strategist and I had planned was for me to write a new program and then get them to allocate more server power for me just for running it. "sort -u" in Linux was all that it took to avoid all that hassle. If this doesn't qualify as a definition of the word 'SHORTCUT', then nothing will.
Distribution: Mint Cinnamon and LXLE + VirtualBox bunch
Posts: 35
Original Poster
Ok, it seems like I was a little bit ahead of myself with this list. The 'sort -u' turned out to be a tool for the second step. Step one was actually identifying these duplicates and listing the data and the time windows in which they occurred... and in the NEXT step removing the time column and getting rid of the data duplicates.
'sort -g' looked like it gave me a more sorted list than the original and a file size that was the same. But I still want to double check with the ones who know more (i.e. you). Is 'sort -g' the one to use for ONLY sorting the list?
(As feedback I can let you know that your help made me able to remove another 2.500.000 duplicates... which is about 2.300.000 more than I had expected/hoped for. So believe me when I say that this help made a big difference and was greatly appreciated.)
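The two steps described above could look roughly like the sketch below. The real column layout is not given in this thread, so the layout here is an assumption: column 1 is the time, column 2 is the data, the separator is a comma, and no field contains a quoted comma. The file names and sample rows are hypothetical too. On the `-g` question: in GNU sort, `-g` compares lines by their general numeric (floating-point) value, while plain `sort` with no option compares them as text and is the usual choice for only sorting.

```shell
# Hypothetical demo directory and sample rows (time,data).
demo_dir="${TMPDIR:-/tmp}/dup_steps_demo"
rm -rf "$demo_dir"
mkdir -p "$demo_dir"
printf '%s\n' \
  '2017-03-02 10:15,pump7' \
  '2017-03-04 08:00,pump3' \
  '2017-03-01 09:30,pump7' \
  '2017-03-03 11:45,pump9' > "$demo_dir/log.csv"

# Only sorting: plain sort, without -u or -g. No rows are removed.
LC_ALL=C sort "$demo_dir/log.csv" > "$demo_dir/sorted.csv"

# Step 1a: list the data values that occur more than once.
# uniq -d prints one copy of each line that is repeated in sorted input.
cut -d, -f2 "$demo_dir/log.csv" | LC_ALL=C sort | uniq -d > "$demo_dir/dup_values.txt"

# Step 1b: sort by the data column, then by time, so duplicates lie
# next to each other together with their timestamps.
LC_ALL=C sort -t, -k2,2 -k1,1 "$demo_dir/log.csv" > "$demo_dir/by_data.csv"

# Step 2: drop the time column and keep each data value once.
cut -d, -f2 "$demo_dir/log.csv" | LC_ALL=C sort -u > "$demo_dir/unique_data.txt"

cat "$demo_dir/by_data.csv"
```

If the real file has more columns or quoted fields containing commas, the `-t`, `-k` and `cut -f` values would need adjusting, and a CSV-aware tool may be the safer choice.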