[SOLVED] merging *.csv files to remove duplicate lines
I can get the data for the last year (~252 entries) as a file in the same format. Since I update the data frequently, most of the lines in the update file duplicate lines already in the main file. The easy way to do what I want is something like:
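(The command itself didn't come through in the post, but judging from the follow-ups below — the cat output being filtered through sort, and the timing tests — it was presumably along these lines, using stock.csv and update.csv as the running example:
Code:
cat stock.csv update.csv | sort -u > merged.csv
)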
Any merge utility should be able to do this. (Even without sorting)
I think one of the simplest (non-GUI) tools, one I sometimes use, is the sdiff utility.
Assuming, for example, your previous file has 100 lines and the update file has 100 lines, of which the first 50 duplicate the first file and the remaining 50 are new, so the merged file should have 150 lines, then you can use:
$ sdiff stock.csv update.csv -o merged_file.csv
It gives an interactive prompt; press Enter once at the prompt to see the options available to you. Choose the left or right selection and press Enter.
Join will take two lines and append them to create one longer line. Not, I think, the desired output.
Older versions of sort required the input to already be sorted before the -u unique flag would work properly; modern sort does not require that. I note this because your cat output will NOT be sorted until it is filtered through sort, and you mentioned nothing that would tell us the versions of anything being used.
Sort is highly optimized. It is possible that something specifically coded for such a merge and filter might do better, but probably not by a lot.
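As an aside: if both input files are already sorted (as the later posts suggest these are), GNU sort's merge mode can skip the sorting pass entirely. A sketch, assuming the same file names as above:
Code:
sort -m -u stock.csv update.csv > merged.csv
The -m flag tells sort to merely merge already-sorted inputs, so combined with -u it deduplicates in a single linear pass.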
Do you code? Such a filter might be created using v(vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal quite easily.
SQLite-based tools like q and trdsql don't support full outer joins, but some tools based on custom SQL interpreters do:
Code:
csvq -n -N -f csv 'select * from a natural full join b'
Although I doubt these would be faster than sort -u.
Interesting! I had not used join that way, and it does work nicely.
Now I must run some time trials to determine which is faster! I suspect any SQL tools will be slower, but that deserves testing as well. I will try to get back here with results while we await feedback from the OP.
UPDATE: testing. I generated a file of every-other-day dates, sorted, from 1 Jan 1900 to today as file 1, then every third day as file 2, to get a volume of dates worth testing.
1. Using cat and sort as the OP described sorted them in 4.683 seconds on a Pinebook Pro running current Manjaro for AArch64.
2. Using sort without cat
Code:
sort -u file1 file2 > files.out
cut the sort time to 0.948 seconds. Lesson: never cat what you can read directly.
3. Same platform, same data, join did the job in 0.384 seconds!
Like: holy rip, Batman, that was far faster than I expected!
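(The exact join invocation isn't shown in the post. For two sorted single-column files like this test data, a plausible form that yields the union — keeping unpairable lines from both sides via -a — would be:
Code:
join -a 1 -a 2 file1 file2 > files.out
Note this relies on the inputs being sorted, and for multi-column CSV data it would need a join field and output format specified.)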
Disclaimer: for publication I would have run tests on other hardware, run a few thousand repetitions of each command, and averaged the results. The test platforms would have been prepped by shutting down all background processes and eliminating any GUI drain on the CPU or I/O buffers. This was just a quick-and-dirty comparison. Your mileage may vary.
Do you code? Such a filter might be created using v(vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal quite easily.
Yes, I code. For simple stuff I usually use PHP; for serious programs I use C/C++. But I don't know an efficient algorithm, and I'm doubtful that starting from scratch is worth the effort. I read some posts suggesting a hash table: I could read the smaller file into the table, then read the larger file line by line and check whether each line is already in the table. But with all the suggestions people have given here, I think I can find a one-line script to meet my needs (see the awk sketch just below). But I'm open to suggestions if you have ideas. Thx
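For what it's worth, the hash-table filter described above already exists as a classic awk one-liner: awk keeps an associative array of lines seen so far and prints each line only the first time it occurs — no sorting required, and the original line order is preserved. A sketch, using the file names from earlier in the thread:
Code:
awk '!seen[$0]++' stock.csv update.csv > merged.csv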
Okay, fantastic response, thanks guys. I will try sdiff, join, setop, the merge option, etc. BTW, if you want to try your ideas on real data, I'm getting it from finance.yahoo.com as "historical data" for individual stocks. The way I am using the data is via make: the Makefile calls a series of programs I wrote in PHP and does several things, like plotting the data into a PNG, computing the Markowitz reward and risk values, moving averages, etc. Nothing fancy; I update the data every week or so.
So it looks to me as if sort is the best of these three options, but only if you feed it properly. Join is a nice second choice. Either is much faster than anything involving sorting the already sorted lists or anything (so far) involving cat or another second executable.
From the original code, using sort this way eliminates cat and cuts the execution of this step to roughly 20% of the original time (0.948 s versus 4.683 s). Nice!
VERY NICE! The tip about eliminating 'cat' is great, thanks. I'm about to do some testing myself (took a nap), but I think you've made the choice clear.