merging *.csv files to remove duplicate lines
Given: a .csv file of stock market values for a particular stock. Each line has fields for:
Code:
year-month-day,open,high,low,close,adj close,volume
The update file contains some lines that duplicate lines already in the stock file. The easy way to do what I want is something like:
Code:
cat stock.csv update.csv | sort -u > updated_stock.csv
Question: Is there a more efficient way to merge these files given the parameters?
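One small refinement worth noting up front: GNU sort accepts multiple input files directly, so the cat is not needed (a minor variant of the command above, assuming GNU coreutils):
Code:
sort -u stock.csv update.csv > updated_stock.csv
Same result, one fewer process. |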
Any merge utility should be able to do this, even without sorting.
One of the simplest (non-GUI) tools, which I sometimes use, is the sdiff utility. Assuming, e.g., your previous file has 100 lines and the update file has 100 lines, of which the first 50 duplicate the first file and the next 50 are new, you would want 150 lines in the merged file. You can use
Code:
sdiff stock.csv update.csv -o merged_file.csv
It will give an interactive prompt; press Enter once at the prompt to see the options available to you. Select the left and right versions as appropriate and press Enter. Hope that helps.
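Before sitting through the interactive prompts, it can help to know how much actually differs between the two files. A quick count of the non-common lines (a sketch, assuming GNU coreutils and a shell with process substitution):
Code:
# comm needs sorted input; -3 suppresses lines common to both files,
# so what remains is the lines unique to either file
comm -3 <(sort stock.csv) <(sort update.csv) | wc -l
|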
If the input files are already sorted, then join may be more efficient than sort -u:
Code:
join -t '' -a1 -a2 stock.csv update.csv
That said, there are several tools specifically designed to work with CSV data that are also capable of performing a join on them, e.g. miller.
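Worth noting: join requires both inputs to be sorted, which is easy to verify first (assuming GNU sort):
Code:
# sort -c checks order only; it exits non-zero and reports
# the first out-of-order line, if any
sort -c stock.csv && sort -c update.csv
|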
Join will take two matching lines and append them to create one longer line. Not, I think, the desired output.
Older versions of sort required the input to already be sorted before the -u unique flag could work properly; modern sort does not require that. I note this because your cat output will NOT be sorted until it is filtered through sort, and you mentioned nothing that would tell us the versions of anything being used. Sort is highly optimized. It is possible that something specifically coded for such a merge-and-filter might do better, but probably not by a lot. Do you code? Such a filter could be created quite easily in V (vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal.
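For what it's worth, the classic one-line version of exactly that filter already exists in awk; it keeps the first occurrence of every line and needs no sorting at all (a sketch, not a benchmark):
Code:
# seen[] is an associative array; a line is printed only the
# first time it appears, across both files
awk '!seen[$0]++' stock.csv update.csv > updated_stock.csv
Unlike sort -u, this preserves the original line order. |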
Given two small sample files a and b, here are several ways to produce the merged, de-duplicated output:
Code:
$ cat a
Code:
$ cat b
Code:
$ join -t '' -a1 -a2 a b
Code:
setop a b
Code:
mlr --csv -N join --ul --ur -s -j 1 -f a b
Code:
q -c1 'select * from a union select * from b'
Code:
trdsql 'select * from a union select * from b'
Code:
csvq -n -N -f csv 'select * from a natural full join b'
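With this many candidate tools, it is worth confirming they all agree. One quick check (a sketch; merged_join.csv and merged_mlr.csv are placeholder names for the outputs of any two of the commands above, and the assumption is that the outputs may differ only in line order):
Code:
# identical checksums after sorting mean identical line sets
sort merged_join.csv | md5sum
sort merged_mlr.csv | md5sum
|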
Now I must do time trials and tests to determine which is faster! I suspect the SQL tools will be slower, but that deserves testing as well. I will try to get back here with results while we await feedback from the OP.
UPDATE: testing. I generated a file of every other day's date, sorted, from 1 Jan 1900 to today as file 1, then every third day as file 2, to get a volume of dates worth testing.
1. Using cat and sort as the OP described sorted them in 4.683 seconds on a Pinebook Pro running current Manjaro for AARCH64.
2. Using sort without cat:
Code:
sort -u file1 file2 > files.out
3. Same platform, same data: join did the job in 0.384 seconds! Like, holy rip Batman, that was far faster than I expected!
Disclaimer: for publication I would have run tests on other hardware, run a few thousand trials of each command, and averaged the results. The platforms would have been prepped by shutting down all background processes and eliminating any GUI drain on the CPU or I/O buffers. This was just a quick and dirty comparison. Your mileage may vary.
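For anyone wanting to reproduce this, the test files are easy to regenerate (a sketch of one way to do it, assuming GNU date and seq; the day count is an approximation of the span described, not the exact figure used above):
Code:
# every other day from 1900-01-01 to roughly today (~45000 days)
for i in $(seq 0 2 45000); do date -d "1900-01-01 +$i days" +%F; done > file1
# every third day over the same span
for i in $(seq 0 3 45000); do date -d "1900-01-01 +$i days" +%F; done > file2
Since the dates are emitted in increasing order and %F is ISO format, both files come out already sorted. |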
sort has a -m option for merge.
See
Code:
man sort
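For example (assuming stock.csv and update.csv are each already sorted, which they should be if they came from earlier runs of this pipeline):
Code:
# -m merges without re-sorting; -u still drops duplicates
sort -m -u stock.csv update.csv > updated_stock.csv
|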
...and check to see if it's in the table. But with all the suggestions people have given here, I think I can find a one-line script to meet my needs. I'm open to suggestions if you have other ideas. Thanks. |
Okay, fantastic response, thanks guys. I will try sdiff, join, setop, the merge option, etc. BTW, if you want to try your ideas on real data, I'm getting it from finance.yahoo.com as "historic data" for individual stocks. The way I am using the data is via "make": the Makefile calls a series of programs I wrote in PHP and does several things like plotting the data into a PNG, computing the Markowitz reward and risk values, moving averages, etc. (a sketch of what a merge rule might look like follows below). Nothing fancy; I update the data every week or so.
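A minimal sketch of how the merge step could slot into such a Makefile. The target and file names here are hypothetical placeholders, not the OP's actual rules, and it assumes both inputs are already sorted and headerless:
Code:
# hypothetical rule: rebuild the merged file whenever either input changes
updated_stock.csv: stock.csv update.csv
	sort -u -m stock.csv update.csv > updated_stock.csv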
Thanks again for all your ideas. |
Interestingly, there was recently another thread about these data.
|
Some better testing, using three trial runs with a larger data set on the same hardware:
Code:
sort -u file1 file2 > file3               : 34.955  36.286  35.799
sort -u -m a b > c                        :  8.783   8.801   9.397
join -t " " -a1 -a2 file1 file2 > file3   : 12.691  13.424  12.749
Sort -m is unsuitable alone, as it does not eliminate duplicates, but it is VERY fast; combined with -u it leads the pack. So it looks to me as if sort is the best of these three options, but only if you feed it properly. Join is a nice second choice. Either is much faster than anything involving re-sorting the already-sorted lists, or anything (so far) involving cat or another second executable. Compared to the original code, using sort this way eliminates cat and reduces execution of this step to only 25% to 33% (roughly) of the original execution time. Nice!
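For anyone repeating the comparison, a simple harness along these lines will do (a sketch, assuming bash and GNU time at /usr/bin/time; swap in each candidate command in turn):
Code:
# run the candidate three times, printing elapsed wall-clock seconds to stderr
for run in 1 2 3; do
    /usr/bin/time -f '%e' sort -u -m file1 file2 > file3
done
|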