LinuxQuestions.org


rdx 03-27-2021 08:31 AM

merging *.csv files to remove duplicate lines
 
Given: a .csv file of stock market values for a particular stock,
each line has fields for:
Code:

year-month-day,open,high,low,close,adj close,volume
I can get the data for the last year (~252 entries) as a file in the same format. Since I update the data frequently, most of the data in
the update file are duplicate lines. The easy way to do what I want is something like:
Code:

cat stock.csv update.csv | sort -u > updated_stock.csv
Note: the input files are both sorted by date, which is perfect, and exactly what I want on the output.

Question: Is there a more efficient way to merge these files given the parameters?

s.verma 03-27-2021 08:52 AM

Any merge utility should be able to do this. (Even without sorting)

I think one of the simplest (non-GUI) tools, which I sometimes use, is the sdiff utility.
Assuming, e.g., your previous file has 100 lines and the update file has 100 lines, of which the first 50 duplicate the first file and the other 50 are new, then you would want 150 lines in the merged file. You can do that with
Code:

$ sdiff stock.csv update.csv -o merged_file.csv
It will give an interactive prompt; press Enter once at the prompt to see the options available to you. Select the left and right sides as needed and press Enter.

Hope that helps.

shruggy 03-27-2021 08:55 AM

If the input files are already sorted then join may be more efficient than sort -u:
Code:

join -t '' -a1 -a2 stock.csv update.csv
A lesser-known alternative to join is setop.

That said, there are several tools specifically designed to work with CSV data that are also capable of performing joins on them, e.g. Miller.

wpeckham 03-27-2021 09:12 AM

Join will take two lines and append them to create one longer line. Not, I think, the desired output.

Older versions of sort required the input to already be sorted before the -u (unique) flag would work properly; modern sort does not require that. I note this because your cat output will NOT be sorted until it is filtered through sort, and you mentioned nothing that would tell us which versions of anything are being used.


Sort is highly optimized. It is possible that something specifically coded for such a merge and filter might do better, but probably not by a lot.
Do you code? Such a filter might be created using V (vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal quite easily.
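
For example, sticking to standard tools, that kind of merge filter can be sketched in awk: walk both (already sorted) files in step and print each line once. This is an untested sketch, not something benchmarked here; it borrows the OP's file names and assumes whole-line string comparison matches date order (true for year-month-day dates at the start of each line).
Code:

awk -v f2=update.csv '
  # print a line unless it repeats the previous line printed
  function emit(line) { if (line != last) print line; last = line }
  BEGIN { more2 = (getline l2 < f2) > 0 }        # prime the first line of file 2
  {
    while (more2 && l2 < $0) {                   # file-2 lines that sort earlier
      emit(l2)
      more2 = (getline l2 < f2) > 0
    }
    if (more2 && l2 == $0)                       # identical line in both files: keep one
      more2 = (getline l2 < f2) > 0
    emit($0)
  }
  END {                                          # flush whatever is left of file 2
    while (more2) { emit(l2); more2 = (getline l2 < f2) > 0 }
  }
' stock.csv > updated_stock.csv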

shruggy 03-27-2021 09:35 AM

Quote:

Originally Posted by wpeckham (Post 6234667)
Join will take two lines and append them to create one longer line. Not, I think, the desired output.

Nope. Not in this case.
Code:

$ cat a
a
c
d
f

Code:

$ cat b
b
c
d
e

Code:

$ join -t '' -a1 -a2 a b
a
b
c
d
e
f

The same result could be achieved with
Code:

setop a b
and
Code:

mlr --csv -N join --ul --ur -s -j 1 -f a b
There are also several SQL-based tools that allow running SQL queries on CSV data, e.g.
Code:

q -c1 'select * from a union select * from b'
or
Code:

trdsql 'select * from a union select * from b'
SQLite-based tools like q and trdsql don't support full outer join, but some tools based on custom SQL interpreters do:
Code:

csvq -n -N -f csv 'select * from a natural full join b'
Although I doubt these would be faster than sort -u.

wpeckham 03-27-2021 01:49 PM

Quote:

Originally Posted by shruggy (Post 6234675)
Nope. Not in this case. [...] Although I doubt these would be faster than sort -u.

Interesting! I had not used join that way, and it does work nicely.

Now I must do time trials and test to determine which is faster! I suspect any SQL tools will be slower, but that deserves some testing as well. I will try to get back here with results while we await feedback from the OP.



UPDATE: testing. I generated a file of every other day's date, sorted from 1 Jan 1900 to today, as file 1, then every third day as file 2, to get a volume of dates worth testing.
1. Using cat and sort as the OP described sorted them in 4.683 seconds on a Pinebook Pro running current Manjaro for AArch64.

2. Using sort without cat
Code:

sort -u file1 file2 > files.out
cut the sort time to 0.948 seconds. Lesson: never cat what you can read directly.

3. Same platform, same data, join did the job in 0.384 seconds!
Like: holy rip batman, that was far faster than I expected!

Disclaimer: for publication I would have run tests on other hardware, run a few thousand tests of each command, and averaged the results. The platforms during testing would have been prepped by shutting down all background processes and eliminating any GUI drain on the CPU or I/O buffers. This was just a quick and dirty compare. Your mileage may vary.
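
For anyone who wants to repeat a comparison like this, here is a rough sketch of how the test data and timings could be reproduced (assuming bash and GNU date; these are not the exact commands behind the numbers above):
Code:

# Build two date-only test files: every 2nd and every 3rd day since 1900-01-01
# (slow but simple, one date call per line).
start=$(date -u -d 1900-01-01 +%s)
now=$(date -u +%s)
for ((t=start; t<=now; t+=2*86400)); do date -u -d "@$t" +%F; done > file1
for ((t=start; t<=now; t+=3*86400)); do date -u -d "@$t" +%F; done > file2

# Time each candidate three times, discarding the merged output.
for cmd in 'cat file1 file2 | sort -u' \
           'sort -u file1 file2' \
           "join -t '' -a1 -a2 file1 file2"
do
    echo "== $cmd"
    for run in 1 2 3; do
        time sh -c "$cmd > /dev/null"
    done
done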

MadeInGermany 03-28-2021 01:03 AM

sort has a -m option for merge.
See
Code:

man sort
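Applied to the files in this thread, which are both already sorted by date, that would be something along the lines of:
Code:

sort -m -u stock.csv update.csv > updated_stock.csv
-m skips the sorting pass entirely and just merges the pre-sorted inputs, while -u still drops the duplicate lines.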

syg00 03-28-2021 04:38 AM

Quote:

Originally Posted by wpeckham (Post 6234765)
Your mileage may vary.

Hmmm - as an absolute minimum you (royal plural) need to invoke drop_caches prior to each (and every) run.
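
For reference, that normally means something like the following, run as root before each timed command:
Code:

sync                               # write dirty pages out first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes
Otherwise the second and later runs read the files from RAM and look faster than they really are.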

rdx 03-28-2021 07:30 AM

Quote:

Originally Posted by wpeckham (Post 6234667)
Do you code? Such a filter might be created using V (vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal quite easily.

Yes, I code. For simple stuff I usually use PHP; for serious programs I use C/C++. But I don't know an efficient algorithm, and I'm doubtful that starting from scratch is worth the effort. I read some post suggesting a hash table; I could read in the smaller file and store it in the table, then read the larger file line by line and check whether each line is in the table. But with all the suggestions people have given here, I think I can find a one-line script to meet my needs. Still, I'm open to suggestions if you have ideas. Thanks
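
For reference, that hash-table idea is essentially a one-liner in awk: keep every line already seen in an associative array and print a line only on its first appearance. A sketch using the file names from the first post; the output stays date-sorted only because the new rows in update.csv all come after the last row of stock.csv:
Code:

awk '!seen[$0]++' stock.csv update.csv > updated_stock.csv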

rdx 03-28-2021 07:38 AM

Quote:

Originally Posted by syg00 (Post 6234950)
Hmmm - as an absolute minimum you (royal plural) need to invoke drop_caches prior to each (and every) run.

What does this mean? I have no idea what, when, how this applies. Please expand on your comment.

shruggy 03-28-2021 07:40 AM

It was a reply to
Quote:

Originally Posted by wpeckham (Post 6234765)
Disclaimer: for publication I would have run tests on other hardware, run a few thousand tests of each command, and averaged the results. [...] This was just a quick and dirty compare. Your mileage may vary.


rdx 03-28-2021 07:47 AM

Okay, fantastic response, thanks guys. I will try sdiff, join, setop, the merge option, etc. BTW, if you want to try your ideas on real data, I'm getting it from finance.yahoo.com as "historical data" for individual stocks. The way I am using the data is via "make". The Makefile calls a series of programs I wrote in PHP and does several things like plot the data into a PNG, compute the Markowitz reward and risk values, moving averages, etc. Nothing fancy; I update the data every week or so.

Thanks again for all your ideas.

shruggy 03-28-2021 08:17 AM

Interestingly, there was recently another thread about these data.

wpeckham 03-28-2021 09:38 AM

Some better testing, using three trial runs with a larger data set on the same hardware:
sort -u file1 file2 > file3 : 34.955 36.286 35.799

sort -m is unsuitable alone, as it does not eliminate duplicates, but it is VERY fast.
sort -u -m a b > c : 8.783 8.801 9.397

join -t " " -a1 -a2 file1 file2 > file3 : 12.691 13.424 12.749

So it looks to me as if sort is the best of these three options, but only if you feed it properly. Join is a nice second choice. Either is much faster than anything involving sorting the already-sorted lists or anything (so far) involving cat or a second executable.

From the original code, using sort this way eliminates cat and reduces execution of this step to only 25% to 33% (roughly) of the original execution time. Nice!

rdx 03-28-2021 02:07 PM

Quote:

Originally Posted by wpeckham (Post 6235009)
Some better testing, using three trial runs with a larger data set on the same hardware: [...] So it looks to me as if sort is the best of these three options, but only if you feed it properly.

VERY NICE! The tip about eliminating 'cat' is great, thanks. I'm about to do some testing myself (took a nap), but I think you've made the choice clear.

