Old 03-27-2021, 08:31 AM   #1
rdx
Member
 
Registered: Apr 2007
Location: Dallas
Distribution: Slackware64 14.2
Posts: 283

Rep: Reputation: 25
merging *.csv files to remove duplicate lines


Given: a .csv file of stock market values for a particular stock; each line has fields for:
Code:
year-month-day,open,high,low,close,adj close,volume
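For illustration only, a couple of hypothetical rows in that format (values invented):
Code:
2021-03-25,120.10,123.40,119.75,122.90,122.90,31450000
2021-03-26,122.95,124.00,121.50,123.10,123.10,28700000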
I can get the data for the last year (~252 entries) as a file in the same format. Since I update the data frequently, most of the lines in the update file are duplicates of lines already in the main file. The easy way to do what I want is something like:
Code:
cat stock.csv update.csv | sort -u > updated_stock.csv
Note: the input files are both sorted by date, which is perfect, and exactly the order I want on the output.

Question: Is there a more efficient way to merge these files given the parameters?
 
Old 03-27-2021, 08:52 AM   #2
s.verma
Member
 
Registered: Oct 2013
Distribution: Debian Sid, Gentoo, Arch, Debian
Posts: 186
Blog Entries: 4

Rep: Reputation: 25
Any merge utility should be able to do this, even without sorting.

I think one of the simplest (non-GUI) tools, and one I sometimes use, is the sdiff utility.
Say your previous file has 100 lines and the update file has 100 lines, of which the first 50 duplicate lines in the first file and the other 50 are new; you would want 150 lines in the merged file. Then you can use
Code:
$ sdiff stock.csv update.csv -o merged_file.csv
It gives an interactive prompt; press Enter once at the prompt to see the options available to you. Select the left or right version of each differing line and press Enter.

Hope that helps.
 
1 member found this post helpful.
Old 03-27-2021, 08:55 AM   #3
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
If the input files are already sorted then join may be more efficient than sort -u. With an empty separator, GNU join treats each whole line as the join field, and -a1 -a2 also outputs the unpairable lines from both files (a full outer join):
Code:
join -t '' -a1 -a2 stock.csv update.csv
A lesser-known alternative to join is setop.

That said, there are several tools specifically designed to work with CSV data that are also capable of performing a join on it, e.g. Miller.

Last edited by shruggy; 03-27-2021 at 11:03 AM.
 
2 members found this post helpful.
Old 03-27-2021, 09:12 AM   #4
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
join normally takes a line from each file and appends one to the other, creating a single longer line. Not, I think, the desired output.

Older versions of sort required the input to already be sorted before the -u unique flag would work properly; modern sort does not require that. I note this because your cat output will NOT be sorted until it is filtered through sort, and you mentioned nothing that tells us the versions of anything being used.


sort is highly optimized. It is possible that something specifically coded for such a merge-and-filter could do better, but probably not by a lot.
Do you code? Such a filter could be created quite easily in V (vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal.
 
Old 03-27-2021, 09:35 AM   #5
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Quote:
Originally Posted by wpeckham View Post
join normally takes a line from each file and appends one to the other, creating a single longer line. Not, I think, the desired output.
Nope. Not in this case.
Code:
$ cat a
a
c
d
f
Code:
$ cat b
b
c
d
e
Code:
$ join -t '' -a1 -a2 a b
a
b
c
d
e
f
The same result could be achieved with
Code:
setop a b
and
Code:
mlr --csv -N join --ul --ur -s -j 1 -f a b
There are also several SQL-based tools that allow running SQL queries on CSV data, e.g.
Code:
q -c1 'select * from a union select * from b'
or
Code:
trdsql 'select * from a union select * from b'
SQLite-based tools like q and trdsql don't support a full outer join, but some tools based on custom SQL interpreters do:
Code:
csvq -n -N -f csv 'select * from a natural full join b'
Although I doubt these would be faster than sort -u.

Last edited by shruggy; 03-27-2021 at 03:54 PM.
 
1 member found this post helpful.
Old 03-27-2021, 01:49 PM   #6
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Quote:
Originally Posted by shruggy View Post
Nope. Not in this case. [...]
Interesting! I had not used join that way, and it does work nicely.

Now I must run time trials to determine which is faster! I suspect any SQL tools will be slower, but that deserves testing as well. I will try to get back here with results while we await feedback from the OP.



UPDATE: testing. I generated a file of every other day's date, sorted, from 1 Jan 1900 to today as file one, then every third day's date as file two, to get a volume of dates worth testing.
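(The post doesn't give the generation commands; a sketch that would produce similar files with GNU date, stepping in UTC to avoid DST hiccups:)
Code:
# assumed reconstruction, not the original commands; needs GNU date
start=$(TZ=UTC date -d 1900-01-01 +%s); end=$(date +%s)
for ((t=start; t<=end; t+=2*86400)); do TZ=UTC date -d "@$t" +%F; done > file1
for ((t=start; t<=end; t+=3*86400)); do TZ=UTC date -d "@$t" +%F; done > file2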
1. Using cat and sort as the OP described sorted them in 4.683 seconds on a Pinebook Pro running current Manjaro for AArch64.

2. Using sort without cat
Code:
sort -u file1 file2 > files.out
cut the sort time to 0.948 seconds. Lesson: never cat what you can read directly.

3. Same platform, same data: join did the job in 0.384 seconds!
Like: holy rip, Batman, that was far faster than I expected!

Disclaimer: for publication I would have run tests on other hardware, run a few thousand trials of each command, and averaged the results. The platforms under test would have been prepped by shutting down all background processes and eliminating any GUI drain on the CPU or I/O buffers. This was just a quick-and-dirty comparison. Your mileage may vary.

Last edited by wpeckham; 03-27-2021 at 02:09 PM.
 
1 member found this post helpful.
Old 03-28-2021, 01:03 AM   #7
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,798

Rep: Reputation: 1201
sort has a -m option for merge.
See
Code:
man sort
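Applied to the files from the opening post (GNU sort accepts -u together with -m, as the timings later in this thread confirm), that would be something like:
Code:
sort -u -m stock.csv update.csv > updated_stock.csv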
 
2 members found this post helpful.
Old 03-28-2021, 04:38 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,130

Rep: Reputation: 4121
Quote:
Originally Posted by wpeckham View Post
Your mileage may vary.
Hmmm - as an absolute minimum you (royal plural) need to invoke drop_caches prior to each (and every) run.
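(On Linux that typically means something like the following between runs; the value 3 drops the page cache plus dentries and inodes:)
Code:
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches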
 
Old 03-28-2021, 07:30 AM   #9
rdx
Member
 
Registered: Apr 2007
Location: Dallas
Distribution: Slackware64 14.2
Posts: 283

Original Poster
Rep: Reputation: 25
Quote:
Originally Posted by wpeckham View Post
Do you code? Such a filter could be created quite easily in V (vlang), Go, Rust, Perl, Python, C, C++, assembler, or Pascal.
Yes, I code. For simple stuff I usually use PHP; for serious programs I use C/C++. But I don't know an efficient algorithm, and I'm doubtful that starting from scratch is worth the effort. I read a post suggesting a hash table: I could read the smaller file into the table, then read the larger file line by line and check whether each line is already in the table. But with all the suggestions people have given here, I think I can find a one-line script to meet my needs. Still, I'm open to suggestions if you have ideas. Thx
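(That hash-table idea is essentially a one-liner in awk. A minimal sketch, assuming the update file only adds rows dated after the last line of the main file, so the concatenated order stays sorted:)
Code:
# read stock.csv first, remembering and echoing each line; then print
# only the update.csv lines not already seen
awk 'NR==FNR {seen[$0]; print; next} !($0 in seen)' stock.csv update.csv > updated_stock.csv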
 
Old 03-28-2021, 07:38 AM   #10
rdx
Member
 
Registered: Apr 2007
Location: Dallas
Distribution: Slackware64 14.2
Posts: 283

Original Poster
Rep: Reputation: 25
Quote:
Originally Posted by syg00 View Post
Hmmm - as an absolute minimum you (royal plural) need to invoke drop_caches prior to each (and every) run.
What does this mean? I have no idea what, when, or how this applies. Please expand on your comment.
 
Old 03-28-2021, 07:40 AM   #11
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
It was a reply to
Quote:
Originally Posted by wpeckham View Post
Disclaimer: for publication I would have run tests on other hardware [...] This was just a quick-and-dirty comparison. Your mileage may vary.
 
Old 03-28-2021, 07:47 AM   #12
rdx
Member
 
Registered: Apr 2007
Location: Dallas
Distribution: Slackware64 14.2
Posts: 283

Original Poster
Rep: Reputation: 25
Okay, fantastic response, thanks guys. I will try sdiff, join, setop, the merge option, etc. BTW, if you want to try your ideas on real data, I'm getting it from finance.yahoo.com as "historical data" for individual stocks. The way I am using the data is via make: the Makefile calls a series of programs I wrote in PHP and does several things like plotting the data into a PNG, computing the Markowitz reward and risk values, moving averages, etc. Nothing fancy; I update the data every week or so.
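(A hypothetical Makefile fragment, with file and target names invented, showing where the merge step from this thread could slot into such a workflow:)
Code:
# hypothetical rule: rebuild the merged history whenever a fresh
# download arrives; both inputs assumed sorted by date
# (the recipe line must start with a tab)
updated_stock.csv: stock.csv update.csv
	sort -u -m stock.csv update.csv > updated_stock.csv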

Thanks again for all your ideas.

Last edited by rdx; 03-28-2021 at 08:12 AM.
 
Old 03-28-2021, 08:17 AM   #13
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Interestingly, there was recently another thread about this data.
 
1 member found this post helpful.
Old 03-28-2021, 09:38 AM   #14
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Some better testing, using three trial runs with a larger data set on the same hardware:
sort -u file1 file2 > file3 : 34.955 36.286 35.799

sort -m is unsuitable alone, as it does not eliminate duplicates, but it is VERY fast:
sort -u -m a b > c : 8.783 8.801 9.397

join -t " " -a1 -a2 file1 file2 > file3 : 12.691 13.424 12.749

So it looks to me as if sort is the best of these three options, but only if you feed it properly. join is a nice second choice. Either is much faster than anything that re-sorts the already-sorted lists, or anything (so far) involving cat or a second executable.

From the original code, using sort this way eliminates cat and reduces this step to roughly 25% to 33% of the original execution time. Nice!
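(The post doesn't say how the timings were taken; presumably something along these lines with the shell's time builtin:)
Code:
time sort -u -m file1 file2 > file3
time join -t " " -a1 -a2 file1 file2 > file3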
 
1 member found this post helpful.
Old 03-28-2021, 02:07 PM   #15
rdx
Member
 
Registered: Apr 2007
Location: Dallas
Distribution: Slackware64 14.2
Posts: 283

Original Poster
Rep: Reputation: 25
Quote:
Originally Posted by wpeckham View Post
Some better testing, using three trial runs with a larger data set on the same hardware: [...]
VERY NICE! The tip about eliminating 'cat' is great, thanks. I'm about to do some testing myself (took a nap), but I think you've made the choice clear.
 
  

