LinuxQuestions.org
Help answer threads with 0 replies.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 07-25-2016, 03:52 PM   #1
aachave1
LQ Newbie
 
Registered: Jul 2016
Posts: 3

Rep: Reputation: Disabled
Compare column 1 of two csv files and find the nearest match (>= and < logic)


Here are some examples (snippets of large files) with files that will always be in this format (no spaces). The first file example has file1 larger than file2 and some of the time stamps match, while some don't. The desired output will have the headers with them, but if too difficult, then I can put them in later. The second file example has file1 smaller than file2, so I need to attach a file1 row that best matches a file2 row (with >= and < logic). As you can see the time in seconds varies from both files so some will match exactly while some are close.

I can easily find all exact time matches, but not > and < comparisons.



Code:
This code (I got from a different forum) kind of works, but does not cover all scenarios and leaves out many of the beginning rows. Iv'e tried rearranging this code to provide the below "Desired" output, but no success.
awk -F, '
BEGIN           {CNT+=2
                }
NR == FNR       {a[NR] = $0
                 b[NR] = $1
                 next
                }
$1 >= b[CNT]    {CNT++
                }
$1 <  b[CNT]    {print a[CNT-1]
                 print $0, RS
                }
' file1 file2


File 1:
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:19,0,0,0,NO_DEF,-2147483647
2016/05/25 16:25:20,0,0,0,NO_DEF,-2147483648
2016/05/25 16:25:21,0,0,0,NO_DEF,-2147483649
2016/05/25 16:25:22,0,0,0,NO_DEF,-2147483650
2016/05/25 16:25:23,0,0,0,NO_DEF,-2147483651
2016/05/25 16:25:24,0,0,0,NO_DEF,-2147483652
2016/05/25 16:25:25,0,0,0,NO_DEF,-2147483653
2016/05/25 16:25:26,0,0,0,NO_DEF,-2147483654
2016/05/25 16:25:27,0,0,0,NO_DEF,-2147483655


File 2:

TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW)
2016/05/25 16:25:22,464374526,1464193527,206,0,0
2016/05/25 16:25:26,464374526,1464193532,206,0,0
2016/05/25 16:25:31,464374526,1464193537,207,0,0


Desired Output:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:22,0,0,0,NO_DEF,-2147483650
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW
2016/05/25 16:25:22,464374526,1464193527,206,0,0
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:26,0,0,0,NO_DEF,-2147483654
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW
2016/05/25 16:25:26,464374526,1464193532,206,0,0
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:27,0,0,0,NO_DEF,-2147483655
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW
2016/05/25 16:25:31,464374526,1464193537,207,0,0




Second example:

File 1:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2014/04/07 16:00:30,0,0,0,NO_DEF,-2147483647
2014/04/07 16:00:35,0,0,0,NO_DEF,-2147483648
2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653

File 2:

TIMEFORMATTED,CCSDS_VERSION,CCSDS_VERSION(RAW),CCSDS_TYPE,CCSDS_TYPE(RAW),CCSDS_2HDR_FLAG,CCSDS_2HDR_FLAG(RAW),ID
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:61,0,0,0,0,1,1,544
2014/04/07 16:00:63,0,0,0,0,1,9,544
2014/04/07 16:00:65,0,0,0,0,4,1,544
2014/04/07 16:00:67,0,0,0,0,1,1,544


Output: I prefer the headers to be attached like first output example, but I’ll take this if it is easier.

2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653
2014/04/07 16:00:61,0,0,0,0,1,1,544
Thank you for your time!!

Last edited by aachave1; 07-25-2016 at 07:45 PM.
 
Old 07-26-2016, 10:09 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
As you are dealing with time, you will need to convert the times shown into formats that awk can understand and use as actual times to then be able to use comparisons on. I would add that your current solution
is actually comparing 2 strings are equal and not 2 times, but in this case that should not be an issue.

Here you will find the necessary functions for conversion.
 
Old 07-26-2016, 02:13 PM   #3
aachave1
LQ Newbie
 
Registered: Jul 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thanks for the response. I thought that if awk had issues with the time format, it would work if I removed characters and whitespaces in my time stamp, but it didn't help.

WAS: 2016/05/25 16:25:19
IS: 20160525162519

I guess I can't seem to find that perfect awk code to perform this:

Nearest file2 time that is greater than file1 time or equal. So basically, file1 time that is right above a file2 time or equal, then pair them together. NOTE - All times are in ascending order

psuedo example: file2 value >= file1 value AND file2 value < the NEXT file1 value.


I apologize now if I am not relying my issue across properly. This is my first real forum.
Adrian
 
Old 07-26-2016, 09:43 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
You are correct that spacing is not an issue plus removing all the spaces and non-digit characters does give you a string of numbers but still will cause issues with the comparisons you are doing.

Did you look at the page I mentioned? It does supply the required working to do what you are after.
 
Old 07-26-2016, 10:24 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120
Slight mod should allow the mangled date-time to be used in straight comparisons I would have thought. Better than messing with format strings back and for with the time functions. Store the mangled $1 in the b array rather than the original $1 itself, and do the mangling (of file2 $1) in the comparison tests.

Won't solve the requirement, but should help get on the way. You will also need to handle eof conditions as with any merge type code.

Must admit I'm still trying to fathom what is wanted here.
 
Old 07-27-2016, 07:28 AM   #6
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,371

Rep: Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749
Quote:
Output: I prefer the headers to be attached like first output example, but I’ll take this if it is easier.
If you were to remove the header lines from the files ('tail -n +2 file1 > file1a' and 'tail -n +2 file2 > file2a'), then a simple 'sort -m file1a file2a' produces the following output from your second example data set:
Quote:
2014/04/07 16:00:30,0,0,0,NO_DEF,-2147483647
2014/04/07 16:00:35,0,0,0,NO_DEF,-2147483648
2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653
2014/04/07 16:00:61,0,0,0,0,1,1,544
2014/04/07 16:00:63,0,0,0,0,1,9,544
2014/04/07 16:00:65,0,0,0,0,4,1,544
2014/04/07 16:00:67,0,0,0,0,1,1,544
 
Old 07-27-2016, 08:53 AM   #7
aachave1
LQ Newbie
 
Registered: Jul 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
allend, I am not sure what this sort gets me? Your sample output shows sorted times, but they are not in the correct order as far as a file1 row on top of a file2 row (i.e every other row needs to alternate). Also, I only need the ones that are the closest or match exactly (I NEED every file2 row to be displayed, but not every file1 since sometimes file1 contains many more rows than needed).

The whole concept below is that file 1 contains metadata that needs to precede a file2 row wherever the timestamps match or are the closest (file2 value >= file1 value AND file2 value < the NEXT file1 value). Some file1 rows will be assigned to multiple file2 rows with same timestamp as long as the >= and < logic is met (i.e 16:00:45 and 16:00:50 are used more than once). I would prefer the appropriate headers to precede each row, but that may have to be done later.

2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544

2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:45,0,0,0,0,1,3,544

2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544

2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:49,0,0,0,0,4,1,544

2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544

2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:53,0,0,0,0,1,7,544

2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:55,0,0,0,0,8,1,544

Last edited by aachave1; 07-27-2016 at 09:07 AM.
 
Old 07-29-2016, 03:47 AM   #8
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,371

Rep: Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749Reputation: 2749
The order of the lines that results from 'sort -m' can be influenced by adding an additional column to the data in each file "sed 's/\(.*:..\)/\1,1/' file1" and "sed 's/\(.*:..\)/\1,2./' file2".
Then you simply need to insert lines as appropriate in the merged file. The following could be extended to also add back your desired headers.
Code:
awk -F ',' '{if ($2==1) {a=$0} else {print a; print $0}}'
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to print lines in csv file if 1 csv column field = "text". There are 10 column (;) in csv file nexuslinux Linux - Newbie 9 04-22-2016 11:35 PM
Compare two CSV files apoorvajoshi Linux - General 2 09-18-2014 02:31 AM
[SOLVED] Find the null value in the frist column of a csv file Vthimmap Linux - Newbie 8 10-07-2012 01:03 PM
How to read CSV data and compare the column values and then write them in new file VijayaRaghavanLakshman Linux - Newbie 9 01-26-2012 09:02 PM
awk - rearrange column data in csv file to match columns wolverene13 Programming 9 12-21-2011 04:55 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 01:22 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration