Compare column 1 of two csv files and find the nearest match (>= and < logic)

aachave1 · 07-25-2016, 03:52 PM

Here are some examples (snippets of large files). The files will always be in this format (no spaces). In the first example, file1 is larger than file2, and some of the time stamps match while others don't. I would like the output to include the headers, but if that is too difficult I can put them in later. In the second example, file1 is smaller than file2, so I need to attach to each file2 row the file1 row that best matches it (with >= and < logic). As you can see, the times in seconds differ between the two files, so some will match exactly while others are only close.

I can easily find all exact time matches, but not > and < comparisons.

This code (which I got from a different forum) kind of works, but it does not cover all scenarios and leaves out many of the beginning rows. I've tried rearranging it to produce the "Desired" output below, but with no success.

Code:
awk -F, '
BEGIN           {CNT+=2
                }
NR == FNR       {a[NR] = $0
                 b[NR] = $1
                 next
                }
$1 >= b[CNT]    {CNT++
                }
$1 <  b[CNT]    {print a[CNT-1]
                 print $0, RS
                }
' file1 file2


File 1:
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:19,0,0,0,NO_DEF,-2147483647
2016/05/25 16:25:20,0,0,0,NO_DEF,-2147483648
2016/05/25 16:25:21,0,0,0,NO_DEF,-2147483649
2016/05/25 16:25:22,0,0,0,NO_DEF,-2147483650
2016/05/25 16:25:23,0,0,0,NO_DEF,-2147483651
2016/05/25 16:25:24,0,0,0,NO_DEF,-2147483652
2016/05/25 16:25:25,0,0,0,NO_DEF,-2147483653
2016/05/25 16:25:26,0,0,0,NO_DEF,-2147483654
2016/05/25 16:25:27,0,0,0,NO_DEF,-2147483655


File 2:

TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW)
2016/05/25 16:25:22,464374526,1464193527,206,0,0
2016/05/25 16:25:26,464374526,1464193532,206,0,0
2016/05/25 16:25:31,464374526,1464193537,207,0,0


Desired Output:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:22,0,0,0,NO_DEF,-2147483650
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW)
2016/05/25 16:25:22,464374526,1464193527,206,0,0
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:26,0,0,0,NO_DEF,-2147483654
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW)
2016/05/25 16:25:26,464374526,1464193532,206,0,0
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:27,0,0,0,NO_DEF,-2147483655
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW)
2016/05/25 16:25:31,464374526,1464193537,207,0,0




Second example:

File 1:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2014/04/07 16:00:30,0,0,0,NO_DEF,-2147483647
2014/04/07 16:00:35,0,0,0,NO_DEF,-2147483648
2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653

File 2:

TIMEFORMATTED,CCSDS_VERSION,CCSDS_VERSION(RAW),CCSDS_TYPE,CCSDS_TYPE(RAW),CCSDS_2HDR_FLAG,CCSDS_2HDR_FLAG(RAW),ID
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:61,0,0,0,0,1,1,544
2014/04/07 16:00:63,0,0,0,0,1,9,544
2014/04/07 16:00:65,0,0,0,0,4,1,544
2014/04/07 16:00:67,0,0,0,0,1,1,544


Output: I prefer the headers to be attached like first output example, but I’ll take this if it is easier.

2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653
2014/04/07 16:00:61,0,0,0,0,1,1,544

Thank you for your time!!

grail · 07-26-2016, 10:09 AM

As you are dealing with time, you will need to convert the times shown into a format that awk can understand and treat as actual times before you can use comparisons on them. I would add that your current solution is actually comparing two strings rather than two times, but in this case that should not be an issue.

Here you will find the necessary functions for conversion.
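The page linked above is not preserved in this copy of the thread. gawk does ship a mktime() function for this kind of conversion, but as a rough sketch that avoids depending on gawk, the same conversion can be written by hand in portable awk. The function name to_epoch is made up for illustration, and the code assumes the timestamps are UTC with no leap seconds, which the thread never states.

```shell
# to_epoch "YYYY/MM/DD HH:MM:SS" -> seconds since 1970-01-01 00:00:00.
# Hypothetical helper; assumes UTC and a proleptic Gregorian calendar.
to_epoch() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "[/ :]")           # p[1..6] = year month day hour min sec
        y = p[1] + 0; m = p[2] + 0; d = p[3] + 0
        if (m <= 2) y--                # treat the year as starting on 1 March
        era = int(y / 400)             # 400-year Gregorian cycle
        yoe = y - era * 400            # year within the cycle
        mp  = (m > 2) ? m - 3 : m + 9  # March = 0 ... February = 11
        doy = int((153 * mp + 2) / 5) + d - 1
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        days = era * 146097 + doe - 719468   # days since 1970-01-01
        printf "%d\n", days * 86400 + p[4] * 3600 + p[5] * 60 + p[6]
    }'
}
```

Called as to_epoch "2016/05/25 16:25:19", it prints a plain integer that can be compared numerically. An out-of-range seconds field such as 16:00:60 is not rejected; it is simply added as 60 seconds.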

aachave1 · 07-26-2016, 02:13 PM

Thanks for the response. I thought that if awk had issues with the time format, it would work if I removed characters and whitespaces in my time stamp, but it didn't help.

WAS: 2016/05/25 16:25:19
IS: 20160525162519

I guess I can't seem to find that perfect awk code to perform this:

Nearest file2 time that is greater than file1 time or equal. So basically, file1 time that is right above a file2 time or equal, then pair them together. NOTE - All times are in ascending order

pseudo example: file2 value >= file1 value AND file2 value < the NEXT file1 value.

I apologize now if I am not getting my issue across properly. This is my first real forum.
Adrian
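One way to read the rule stated in this post (file2 value >= file1 value AND file2 value < the NEXT file1 value) is sketched below. This is only a sketch under a few assumptions: both files are sorted ascending, each has exactly one header line, and the timestamps are fixed-width so that plain string comparison orders them correctly. The function name pair_rows is invented for the example, and a file2 row that is earlier than every file1 row is printed on its own.

```shell
# pair_rows FILE1 FILE2  (hypothetical helper name)
# For every data row of FILE2, print the latest FILE1 row whose timestamp
# is <= the FILE2 timestamp, followed by the FILE2 row itself.
pair_rows() {
    awk -F, '
        FNR == 1 { next }              # skip the header line of each file
        NR == FNR {                    # first file: remember rows and times
            row[++n] = $0
            ts[n] = $1
            next
        }
        {
            # Advance while the NEXT file1 time is still <= this file2 time.
            # Appending "" forces a string comparison.
            while (i < n && (ts[i + 1] "") <= ($1 "")) i++
            if (i > 0) print row[i]    # no file1 row yet: print file2 row alone
            print
        }
    ' "$1" "$2"
}
```

Usage would be something like: pair_rows file1 file2 > paired.csv. Because the index i only moves forward, each file is read once, and the last file1 row keeps being reused after file1 runs out.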

grail · 07-26-2016, 09:43 PM

You are correct that spacing is not an issue. Removing all the spaces and non-digit characters does give you a string of digits, but it will still cause issues with the comparisons you are doing.

Did you look at the page I mentioned? It does supply the required working to do what you are after.

syg00 · 07-26-2016, 10:24 PM

A slight mod should allow the mangled date-time to be used in straight comparisons, I would have thought. Better than messing with format strings back and forth with the time functions. Store the mangled $1 in the b array rather than the original $1 itself, and do the mangling (of file2 $1) in the comparison tests.

Won't solve the requirement, but should help get on the way. You will also need to handle eof conditions as with any merge type code.

Must admit I'm still trying to fathom what is wanted here.
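As a small illustration of the "mangled" date-time idea in this post, the sketch below strips every non-digit from a timestamp and compares the results as numbers. The helper names ts_key and ts_not_after are invented for the example. A 14-digit key such as 20160525162519 is well within the range awk's floating-point numbers represent exactly, so the numeric comparison is safe.

```shell
# ts_key: read timestamps on stdin, print the digits-only key for each line.
ts_key() {
    awk '{ k = $0; gsub(/[^0-9]/, "", k); print k }'
}

# ts_not_after A B: exit status 0 when timestamp A <= timestamp B.
ts_not_after() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        gsub(/[^0-9]/, "", a)          # "2016/05/25 16:25:19" -> "20160525162519"
        gsub(/[^0-9]/, "", b)
        exit !((a + 0) <= (b + 0))     # + 0 forces a numeric comparison
    }'
}
```

For example, echo "2016/05/25 16:25:19" | ts_key prints 20160525162519, and ts_not_after "2016/05/25 16:25:19" "2016/05/25 16:25:22" succeeds. The same gsub call could be applied to $1 before storing it in the b array of the original script.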

allend · 07-27-2016, 07:28 AM

Quote:

Output: I prefer the headers to be attached like first output example, but I’ll take this if it is easier.

If you were to remove the header lines from the files ('tail -n +2 file1 > file1a' and 'tail -n +2 file2 > file2a'), then a simple 'sort -m file1a file2a' produces the following output from your second example data set:

Quote:

2014/04/07 16:00:30,0,0,0,NO_DEF,-2147483647
2014/04/07 16:00:35,0,0,0,NO_DEF,-2147483648
2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653
2014/04/07 16:00:61,0,0,0,0,1,1,544
2014/04/07 16:00:63,0,0,0,0,1,9,544
2014/04/07 16:00:65,0,0,0,0,4,1,544
2014/04/07 16:00:67,0,0,0,0,1,1,544

aachave1 · 07-27-2016, 08:53 AM

allend, I am not sure what this sort gets me. Your sample output shows sorted times, but the rows are not in the correct order, with a file1 row on top of each file2 row (i.e. every other row needs to alternate). Also, I only need the file1 rows that are the closest or match exactly (I NEED every file2 row to be displayed, but not every file1 row, since file1 sometimes contains many more rows than needed).

The whole concept below is that file1 contains metadata that needs to precede a file2 row wherever the timestamps match or are the closest (file2 value >= file1 value AND file2 value < the NEXT file1 value). Some file1 rows will be attached to multiple file2 rows as long as the >= and < logic is met (i.e. the 16:00:45 and 16:00:50 rows are used more than once). I would prefer the appropriate headers to precede each row, but that may have to be done later.

2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544

2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:45,0,0,0,0,1,3,544

2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544

2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:49,0,0,0,0,4,1,544

2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544

2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:53,0,0,0,0,1,7,544

2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:55,0,0,0,0,8,1,544

allend · 07-29-2016, 03:47 AM

The order of the lines that results from 'sort -m' can be influenced by adding an additional column to the data in each file: "sed 's/\(.*:..\)/\1,1/' file1" and "sed 's/\(.*:..\)/\1,2/' file2".
Then you simply need to insert lines as appropriate in the merged file. The following could be extended to also add back your desired headers.

Code:

awk -F ',' '{if ($2==1) {a=$0} else {print a; print $0}}'
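Putting the pieces of this approach together might look like the sketch below. It is an illustration only: the function name merge_pairs, the temporary files, the LC_ALL=C setting (used so that ",1" reliably sorts before ",2") and the final sed that strips the tag column again are all additions for the example, not part of the post above. It also skips printing the stored file1 row when none has been seen yet.

```shell
# merge_pairs FILE1 FILE2  (hypothetical helper name)
# Tag file1 rows with ",1" and file2 rows with ",2" after the timestamp,
# merge the two sorted streams, then print the most recent file1 row
# before each file2 row.
merge_pairs() {
    t1=$(mktemp) t2=$(mktemp)
    tail -n +2 "$1" | sed 's/\(.*:..\)/\1,1/' > "$t1"   # drop header, add tag
    tail -n +2 "$2" | sed 's/\(.*:..\)/\1,2/' > "$t2"
    LC_ALL=C sort -m "$t1" "$t2" |
        awk -F, '$2 == 1 { a = $0; next }               # remember file1 row
                 { if (a != "") print a; print }' |     # emit pair per file2 row
        sed 's/^\([^,]*\),[12],/\1,/'                   # remove the tag column
    rm -f "$t1" "$t2"
}
```

Usage would be merge_pairs file1 file2. The sed tagging relies on the timestamp holding the only colons on each data line, which is true of the sample data in this thread.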