file comparison of huge files

kaaliakahn · 01-06-2012, 11:42 AM

Hi all,
This is my first post to the forum. I am very happy to see your contribution. I am eager to become part of it.

I have the following question. I have two huge files to compare (almost 3GB each). The files are simulation outputs. The format of the files is shown below.

(please also see attached figure if the following table is confusing)

File 1                     File 2
-----------------------    -----------------------
Time  sig_name  sig_val    Time  sig_name  sig_val

0ns   sig1      0          0ns   sig1      0
0ns   sig2      0          0ns   sig2      1
0ns   sig3      0          0ns   sig3      1
0ns   sig4      1          0ns   sig4      1
1ns   sig1      0          1ns   sig1      0
1ns   sig2      0          1ns   sig2      0
1ns   sig3      0          1ns   sig3      0
1ns   sig4      0          1ns   sig4      0
2ns   sig1      0          2ns   sig1      1
2ns   sig2      0          2ns   sig2      0
2ns   sig3      0          2ns   sig3      0
2ns   sig4      0          2ns   sig4      0
3ns   sig1      1          3ns   sig1      0
3ns   sig2      0          3ns   sig2      1
3ns   sig3      0          3ns   sig3      0
3ns   sig4      0          3ns   sig4      0

Given the two files in the above format, how can I print the following table using an "efficient" file comparison? Efficiency is required because each file is about 3GB.

signal  number_of_mismatches  time_of_mismatch
------  --------------------  ----------------
sig1    2                     2ns, 3ns
sig2    2                     0ns, 3ns
sig3    1                     0ns
sig4    0

I shall really appreciate your response.

thesnow · 01-06-2012, 12:31 PM

Do you have the option to import the data into a DBMS and write SQL to do the comparison?

kaaliakahn · 01-06-2012, 01:53 PM

No, there is no option to use SQL.

I believe this should be pretty straightforward with grep, sed and awk?

Any expert's opinion?

TB0ne · 01-06-2012, 02:46 PM

Quote:

Originally Posted by kaaliakahn

No, there is no option to use SQL.
I believe this should be pretty straightforward with grep, sed and awk?
Any expert's opinion?

Yes, there IS an option to use SQL. If you have the data in a text file, write a quick little program to read it, and import it into a MySQL (or whatever) database/table. Perform your operations on it there. That's the quickest way to do it. You *CAN* stick something together with grep, etc., but for 3GB files, it'll take a good bit of time to work, although the script itself may be simple.

If this is something you have to do every now and then, writing the import routine would be the best option. Plus, you then have more flexibility in what you can look at, and how. If it's SQL, you can hit it with Excel/OpenOffice, get charts/graphs, and do ad-hoc reporting very easily.

Dark_Helmet · 01-06-2012, 04:25 PM

Quote:

Originally Posted by TB0ne

Yes, there IS an option to use SQL

Not necessarily. Maybe he does not have administrative rights to install a SQL server (or any other software for that matter). It may not be cost effective for them to install SQL support for this single task. There could be any number of reasons why SQL is not an option.

I do agree that pipelining commands will take a long time given that the files are 3GB in size. In that case "efficiency" is really meaningless.

I would suggest writing a custom program for this if runtime, memory usage, etc. is a primary concern.

I don't want to write a standard-utility-using script because I see some potential headaches in implementing it, and I don't like headaches. That said, this simple pipeline should give a starting point for a script. It removes all the lines that are identical--leaving one line per file where there is a difference.

The script then needs to parse the results two lines at a time, identify the signal with the inconsistency, add 1 to a counter for that signal, record the time the inconsistency occurred, and print out the fancy report. Parsing gives me a headache, and associative arrays give me a headache.

Code:

$ cat file1.txt
0ns sig1 0
0ns sig2 0
0ns sig3 0
0ns sig4 1
1ns sig1 0
1ns sig2 0
1ns sig3 0
1ns sig4 0
2ns sig1 0
2ns sig2 0
2ns sig3 0
2ns sig4 0
3ns sig1 1
3ns sig2 0
3ns sig3 0
3ns sig4 0
$ cat file2.txt
0ns sig1 0
0ns sig2 1
0ns sig3 1
0ns sig4 1
1ns sig1 0
1ns sig2 0
1ns sig3 0
1ns sig4 0
2ns sig1 1
2ns sig2 0
2ns sig3 0
2ns sig4 0
3ns sig1 0
3ns sig2 1
3ns sig3 0
3ns sig4 0
$ cat file1.txt file2.txt | sort | uniq -u
0ns sig2 0
0ns sig2 1
0ns sig3 0
0ns sig3 1
2ns sig1 0
2ns sig1 1
3ns sig1 0
3ns sig1 1
3ns sig2 0
3ns sig2 1
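
The parsing step described above could be sketched with awk roughly as follows. This is only a sketch under stated assumptions: the function name pair_report is made up for illustration, both files are assumed to contain exactly the same (time, signal) lines with no header row, and no line is repeated within a file. Instead of reading two lines at a time, it skips the second line of each mismatching pair by remembering the previous time and signal.

```shell
# Hypothetical helper: $1 and $2 are the two simulation dumps.
pair_report() {
    # sort merges both files; uniq -u keeps only lines without an identical twin.
    LC_ALL=C sort "$1" "$2" | uniq -u |
    awk '
        {
            key = $1 " " $2
            if (key == prev) next      # second line of a mismatching pair
            prev = key
            count[$2]++
            times[$2] = (count[$2] > 1 ? times[$2] ", " : "") $1
        }
        END {
            for (sig in count)
                printf "%s %d %s\n", sig, count[sig], times[sig]
        }' | LC_ALL=C sort
}
```

Called as pair_report file1.txt file2.txt, it prints one line per signal that has at least one mismatch. Two limitations follow from the pipeline itself: signals with zero mismatches never show up, because uniq -u has already dropped their lines, and times are listed in the lexical order produced by sort, so 10ns would come before 2ns. For multi-gigabyte inputs, sort also needs temporary disk space.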

lithos · 01-06-2012, 05:09 PM

What about

Code:

# diff --speed-large-files --suppress-common-lines file1 file2

From the diff man page:

Code:

   --speed-large-files
       Assume large files and many scattered small changes.

   --suppress-common-lines
       Do not output common lines.
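
To turn the diff output into the per-signal table, something along these lines might work. It is a sketch, not a tested recipe for 3GB inputs: diff_report is a hypothetical name, GNU diff is assumed, and both files are assumed to list the same (time, signal) lines in the same order, so every difference shows up as a changed line and counting the "<" lines is enough. As far as the GNU diff documentation describes it, --suppress-common-lines only affects side-by-side (-y) output, and the normal format already omits common lines, so the sketch leaves that flag out.

```shell
# Hypothetical helper: $1 and $2 are the two simulation dumps.
diff_report() {
    # diff exits with status 1 when the files differ; that is not an error here.
    { diff --speed-large-files "$1" "$2" || [ "$?" -eq 1 ]; } |
    awk '
        # Lines from the first file look like: < 0ns sig2 0
        $1 == "<" {
            sig = $3
            if (!(sig in count)) order[++n] = sig
            times[sig] = ((sig in count) ? times[sig] ", " : "") $2
            count[sig]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %d %s\n", order[i], count[order[i]], times[order[i]]
        }'
}
```

Because diff keeps the original line order, the times are reported in file order rather than in sort order. Signals that never differ are not listed, since diff prints nothing for them.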

TB0ne · 01-07-2012, 10:45 AM

Nice one lithos, and good point Dark_Helmet.

I'd go the SQL route myself, though, for the aforementioned reasons. Even if the OP doesn't have admin rights, an administrator can create one database in MySQL, and grant full access to one user, so the mysql statements can work, even if you're just a 'regular' user, and let them get things done.

kaaliakahn · 01-07-2012, 12:28 PM

Hi Guys,

Thanks a lot for looking at my thread. The reason SQL is not an option is because

1) I don't have a clue about SQL
2) If I write it as a program, I am not sure about its portability
3) Aren't Linux utilities good enough to do it?

Hope it makes sense. I really appreciate all of you from the core of my heart for putting your comments and making it easy for me to do it.

Kind Regards,
kaaliakahn

TB0ne · 01-07-2012, 07:04 PM

Quote:

Originally Posted by kaaliakahn

Hi Guys,
Thanks a lot for looking at my thread. The reason SQL is not an option is because

1) I don't have a clue about SQL
2) If I write it as a program, I am not sure about its portability
3) Aren't Linux utilities good enough to do it?

Hope it makes sense. I really appreciate all of you from the core of my heart for putting your comments and making it easy for me to do it.

No worries. Linux utilities CAN be used for this, but again, it depends on what I posted previously, which you didn't answer.
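
As one illustration of the standard-utilities route, a single pass over both files with paste and awk could look like the sketch below. The assumptions are mine, not from the thread: lockstep_report is a made-up name, the files have no header row, and line N of each file describes the same time and signal. The sketch stops with an error if that last assumption does not hold. Neither file is sorted or loaded into memory; awk only keeps one small entry per signal.

```shell
# Hypothetical helper: $1 and $2 are the two simulation dumps.
lockstep_report() {
    paste -d' ' "$1" "$2" |
    awk '
        # Fields 1-3 come from the first file, 4-6 from the second.
        $1 != $4 || $2 != $5 {
            printf "line %d: files are not aligned\n", NR > "/dev/stderr"
            bad = 1
            exit 1
        }
        {
            if (!($2 in count)) { order[++n] = $2; count[$2] = 0 }
            # Appending "" forces a string comparison of the two values.
            if (($3 "") != ($6 "")) {
                times[$2] = (count[$2] ? times[$2] ", " : "") $1
                count[$2]++
            }
        }
        END {
            if (bad) exit 1
            for (i = 1; i <= n; i++) {
                s = order[i]
                printf "%s %d%s\n", s, count[s], (count[s] ? " " times[s] : "")
            }
        }'
}
```

Run as lockstep_report file1.txt file2.txt, this lists every signal in order of first appearance, including those with zero mismatches, which is the shape of the table asked for in the first post.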

SQL isn't difficult to learn, and a program that uses MySQL (it could even BE a bash script that just queries MySQL) is just as portable as any other Linux script. Here are some links that show how easy SQL can be, especially for something simple like this:
http://www.yolinux.com/TUTORIALS/Lin...rialMySQL.html
http://www.linuxforums.org/forum/pro...rt-script.html
http://www.unix.com/shell-programmin...ll-script.html

I think it's the best way to go, since you could then have lots of flexibility in your reporting and access. Even a Windows user could hit it with the ODBC driver from Excel, and perform their own reports. But, this is up to you to figure out....it has to be what you are comfortable with.

schneidz · 01-07-2012, 10:12 PM

Maybe a C program using strncmp would be slightly faster?