LinuxQuestions.org
Old 01-06-2012, 11:42 AM   #1
kaaliakahn
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Rep: Reputation: Disabled
file comparison of huge files


Hi all,
This is my first post to the forum. I am very happy to see your contribution. I am eager to become part of it.

I have the following question. I have two huge files to compare (almost 3GB each). The files are simulation outputs. The format of the files is shown below.

(please also see attached figure if the following table is confusing)

File 1                       File 2
-----------------------      -----------------------
Time  sig_name  sig_val      Time  sig_name  sig_val

0ns   sig1      0            0ns   sig1      0
0ns   sig2      0            0ns   sig2      1
0ns   sig3      0            0ns   sig3      1
0ns   sig4      1            0ns   sig4      1
1ns   sig1      0            1ns   sig1      0
1ns   sig2      0            1ns   sig2      0
1ns   sig3      0            1ns   sig3      0
1ns   sig4      0            1ns   sig4      0
2ns   sig1      0            2ns   sig1      1
2ns   sig2      0            2ns   sig2      0
2ns   sig3      0            2ns   sig3      0
2ns   sig4      0            2ns   sig4      0
3ns   sig1      1            3ns   sig1      0
3ns   sig2      0            3ns   sig2      1
3ns   sig3      0            3ns   sig3      0
3ns   sig4      0            3ns   sig4      0

Given two files in the above format, how can I print the following table using an efficient file comparison? Efficiency matters because the files are around 3GB each.



signal  number_of_mismatches  time_of_mismatch
------  --------------------  ----------------
sig1    2                     2ns, 3ns
sig2    2                     0ns, 3ns
sig3    1                     0ns
sig4    0


I would really appreciate your response.
Attachment: filecompare.jpg (87.7 KB)

Last edited by kaaliakahn; 01-06-2012 at 01:56 PM.
 
Old 01-06-2012, 12:31 PM   #2
thesnow
Member
 
Registered: Nov 2010
Location: Minneapolis, MN
Distribution: Ubuntu, Red Hat, Mint
Posts: 172

Rep: Reputation: 56
Do you have the option to import the data into a DBMS and write SQL to do the comparison?
 
Old 01-06-2012, 01:53 PM   #3
kaaliakahn
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
No, there is no option to use SQL.

I believe this should be pretty straightforward with grep, sed and awk?

Any expert opinions?
 
Old 01-06-2012, 02:46 PM   #4
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by kaaliakahn
No, there is no option to use SQL.
I believe this should be pretty straightforward with grep, sed and awk?
Any expert opinions?
Yes, there IS an option to use SQL. If you have the data in a text file, write a quick little program to read it, and import it into a MySQL (or whatever) database/table. Perform your operations on it there. That's the quickest way to do it. You *CAN* stick something together with grep, etc., but for 3GB files, it'll take a good bit of time to work, although the script itself may be simple.

If this is something you have to do every now and then, writing the import routine would be the best option. Plus, you then have more flexibility in what you can look at, and how. If it's SQL, you can hit it with Excel/OpenOffice, get charts/graphs, and do ad-hoc reporting very easily.
 
Old 01-06-2012, 04:25 PM   #5
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Quote:
Originally Posted by TB0ne
Yes, there IS an option to use SQL
Not necessarily. Maybe he does not have administrative rights to install an SQL server (or any other software, for that matter). It may not be cost-effective to install SQL support for this single task. There could be any number of reasons why SQL is not an option.

I do agree that pipelining commands will take a long time given that the files are 3GB in size. In that case "efficiency" is really meaningless.

I would suggest writing a custom program for this if runtime, memory usage, etc. are a primary concern.

I don't want to write a standard-utility-using script because I see some potential headaches in implementing it, and I don't like headaches. That said, this simple pipeline should give a starting point for a script. It removes all the lines that are identical--leaving one line per file where there is a difference.

The script then needs to parse the results two lines at a time, identify the signal with the inconsistency, add 1 to a counter for that signal, record the time the inconsistency occurred, and print out the fancy report. Parsing gives me a headache and associative arrays give me a headache.

Code:
$ cat file1.txt
0ns sig1 0
0ns sig2 0
0ns sig3 0
0ns sig4 1
1ns sig1 0
1ns sig2 0
1ns sig3 0
1ns sig4 0
2ns sig1 0
2ns sig2 0
2ns sig3 0
2ns sig4 0
3ns sig1 1
3ns sig2 0
3ns sig3 0
3ns sig4 0
$ cat file2.txt
0ns sig1 0
0ns sig2 1
0ns sig3 1
0ns sig4 1
1ns sig1 0
1ns sig2 0
1ns sig3 0
1ns sig4 0
2ns sig1 1
2ns sig2 0
2ns sig3 0
2ns sig4 0
3ns sig1 0
3ns sig2 1
3ns sig3 0
3ns sig4 0
$ cat file1.txt file2.txt | sort | uniq -u
0ns sig2 0
0ns sig2 1
0ns sig3 0
0ns sig3 1
2ns sig1 0
2ns sig1 1
3ns sig1 0
3ns sig1 1
3ns sig2 0
3ns sig2 1
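
The parsing step described above could look roughly like the sketch below. It is only a starting point, not a tested solution for 3GB inputs. The function name mismatch_report is made up for illustration, and the sketch assumes that both files list exactly the same (time, signal) pairs, that all times use the same unit, and that no line is repeated within a file, so every mismatch leaves exactly two adjacent lines after uniq -u.

```shell
# Hypothetical helper (name invented for this sketch).
# Usage: mismatch_report file1.txt file2.txt
# Assumes lines of the form "<time> <signal> <value>" and that every
# mismatch leaves exactly two adjacent lines after "uniq -u".
mismatch_report() {
    # LC_ALL=C gives plain byte ordering; -n orders the times numerically.
    LC_ALL=C sort -n "$1" "$2" | uniq -u |
    awk '
        # Only the first line of each two-line pair is needed: it carries
        # the time ($1) and the signal name ($2) of the mismatch.
        NR % 2 == 1 {
            n = ++count[$2]
            times[$2] = (n == 1) ? $1 : (times[$2] ", " $1)
        }
        END {
            for (sig in count)
                printf "%s %d %s\n", sig, count[sig], times[sig]
        }
    ' |
    LC_ALL=C sort    # awk array order is unspecified, so sort the report
}
```

Signals that never differ (sig4 in the sample data) do not appear in this report, because the pipeline throws away every identical line before awk sees it. Sorting two 3GB files is also likely to be the slow part.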

Last edited by Dark_Helmet; 01-06-2012 at 04:32 PM.
 
Old 01-06-2012, 05:09 PM   #6
lithos
Senior Member
 
Registered: Jan 2010
Location: SI : 45.9531, 15.4894
Distribution: CentOS, OpenNA/Trustix, testing desktop openSuse 12.1 /Cinnamon/KDE4.8
Posts: 1,144

Rep: Reputation: 217
What about:
Code:
# diff --speed-large-files --suppress-common-lines file1 file2
From the diff man page:
Code:
   --speed-large-files
       Assume large files and many scattered small changes.

   --suppress-common-lines
       Do not output common lines.
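
diff shows which lines differ, but the per-signal counts from the original question still have to be built from its output. If line N of both files always describes the same time and signal (an assumption; the sample data looks that way), one alternative is a single pass with paste and awk that avoids sorting altogether. The sketch below is hedged accordingly: the function name lockstep_report is invented here, and header lines or differing time formats are not handled.

```shell
# Hypothetical helper (name invented for this sketch).
# Usage: lockstep_report file1.txt file2.txt
# Assumes line N of both files holds the same "<time> <signal>" pair.
lockstep_report() {
    paste "$1" "$2" |
    awk '
        NF == 0 { next }    # skip lines that are blank in both files
        # After paste, fields 1-3 come from the first file, 4-6 from the second.
        $1 != $4 || $2 != $5 {
            printf "line %d: files are out of step\n", NR > "/dev/stderr"
            bad = 1
            exit
        }
        # Remember each signal in first-seen order so that signals
        # without any mismatch still show up in the report.
        !($2 in count) { count[$2] = 0; order[++n] = $2 }
        $3 != $6 {
            times[$2] = (count[$2]++ == 0) ? $1 : (times[$2] ", " $1)
        }
        END {
            if (bad) exit 1
            for (i = 1; i <= n; i++) {
                sig = order[i]
                line = sig " " count[sig]
                if (count[sig] > 0) line = line " " times[sig]
                print line
            }
        }
    '
}
```

Memory use here grows with the number of distinct signals and recorded mismatch times rather than with file size, which is the main reason to prefer a streaming pass for inputs of this size. If the files ever get out of step, the function prints a message to stderr and returns a non-zero status instead of a report.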
 
Old 01-07-2012, 10:45 AM   #7
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Nice one, lithos, and good point, Dark_Helmet.

I'd go the SQL route myself, though, for the aforementioned reasons. Even if the OP doesn't have admin rights, an administrator can create one database in MySQL, and grant full access to one user, so the mysql statements can work, even if you're just a 'regular' user, and let them get things done.
 
Old 01-07-2012, 12:28 PM   #8
kaaliakahn
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Hi Guys,

Thanks a lot for looking at my thread. SQL is not an option because:

1) I don't have a clue about SQL
2) If I write it as a program, I am not sure about its portability
3) Aren't Linux utilities good enough to do it?

Hope it makes sense. I really appreciate all of you from the core of my heart for putting your comments and making it easy for me to do it.

Kind Regards,
kaaliakahn
 
Old 01-07-2012, 07:04 PM   #9
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,636

Rep: Reputation: 7965
Quote:
Originally Posted by kaaliakahn
Hi Guys,
Thanks a lot for looking at my thread. SQL is not an option because:

1) I don't have a clue about SQL
2) If I write it as a program, I am not sure about its portability
3) Aren't Linux utilities good enough to do it?

Hope it makes sense. I really appreciate all of you from the core of my heart for putting your comments and making it easy for me to do it.
No worries. Linux utilities CAN be used for this, but again, it depends on what I posted previously, which you didn't answer.

SQL isn't a difficult thing to learn, and compared with a Linux script, a program that uses MySQL (it could even BE a bash script that just queries MySQL) is just as portable. Here are some links to show you how easy SQL can be, especially for something simple like this:
http://www.yolinux.com/TUTORIALS/Lin...rialMySQL.html
http://www.linuxforums.org/forum/pro...rt-script.html
http://www.unix.com/shell-programmin...ll-script.html

I think it's the best way to go, since you could then have lots of flexibility in your reporting and access. Even a Windows user could hit it with the ODBC driver from Excel, and perform their own reports. But, this is up to you to figure out....it has to be what you are comfortable with.
 
Old 01-07-2012, 10:12 PM   #10
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
Maybe a C program using strncmp would be slightly faster?
 
  

