Comparing text files...

jong357 · 03-31-2007, 12:29 PM

Greets all. I'm having a brain freeze and can't figure out this simple problem. Been in Windows too much I guess...

I have 2 files. File 1 is a complete list of what I need. File 2 has one or more lines missing but is more or less the same as file 1. I need to compare file 1 and 2 but only print the extra lines found in file 1. The files contain nothing but single words on each line, if that matters. I've looked into sdiff, cmp, awk, uniq et al. but am still stuck for some reason. None of those seem to do what I want, except for some sort of awk array maybe... But that still seems like overkill.

Thanks in advance for pointing out the obvious...

H_TeXMeX_H · 03-31-2007, 12:36 PM

'diff' should work fine... did you try that? Read 'man diff'.

jong357 · 03-31-2007, 12:57 PM

Yea, I looked at that. It outputs garbage along with what I need.

I tried doing something hackish like:

cat file1 >> file2
uniq -u file2 missing-text.txt

But it doesn't work.. missing-text.txt is the same as file2. Makes no sense.
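
For what it's worth, uniq only compares adjacent lines, so on unsorted input with no neighbouring repeats it has nothing to drop. A minimal sketch of that behaviour, assuming a scratch directory and the four sample words used later in this thread (the variable names are made up for illustration):

```shell
# Hypothetical scratch directory and sample words, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' cat boy dog bird > "$workdir/file1"
printf '%s\n' cat bird > "$workdir/file2"

# Same idea as "cat file1 >> file2", but written to a scratch file.
cat "$workdir/file2" "$workdir/file1" > "$workdir/combined"

# Unsorted: no two identical lines are adjacent, so uniq -u drops nothing.
unsorted_result=$(uniq -u "$workdir/combined")

# Sorted first: duplicates become adjacent and uniq -u can discard them.
sorted_result=$(sort "$workdir/combined" | uniq -u)

printf 'unsorted:\n%s\n\nsorted:\n%s\n' "$unsorted_result" "$sorted_result"
rm -r "$workdir"
```

With this sample data the unsorted run should echo all six combined lines back, while the sorted run leaves only boy and dog.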

H_TeXMeX_H · 03-31-2007, 01:08 PM

Try something like:

Code:

diff rc.S rc.S-backorigdrax | tail -n4 | sed 's/< //'

where 'tail -n' omits the first line, and "sed 's/< //'" gets rid of the '< '.

jong357 · 03-31-2007, 01:38 PM

tail -n outputs the last N lines; it doesn't omit the first line. That could short me a bunch of files...

Piping to sed isn't bad I guess (kludgy tho) but that doesn't work in all circumstances. Here is my output:
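
A quick sketch of the difference, using five made-up lines rather than real diff output: a plain number counts from the end, while a number with a leading + gives the line to start from.

```shell
# Five sample lines standing in for diff output (illustrative data only).
sample=$(printf '%s\n' one two three four five)

# tail -n 2 keeps the LAST two lines.
last_two=$(printf '%s\n' "$sample" | tail -n 2)

# tail -n +2 starts output AT line 2, i.e. it skips only the first line.
skip_first=$(printf '%s\n' "$sample" | tail -n +2)

printf 'last two:\n%s\n\nskip first:\n%s\n' "$last_two" "$skip_first"
```

So skipping a header line while keeping everything after it would be tail -n +2, not tail -n with a guessed count.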

Code:

$ diff file1 file2
14c14
< mkcfm
---
>

The only thing your command does is scoot the 'name' I want to the beginning of the line. I'd have to pipe multiple times to get it solo... Extremely kludgy. Also, I'm assuming the "---" is because there is a blank line in file2, which I don't need to account for. This just doesn't seem the way to go. Way too specific and sloppy to boot.
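
If the diff route were kept, one way to get the name solo with a single pipe would be sed's -n option combined with the p flag, which prints only the lines where the substitution matched. This is a sketch under assumptions (a scratch directory, the four-word sample from later in the thread), not what anyone in the thread actually ran:

```shell
# Hypothetical scratch directory and sample words, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' cat boy dog bird > "$workdir/file1"
printf '%s\n' cat bird > "$workdir/file2"

# diff marks lines found only in the first file with "< ".
# sed -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, so headers like "2,3d1", "---" and ">" lines vanish.
only_in_file1=$(diff "$workdir/file1" "$workdir/file2" | sed -n 's/^< //p')

printf '%s\n' "$only_in_file1"
rm -r "$workdir"
```

The ^ anchor keeps sed from touching a "< " that happens to appear in the middle of a line.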

I'd really like to get sort or uniq working. I guess I just don't understand uniq and why -u isn't doing anything.

H_TeXMeX_H · 03-31-2007, 01:45 PM

Not sure if a bash script is the best thing for this ... maybe perl ? I mean, you don't want 'kludgy', so ...

jong357 · 03-31-2007, 01:53 PM

Sure, I could write a perl script. I'd just tick everything in bash...

I suck at perl. Besides, this is going into an existing bash script anyway.

The term 'kludgy' is subjective I guess. I'd really like to keep it down to just 2 or 3 short lines. This is an extremely easy operation (should be anyway), it's just eluding me for some reason.

Thanks for your help thus far. I'm still open to suggestions. Especially clarification on correct usage of uniq (ditching all repeated lines in one file)...

gnashley · 03-31-2007, 02:06 PM

comm is what you want:
comm - compare two sorted files line by line
comm -3 suppress lines that appear in both files

Don't forget the -u option for sort, which may behave differently than sort itself.

jong357 · 03-31-2007, 02:21 PM

Cool. comm does seem to be what I want but check this out.

File1

Code:

cat
boy
dog
bird

File2

Code:

cat
bird

Code:

$ comm -3 file1 file2
        bird
boy
dog
bird

No good.

H_TeXMeX_H · 03-31-2007, 02:30 PM

try sorting before you run it

Code:

sort file1 > file1new
sort file2 > file2new
comm -3 file1new file2new
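
Since comm -3 still prints a second, tab-indented column for lines that exist only in the second file, adding -2 narrows the output to lines found only in the first file. A minimal sketch, assuming a scratch directory and made-up .sorted file names; LC_ALL=C is set here only so that sort and comm agree on ordering:

```shell
# Hypothetical scratch directory and sample words, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' cat boy dog bird > "$workdir/file1"
printf '%s\n' cat bird > "$workdir/file2"

# Use one fixed locale so sort and comm agree on ordering.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2" > "$workdir/file2.sorted"

# -2 hides lines found only in file2, -3 hides lines common to both,
# leaving column 1: lines found only in file1.
only_in_file1=$(LC_ALL=C comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n' "$only_in_file1"
rm -r "$workdir"
```

On the sample files from the previous post this should print boy and dog with no indentation.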

jong357 · 03-31-2007, 02:37 PM

Yea.. I thought of that after posting my last comment. It works... But... Now we are back into being kludgy again... Don't you just love that word?

I don't get why this has to be a multiple step process but whatever I guess... See, once I extrapolate the missing name, I have to fetch a version number to tag onto it, integrate it with the file that didn't have it and then sort them according to another 'order' file... I'm sorting twice, once with 'sort' and then thru a function I have to sort according to a static list.

It'll work tho so it's all good. Thanks guys!

simcox1 · 03-31-2007, 03:02 PM

You have two files. File1 and file2. You want to output only the differences.

Does this work?

cat file1 file2 | sort -u

The reason that uniq -u isn't working is that the text isn't sorted first. sort -u will show only unique lines with unsorted text.

You could also output it to a new file.

cat file1 file2 | sort -u > file3

jong357 · 03-31-2007, 03:53 PM

That doesn't work. That gives the same exact output that is in file1 to begin with. All I need is the missing bits.

Code:

$ cat file1 file2 | sort -u
bird
boy
cat
dog

$ cat file1
cat
boy
dog
bird

I don't NEED to sort anything. The only reason why I'm using 'sort' is because it seems to be necessary for comm to function. What I need is the missing words from file1...
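
For a route that involves no sorting at all, grep can treat every line of file2 as a fixed string and print the lines of file1 that match none of them. Nobody in the thread suggested this; it is offered only as a sketch of an alternative, with a scratch directory and variable names invented for the example:

```shell
# Hypothetical scratch directory and sample words, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' cat boy dog bird > "$workdir/file1"
printf '%s\n' cat bird > "$workdir/file2"

# -f file2 : read the patterns from file2, one per line
# -F       : treat patterns as fixed strings, not regular expressions
# -x       : a pattern must match the whole line ("bird" won't hit "birdcage")
# -v       : invert, i.e. print the lines of file1 that match no pattern
missing=$(grep -F -x -v -f "$workdir/file2" "$workdir/file1")

printf '%s\n' "$missing"
rm -r "$workdir"
```

The output keeps the order the words have in file1, which may or may not matter if the result is re-sorted against a static list afterwards.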

simcox1 · 03-31-2007, 04:10 PM

Yes. How about this.

cat file1 file2 | sort | uniq -u

cat file1 file2 | sort | uniq -d

The first one gives only unique lines. The second gives duplicates.
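
A small sketch of both commands on the sample files from earlier in the thread, assuming a scratch directory. One caveat worth keeping in mind: uniq -u reports any line that occurs exactly once across both files, so a word that exists only in file2 (the hypothetical "fish" below) would show up too. It gives the missing lines only when file2 contains nothing that file1 lacks.

```shell
# Hypothetical scratch directory and sample words, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' cat boy dog bird > "$workdir/file1"
printf '%s\n' cat bird > "$workdir/file2"

# Lines that occur exactly once across both files.
once=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -u)

# Lines that occur more than once, i.e. present in both files.
repeated=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -d)

# Hypothetical extra word that exists only in file2: it also occurs once,
# so uniq -u reports it even though it is not missing from file2.
printf '%s\n' fish >> "$workdir/file2"
once_with_extra=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -u)

printf 'once:\n%s\n\nrepeated:\n%s\n\nonce, with extra word in file2:\n%s\n' \
  "$once" "$repeated" "$once_with_extra"
rm -r "$workdir"
```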

jong357 · 03-31-2007, 04:29 PM

And there it is... The elegant one-liner that has been eluding me...

Funny thing is, I tried that after your first post but gave sort the u switch instead of calling it vanilla. Seems my ignorance with these commands and patience with 'man' was the only problem here...

Thanks again everyone. Much appreciated!!