LinuxQuestions.org


Stannjudy 04-14-2008 03:38 PM

more file comparison
 
I have searched and not found quite what will work for me. I am fairly new to this and need to compare two lists and delete the difference.

The problem I'm having is that the lines in the two lists have different lengths, e.g.

one list gives only the label:
ABC001L1
ABC002L1
ABD090L1

while the other gives more information:
ABC001L1 /filename/date/initials 0021 files
ABC050L1 /filename/differnetdate/initials 0034 files

and so on. What I need is to compare list1 against list2 and keep the labels that appear in both lists (that is, delete the ones that differ). I want to keep only the label names.

I've looked at diff -E, fgrep (which I don't fully understand - especially looking at the man page), and comm...none seem to do what I need.

Both lists are 4,000 to 6,000 entries.

Thanks,
Stan

colucix 04-14-2008 04:31 PM

Maybe join is what you're looking for:
Code:

$ cat file_one
ABC001L1
ABC002L1
ABD090L1

$ cat file_two
ABC001L1 /filename/date/initials 0021 files
ABC050L1 /filename/differnetdate/initials 0034 files

$ join -o 1.1 file_one file_two
ABC001L1

See man join for details and more options.

bigrigdriver 04-14-2008 04:42 PM

This looks like a problem designed for awk.

If the two files have the label on the left, you can use awk with a single space as the field separator. Each line is then split into fields, and each field holds one word or path that was delimited by spaces.

Use a loop that looks for the second field ($2 in awk). If it isn't empty then delete that line.
Else if the second field is empty, then write the contents of field 1 ($1) to another file.

That will give you two files with only the labels in them.

If you want to compare the two files in such a way that only unique labels remain (labels which are not common to both files), then use diff to compare the files and pipe the output through uniq to select only the labels that are unique, and write them to a single file.

Update: I just saw the solution proposed by colucix. His solution joins the two files, but doesn't remove the labels with description, leaving only the labels.

So, use join to join the two files, then use awk to select only the lines without descriptions as I described above, selecting on the basis of the content of the second field ($2).
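
A minimal sketch of the awk step described in this post, assuming the label is the first whitespace-separated field. The file names `mixed_list` and `labels_only` and the temporary directory are made up for illustration; awk's default field splitting is used rather than an explicit single-space separator.

```shell
# Hypothetical input: some lines are bare labels, others carry a description.
workdir=$(mktemp -d)
printf '%s\n' \
  'ABC001L1' \
  'ABC050L1 /filename/differnetdate/initials 0034 files' \
  'ABD090L1' > "$workdir/mixed_list"

# Keep only lines that have exactly one field (no second field, so no
# description) and write that label to another file.
awk 'NF == 1 { print $1 }' "$workdir/mixed_list" > "$workdir/labels_only"

cat "$workdir/labels_only"
```

Testing `NF == 1` is one way to express "the second field is empty"; a line with a description has more than one field and is skipped.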

colucix 04-14-2008 05:04 PM

Quote:

Originally Posted by bigrigdriver (Post 3121177)
Update: I just saw the solution proposed by colucix. His solution joins the two files, but doesn't remove the labels with description, leaving only the labels.

With the -o option you can control the output format, selecting one or more fields from one or both files. Indeed, in my example it keeps the labels only.

I was just thinking about a problem: if the two files are not sorted, some labels may be missed. You can circumvent this problem by passing the sorted file with process substitution:
Code:

join -o 1.1 <(sort file_one) <(sort file_two)
Anyway, I agree that some lines of awk code can give a finer control on the output format.
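
For shells without process substitution, roughly the same thing can be sketched with temporary sorted files. The sample data is the one from this thread, deliberately shuffled; the temporary directory and the `.sorted` file names are assumptions for illustration. Leaving out `-o` makes join print the whole joined line instead of only the label.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'ABD090L1' 'ABC001L1' 'ABC002L1' > "$workdir/file_one"
printf '%s\n' \
  'ABC050L1 /filename/differnetdate/initials 0034 files' \
  'ABC001L1 /filename/date/initials 0021 files' > "$workdir/file_two"

# Sort both inputs on the join field, in the same locale that join will use.
LC_ALL=C sort -k1,1 "$workdir/file_one" > "$workdir/one.sorted"
LC_ALL=C sort -k1,1 "$workdir/file_two" > "$workdir/two.sorted"

# -o 1.1 prints only the label taken from the first file.
labels=$(LC_ALL=C join -o 1.1 "$workdir/one.sorted" "$workdir/two.sorted")

# Without -o, join prints the label followed by the remaining fields.
full=$(LC_ALL=C join "$workdir/one.sorted" "$workdir/two.sorted")

printf '%s\n' "$labels" "$full"
```

Note that join rebuilds the output line from fields, so runs of spaces or tabs in the original line come out as single spaces.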

beadyallen 04-14-2008 07:45 PM

From what I can tell (and I'm probably wrong), you want to output a line in file2 if there's a corresponding line in file1. Is that right? If so, how about the following:
Code:

for x in $(cat file1);
do
  grep "^${x} " file2;
done

If you want to just keep the common label names, stick an if statement in there, like:
Code:

for x in $(cat file1)
do
  result=$(grep "^${x} " file2)
  if [ "$result" ]
  then
    echo $x
  fi
done

Is that what you're wanting?
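
With lists of several thousand entries, the loop above starts one grep per label. A possible single-pass alternative, sketched here with awk on the sample data from the first post (the temporary directory is an assumption for illustration), reads the labels into an array and then filters the second file once:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'ABC001L1' 'ABC002L1' 'ABD090L1' > "$workdir/file1"
printf '%s\n' \
  'ABC001L1 /filename/date/initials 0021 files' \
  'ABC050L1 /filename/differnetdate/initials 0034 files' > "$workdir/file2"

# While reading the first file (NR == FNR), remember each label.
# While reading the second file, print lines whose first field was remembered.
matches=$(awk 'NR == FNR { want[$1]; next } $1 in want' \
  "$workdir/file1" "$workdir/file2")

printf '%s\n' "$matches"
```

The matching lines come out in the order they have in the second file, so neither input needs to be sorted first. The `NR == FNR` test assumes the first file is not empty.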

Stannjudy 04-15-2008 10:57 AM

Thanks for all the great help and advice. Actually, beadyallen hit it on the head (I didn't really make myself clear enough). I do want the entire line from file2 if there is a corresponding label in file1. Maybe it would be clearer if I gave a wee bit of background. These are lists of tapes. File1 is from the dept. that wants certain tapes duplicated, file2 is from querying the db for all tapes within a given range (which covers what the dept wants and more). These need to be - and are - sorted by filefamily from the query when doing the duplication.

I tried running both of the snippets posted by beadyallen, but (and I'm only guessing here) maybe because I'm using bash as my shell, here are the results of the first:

-bash-3.00$ for x in $(cat migrate_now.txt); do grep "$(x) " new-mig-list >> mig_joined; done
-bash: x: command not found

and the second would not end with done:

-bash-3.00$ for x in $(cat migrate_now.txt); do result=$(grep "^${x} " new-mig-list
> if [ "$result" ]
> then
> echo $x
> fi
> done
>
-bash-3.00$ or x in $(cat migrate_now.txt)
-bash: or: command not found
-bash-3.00$ for x in $(cat migrate_now.txt)
> do result=$(grep "^${x} " new-mig-list
> if [ "$result" ]
> then
>
> echo $x
> fi
> done
>
>
Oh, and join gave me way too much info and too many duplicate entries. File1 is 5878 lines (from wc) to give you an idea of the scope.

thanks again...
Stanley

beadyallen 04-15-2008 12:07 PM

If you've cut and pasted the output you got, you've not typed it in properly. You've missed some brackets and used the wrong kind in places ('()' instead of '{}', etc.). Based on what you've written, the following should work if you just cut and paste it:
Code:

for x in $(cat migrate_now.txt);
do
  grep "^${x} " new-mig-list;
done >> mig_joined

or

Code:

for x in $(cat migrate_now.txt)
do
  result=$(grep "^${x} " new-mig-list)
  if [ "$result" ]
  then
    echo $x
  fi
done >> mig_joined

Oh, and they're both bash scripts.
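
The bracket mix-up matters because the two forms do different things in the shell. A small sketch, using one of the sample labels as the value of `x`:

```shell
x=ABC001L1

# ${x} is parameter expansion: it inserts the value of the variable x.
expanded="^${x} "

# $(...) is command substitution: it runs the command inside the
# parentheses and inserts that command's output.
substituted="^$(printf '%s' "$x") "

# Writing "$(x)" would therefore try to run a command named x, which is
# where the "x: command not found" message comes from.
printf '%s\n' "$expanded" "$substituted"
```

Both variables end up holding the same grep pattern, the label anchored at the start of the line and followed by a space.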

Stannjudy 04-15-2008 01:08 PM

I realized that later, on looking closer. Thanks. I did:
for x in $( < migrate_now.txt); do grep $x new-mig-list ;done > mig_joined

and that gave me what I needed. Now I just need to re-sort it by file families and it will be great.

Thanks for all your help and for everybody who chimed in. I certainly appreciate it and hopefully have even learned something!!

Stanley
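
One thing to watch with the final command: `grep $x` without the leading `^` and trailing space matches the label anywhere in a line. That may not matter if every label has the same fixed format, but the difference can be sketched with a made-up longer label (`ABC001L10` and its line are hypothetical, as is the temporary directory):

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'ABC001L1 /filename/date/initials 0021 files' \
  'ABC001L10 /filename/otherdate/initials 0007 files' > "$workdir/new-mig-list"

x=ABC001L1

# Unanchored: counts every line that contains the label as a substring.
loose=$(grep -c "$x" "$workdir/new-mig-list")

# Anchored: counts only lines that start with the label followed by a space.
strict=$(grep -c "^${x} " "$workdir/new-mig-list")

printf 'unanchored: %s, anchored: %s\n' "$loose" "$strict"
```

Here the unanchored pattern counts both lines, while the anchored one counts only the exact label.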

