compare columns of uneven length

bcrawl · 01-20-2011, 07:25 AM

I have a text file with 2 columns. Column A has 69,000 rows. Column B has 49,000 rows.

Column A has our complete product list Column B has product list from Manufacturer 1

There are only certain/some rows which are common between 2 columns. and also, column B is not a subset of column A. Column A has extra entries and so does column B.

I need to know, which rows from Column B, are common with Column A which rows from Column B are not common with Column A.

Essentially I want to know from this list, how many of our products are from my manufacturer 1, how many does my manufacturer has which we dont carry.

How would I acheive this? My natural approach to solving this kind of obstacle is to reach for MS excel and use its lookup function, but its not working... Its taking forever and hanging up. since the file is so huge and probably my excel skills are really bad.

How can I do this from command line? I am looking for awk command if possible instead of sed since I am trying to pick up its syntax and usage etc. My thought process is, sort column A,B, for every row in A, lookup and output based on condition. Dont know if I am on the right track. Any help would be great.

Thanks in advance.

druuna · 01-20-2011, 07:31 AM

Hi,

Have a look at the comm command.

Examples:

Code:

$ cat prodlist
a
b
c
d
e
f
g
h

$ cat manlist
a
d
g
j
l
p

$ comm -13 prodlist manlist
j
l
p

$ comm -12 prodlist manlist
a
d
g

comm -13 shows entries unique to manlist, comm -12 shows entries common to both lists.

Both lists do need to be sorted for this to work.

Hope this helps.

bcrawl · 01-20-2011, 07:35 AM

awesome, I will look into it. Thanks!

druuna · 01-20-2011, 07:35 AM

You're welcome

bcrawl · 01-20-2011, 08:18 AM

Quote:

Originally Posted by druuna

You're welcome

Hi, thank you for the comm tool. It worked perfectly. so I was wondering if you could help me streamline my process. right now Both prodlist and manlist are unsorted. So, I manually issued sort commands like so,

Code:

sort -u prodlist > prodlist1

Code:

sort -u manlist > manlist1

and then I am applying the

Code:

comm -12 prodlist manlist | wc -l

command to look for the count.

I guess what I wanted to ask was, how would I combine it so that I avoid the intermediae step of saving a sorted file separetely. Like I want to do something like this but the output is obviously wrong...

Code:

sort -u prodlist1 manlist1 | comm -13 prodlist manlist

thanks.

colucix · 01-20-2011, 08:21 AM

Use process substitution:

Code:

comm -12 <(sort -u prodlist) <(sort -u manlist) | wc -l

bcrawl · 01-20-2011, 08:37 AM

Quote:

Originally Posted by colucix

Use process substitution:

Code:

comm -12 <(sort -u prodlist) <(sort -u manlist) | wc -l

Hey, Thanks a lot. Really Helpfull!

colucix · 01-20-2011, 08:39 AM

You're welcome!