LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   compare columns of uneven length (https://www.linuxquestions.org/questions/linux-newbie-8/compare-columns-of-uneven-length-857508/)

bcrawl 01-20-2011 07:25 AM

compare columns of uneven length
 
I have a text file with 2 columns. Column A has 69,000 rows. Column B has 49,000 rows.

Column A has our complete product list Column B has product list from Manufacturer 1

There are only certain/some rows which are common between 2 columns. and also, column B is not a subset of column A. Column A has extra entries and so does column B.

I need to know, which rows from Column B, are common with Column A which rows from Column B are not common with Column A.

Essentially I want to know from this list, how many of our products are from my manufacturer 1, how many does my manufacturer has which we dont carry.

How would I acheive this? My natural approach to solving this kind of obstacle is to reach for MS excel and use its lookup function, but its not working... Its taking forever and hanging up. since the file is so huge and probably my excel skills are really bad.

How can I do this from command line? I am looking for awk command if possible instead of sed since I am trying to pick up its syntax and usage etc. My thought process is, sort column A,B, for every row in A, lookup and output based on condition. Dont know if I am on the right track. Any help would be great.

Thanks in advance.

druuna 01-20-2011 07:31 AM

Hi,

Have a look at the comm command.

Examples:
Code:

$ cat prodlist
a
b
c
d
e
f
g
h

$ cat manlist
a
d
g
j
l
p

$ comm -13 prodlist manlist
j
l
p

$ comm -12 prodlist manlist
a
d
g

comm -13 shows entries unique to manlist, comm -12 shows entries common to both lists.

Both lists do need to be sorted for this to work.

Hope this helps.

bcrawl 01-20-2011 07:35 AM

awesome, I will look into it. Thanks!

druuna 01-20-2011 07:35 AM

You're welcome :)

bcrawl 01-20-2011 08:18 AM

Quote:

Originally Posted by druuna (Post 4231515)
You're welcome :)

Hi, thank you for the comm tool. It worked perfectly. so I was wondering if you could help me streamline my process. right now Both prodlist and manlist are unsorted. So, I manually issued sort commands like so,
Code:

sort -u prodlist > prodlist1
Code:

sort -u manlist > manlist1
and then I am applying the
Code:

comm -12 prodlist manlist | wc -l
command to look for the count.

I guess what I wanted to ask was, how would I combine it so that I avoid the intermediae step of saving a sorted file separetely. Like I want to do something like this but the output is obviously wrong...
Code:

sort -u prodlist1 manlist1 | comm -13 prodlist manlist
thanks.

colucix 01-20-2011 08:21 AM

Use process substitution:
Code:

comm -12 <(sort -u prodlist) <(sort -u manlist) | wc -l

bcrawl 01-20-2011 08:37 AM

Quote:

Originally Posted by colucix (Post 4231580)
Use process substitution:
Code:

comm -12 <(sort -u prodlist) <(sort -u manlist) | wc -l

Hey, Thanks a lot. Really Helpfull!

colucix 01-20-2011 08:39 AM

You're welcome! :)


All times are GMT -5. The time now is 10:30 AM.