LinuxQuestions.org
Latest LQ Deal: Complete CCNA, CCNP & Red Hat Certification Training Bundle
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 01-20-2011, 08:25 AM   #1
bcrawl
LQ Newbie
 
Registered: Jan 2011
Posts: 11

Rep: Reputation: 0
compare columns of uneven length


I have a text file with 2 columns. Column A has 69,000 rows. Column B has 49,000 rows.

Column A has our complete product list Column B has product list from Manufacturer 1

There are only certain/some rows which are common between 2 columns. and also, column B is not a subset of column A. Column A has extra entries and so does column B.

I need to know, which rows from Column B, are common with Column A which rows from Column B are not common with Column A.

Essentially I want to know from this list, how many of our products are from my manufacturer 1, how many does my manufacturer has which we dont carry.

How would I acheive this? My natural approach to solving this kind of obstacle is to reach for MS excel and use its lookup function, but its not working... Its taking forever and hanging up. since the file is so huge and probably my excel skills are really bad.

How can I do this from command line? I am looking for awk command if possible instead of sed since I am trying to pick up its syntax and usage etc. My thought process is, sort column A,B, for every row in A, lookup and output based on condition. Dont know if I am on the right track. Any help would be great.

Thanks in advance.
 
Old 01-20-2011, 08:31 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387
Hi,

Have a look at the comm command.

Examples:
Code:
$ cat prodlist
a
b
c
d
e
f
g
h

$ cat manlist
a
d
g
j
l
p

$ comm -13 prodlist manlist
j
l
p

$ comm -12 prodlist manlist
a
d
g
comm -13 shows entries unique to manlist, comm -12 shows entries common to both lists.

Both lists do need to be sorted for this to work.

Hope this helps.
 
Old 01-20-2011, 08:35 AM   #3
bcrawl
LQ Newbie
 
Registered: Jan 2011
Posts: 11

Original Poster
Rep: Reputation: 0
awesome, I will look into it. Thanks!
 
Old 01-20-2011, 08:35 AM   #4
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387Reputation: 2387
You're welcome
 
Old 01-20-2011, 09:18 AM   #5
bcrawl
LQ Newbie
 
Registered: Jan 2011
Posts: 11

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by druuna View Post
You're welcome
Hi, thank you for the comm tool. It worked perfectly. so I was wondering if you could help me streamline my process. right now Both prodlist and manlist are unsorted. So, I manually issued sort commands like so,
Code:
sort -u prodlist > prodlist1
Code:
sort -u manlist > manlist1
and then I am applying the
Code:
comm -12 prodlist manlist | wc -l
command to look for the count.

I guess what I wanted to ask was, how would I combine it so that I avoid the intermediae step of saving a sorted file separetely. Like I want to do something like this but the output is obviously wrong...
Code:
sort -u prodlist1 manlist1 | comm -13 prodlist manlist
thanks.
 
Old 01-20-2011, 09:21 AM   #6
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976
Use process substitution:
Code:
comm -12 <(sort -u prodlist) <(sort -u manlist) | wc -l
 
Old 01-20-2011, 09:37 AM   #7
bcrawl
LQ Newbie
 
Registered: Jan 2011
Posts: 11

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by colucix View Post
Use process substitution:
Code:
comm -12 <(sort -u prodlist) <(sort -u manlist) | wc -l
Hey, Thanks a lot. Really Helpfull!
 
Old 01-20-2011, 09:39 AM   #8
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976Reputation: 1976
You're welcome!
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to compare two columns in a file. shilpa.godhe Linux - Newbie 2 03-29-2010 03:42 AM
how to make several columns in a text file of different lengths the same length part2 Mike_V Programming 11 05-15-2009 02:44 AM
how to make several columns in a text file of different lengths the same length Mike_V Programming 3 04-23-2009 10:17 PM
compare $php variable to indexed distinct mysql columns secretlydead Programming 1 02-18-2008 11:48 PM


All times are GMT -5. The time now is 08:47 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration