LinuxQuestions.org
05-22-2012, 05:02 PM   #1
seabro
LQ Newbie
 
Registered: Jan 2010
Posts: 24

Rep: Reputation: 0
text file processing question


Hi all,

I have a dilemma I hope you can help me solve.

I have a largish text file that I want to process.

The file has 4 columns separated by tabs, so it looks like this:


fred john dave pete
dave pete terry phil
john dave pete fred

I would like to remove all lines where there are more than one duplicate entry in column 4.

I am not looking to de-duplicate; I want to completely remove every line whose column 4 value occurs more than once.

So if 2 or more entries in column 4 are the same, remove all of those rows. I want to be left with only the rows whose column 4 value occurs exactly once.

Could you help with this? If so, thanks in advance.

seabro

PS: I am using CentOS, so I guess tools like grep and awk might do it; I just don't know how.

Last edited by seabro; 05-22-2012 at 05:06 PM.
 
05-22-2012, 05:30 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,084

Rep: Reputation: 287
Quote:
Originally Posted by seabro
I would like to remove all lines where there are more than one duplicate entry in column 4.
As you describe the file and as you show by example, column 4 consists of one word. Consequently it is impossible to have a duplicate entry in column 4. Please elaborate or correct your problem statement.

It is always helpful to give two sample files, a Before and an After. You gave a Before but no After.

Daniel B. Martin
 
05-22-2012, 05:34 PM   #3
seabro
LQ Newbie
 
Registered: Jan 2010
Posts: 24

Original Poster
Rep: Reputation: 0
hi Daniel,

thanks for your reply.

I guess I forgot to add that the test for duplication should be applied vertically, down column 4.

before

bob jon fred pete
fred john phil dave
mike phil john pete
fred jack bob seabro

after

fred john phil dave
fred jack bob seabro


I hope that makes more sense!
thanks again,
Seabro
 
05-22-2012, 06:23 PM   #4
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,492

Rep: Reputation: 1956
Code:
awk 'NR == FNR { _[$4]++ } NR > FNR && _[$4] == 1' file file
The double argument is not an error: it makes awk read the file twice. On the first pass it counts how many times each name occurs in field 4; on the second pass it prints only the lines whose field-4 name occurred exactly once. Hope this helps.
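
To see the two-pass idea on real data, here is a rough sketch that rebuilds the "before" sample from post #3 in a temporary file and runs the same logic on it. The temp file, the array name `count`, the `next` statement and the `-F'\t'` option are my own additions, on the assumption that the columns really are tab-separated; the command posted above works without them.

```shell
#!/bin/sh
# Recreate the "before" sample from post #3 as a tab-separated temp file.
# The temp file and variable names are illustrative only.
sample=$(mktemp)
printf 'bob\tjon\tfred\tpete\n'    >  "$sample"
printf 'fred\tjohn\tphil\tdave\n'  >> "$sample"
printf 'mike\tphil\tjohn\tpete\n'  >> "$sample"
printf 'fred\tjack\tbob\tseabro\n' >> "$sample"

# Pass 1 (NR == FNR): count how often each column-4 value occurs.
# Pass 2: print only lines whose column-4 value was counted exactly once.
result=$(awk -F'\t' 'NR == FNR { count[$4]++; next } count[$4] == 1' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On that sample this should leave only the "dave" and "seabro" rows, matching the "after" listing in post #3.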
 
3 members found this post helpful.
05-22-2012, 07:43 PM   #5
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 583

Rep: Reputation: 129
Code:
sort -k4 test.txt|uniq -f3 -u
Sort on the fourth field, then let uniq skip the first three fields and drop every line whose remaining field is repeated. The output is ordered by column 4, not as in the original file.
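
A minimal sketch of the same pipeline run against the "before" sample from post #3, assuming tab-separated columns with no empty fields. The temp file and variable names are made up for illustration, and `LC_ALL=C` is added only to keep the sort order predictable.

```shell
#!/bin/sh
# Recreate the "before" sample from post #3 as a tab-separated temp file.
# The temp file and variable names are illustrative only.
input=$(mktemp)
printf 'bob\tjon\tfred\tpete\n'    >  "$input"
printf 'fred\tjohn\tphil\tdave\n'  >> "$input"
printf 'mike\tphil\tjohn\tpete\n'  >> "$input"
printf 'fred\tjack\tbob\tseabro\n' >> "$input"

# sort -k4 : order lines by field 4 so equal values become adjacent
# uniq -f3 : ignore the first three fields when comparing adjacent lines
# uniq -u  : print only lines that have no adjacent duplicate
unique_rows=$(LC_ALL=C sort -k4 "$input" | uniq -f3 -u)

printf '%s\n' "$unique_rows"
rm -f "$input"
```

The two "pete" rows end up next to each other after sorting, so uniq discards both and only the "dave" and "seabro" rows remain.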
 
3 members found this post helpful.
05-31-2012, 11:02 AM   #6
seabro
LQ Newbie
 
Registered: Jan 2010
Posts: 24

Original Poster
Rep: Reputation: 0
Thanks Guys, that worked a treat!
 
  



Tags
processing, text


All times are GMT -5.
