LinuxQuestions.org
Old 01-06-2011, 07:46 PM   #1
jrmorrill
LQ Newbie
 
Registered: Jan 2011
Posts: 2

Removing lines from a file that are in another file


I have a large text file containing over 180k lines and another text file containing about 1k. I would like to remove lines in the 180k-line file that exist in the 1k-line file. I thought there was a simple way to do this but I haven't come across it yet. Any advice? Thanks
 
Old 01-06-2011, 10:42 PM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Can we get some examples of the text involved? If the lines to be matched are identical, then it should be easy. But there may be more work involved if you need to make partial matches, or if there's a chance of oddball characters or the like.

Here's a grep one-liner that should work in the simplest case. You'll have to redirect the output into a new file, as grep doesn't have in-place editing.
Code:
grep -v -x -F -f smallfile.txt largefile.txt > newfile.txt
-v invert match (print non-matching lines)
-x match whole lines only (otherwise "bbb" would also remove a line such as "abbbc")
-F fixed strings (for efficiency and to avoid regex problems)
-f read the match strings from the named file, one per line
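
Whether grep compares whole lines or substrings matters for this job. Here is a minimal sketch on made-up sample data (the word "abbbc" and the scratch directory are assumptions for illustration, not from the original question) that runs the command with and without -x:

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: "bbb" also occurs inside the longer word "abbbc".
printf '%s\n' aaa bbb abbbc ccc ddd > largefile.txt
printf '%s\n' bbb ccc > smallfile.txt

# Without -x a pattern may match anywhere in the line, so "abbbc" is dropped too.
substring_result=$(grep -v -F -f smallfile.txt largefile.txt)

# With -x only lines that equal a pattern exactly are dropped.
wholeline_result=$(grep -v -x -F -f smallfile.txt largefile.txt)

printf 'substring match:\n%s\n\nwhole-line match:\n%s\n' \
    "$substring_result" "$wholeline_result"
```

If the lines really are single words that never contain one another, both forms should give the same result; -x only matters once one entry can appear inside a longer line.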

Last edited by David the H.; 01-06-2011 at 10:44 PM. Reason: minor addition
 
Old 01-06-2011, 10:49 PM   #3
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

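The "comm" command can do this for two sorted lists.
Code:
comm -13 <(sort file1) <(sort file2)
This returns the lines that are unique to file2, so file1 should be the small file of lines to remove and file2 the large file. The results are of course sorted, and the <( ) process substitution needs a shell such as bash.
If the files are already sorted you could use:
Code:
comm -13 file1 file2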
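
For shells without process substitution, the same idea can be sketched with intermediate sorted files. This assumes the file names the original poster gives later in the thread (file_1.txt is the large file, file_2.txt the list of lines to remove); the ".sorted" names and the scratch directory are made up for the example:

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' ddd aaa ccc bbb > file_1.txt   # large file, deliberately unsorted
printf '%s\n' ccc bbb > file_2.txt           # lines to remove

# comm needs both inputs sorted in the same collation order.
LC_ALL=C sort file_1.txt > file_1.sorted
LC_ALL=C sort file_2.txt > file_2.sorted

# -1 hides lines found only in the first file (the removal list),
# -3 hides lines common to both, leaving lines found only in file_1.
LC_ALL=C comm -13 file_2.sorted file_1.sorted > output.txt
cat output.txt
```

Note that the original line order of file_1.txt is lost, so this approach fits best when order does not matter.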

I use this to compare two directory listings and produce a script to delete files from some devices.
 
Old 01-07-2011, 02:32 PM   #4
jrmorrill
LQ Newbie
 
Registered: Jan 2011
Posts: 2

Original Poster
Quote:
Can we get some examples of the text involved?
Sure, the lines have only one word and are all lower case characters, nothing special.

file_1.txt:

aaa
bbb
ccc
ddd

file_2.txt:

bbb
ccc

output.txt:

aaa
ddd


Thanks for the one-liners! I'll give them both a shot.
 
Old 01-07-2011, 04:56 PM   #5
trickykid
LQ Guru
 
Registered: Jan 2001
Posts: 24,149

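Behold the power of sed: there's no need to create a third file with the output from grep, or to worry about whether the files are already sorted.

First, though, back up file_1.txt (the larger file, from which you want to delete the lines that exist in file_2.txt), just in case.

Code:
while IFS= read -r PATTERN; do sed -i "/^${PATTERN}\$/d" file_1.txt; done < file_2.txt
Each line of file_2.txt is used as a regular expression, anchored with ^ and $ so that only whole-line matches are deleted. That is fine for plain lowercase words, but not for lines containing slashes or regex metacharacters. Note that -i is a GNU sed option.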
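
The loop above runs sed once per line of file_2.txt, which means roughly a thousand passes over a 180k-line file. A single-pass alternative, sketched here as one possible approach rather than the poster's method, is to let grep write to a temporary file and then move it over the original (the ".tmp" and ".bak" names are made up for the example):

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' aaa bbb ccc ddd > file_1.txt
printf '%s\n' bbb ccc > file_2.txt

cp file_1.txt file_1.txt.bak          # keep a backup first

# Single pass: write the surviving lines to a temporary file.
# grep exits 1 when it selects no lines (everything was removed), which is
# not an error here; only a status above 1 signals a real failure.
grep -v -x -F -f file_2.txt file_1.txt > file_1.txt.tmp
status=$?
if [ "$status" -le 1 ]; then
    mv file_1.txt.tmp file_1.txt
else
    rm -f file_1.txt.tmp
    echo "grep failed with status $status" >&2
fi
cat file_1.txt
```

Because -F treats every line as a fixed string, this variant also sidesteps the regex-metacharacter problem.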

Last edited by trickykid; 01-07-2011 at 04:58 PM.
 
Old 01-07-2011, 06:04 PM   #6
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

An awk solution:
Code:
awk '! _[$0]++ && FNR < NR' file_2.txt file_1.txt
but this assumes there are no repetitions of the same line in file_1.txt; otherwise the duplicates will be omitted. Both the order of the two conditions and the order of the two file arguments are mandatory.

Edit: to take repetitions into account, a simple modification is required:
Code:
awk '! _[$0]++ && FNR < NR {print; delete _[$0]}' file_2.txt file_1.txt
or
Code:
awk '! _[$0]++ && FNR < NR {print; _[$0]--}' file_2.txt file_1.txt
for an awk where the delete statement is not available. (Deleting a single array element is part of POSIX awk, so only very old implementations should lack it.)
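
For comparison, a more commonly seen two-file awk idiom does the same job and keeps repeated lines without any extra bookkeeping. This is a sketch, not the poster's code; the array name "skip" and the sample data are made up. One caveat: if file_2.txt is empty, NR == FNR also holds for file_1.txt and nothing is printed.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' aaa bbb aaa ccc ddd > file_1.txt   # "aaa" is repeated on purpose
printf '%s\n' bbb ccc > file_2.txt

# While reading the first file (NR == FNR), record each line as an array key.
# For the second file, print only the lines that were never recorded.
awk 'NR == FNR { skip[$0]; next } !($0 in skip)' file_2.txt file_1.txt > output.txt
cat output.txt
```

As with the versions above, the file holding the lines to remove has to come first on the command line.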

Last edited by colucix; 01-07-2011 at 06:10 PM.
 
  

