LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 05-29-2013, 08:39 PM   #1
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 25

Rep: Reputation: Disabled
Question: AWK: Remove lines matching entries from a supplied list of objects?


Does anyone here know how to remove any line that contains a match from a supplied list of objects? I have a list of 1 million lines.

I need to remove each line that contains a match from the supplied list; not merely the matching object itself, but the entire line.

I'd rather not do this with grep or sed. I've got scripts for grep and sed that do it, but they are so much slower than AWK; AWK seems to be lightning fast.

OK, so here's a sample of my dilemma:


targetfile.txt

Code:
.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.skibob.info
.skibone.com
.bilji.org
.skibrooklyn.bestdeals.at
.skicero.bilji.org
.skidajtange.com
.skidarmy.bestdeals.at
.skidgalleries.com
.skidgaygalleries.com
.skidka-ddddd90.bestdeals.at
.skidka-dvsem.lookin.at
.skidman.com
.skiefjeff.info
.bestdeals.at
2remove.txt

Code:
.bber.info
.bestdeals.at
.bestfor.ru
.bfrazeredit.com
.bilji.org
.bitat.com
.bitches-xui.info
.bitcomet.com
.bitreactor.to
.blackvalt.com
.blissdisplays.com

Last edited by CaptainDerp; 05-29-2013 at 08:40 PM.
 
Old 05-30-2013, 12:29 AM   #2
Ygrex
Member
 
Registered: Nov 2004
Location: Russia (St.Petersburg)
Distribution: Debian
Posts: 666

Rep: Reputation: 68
if the 2remove.txt file is huge:
Code:
$ cat targetfile.txt | awk '{ m=0 ; while ((getline row < "2remove.txt") == 1) { if (row == $0) { m=1 ; break } } ; close("2remove.txt") ; if (m == 0) { print $0 }}'
.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.skibob.info
.skibone.com
.skibrooklyn.bestdeals.at
.skicero.bilji.org
.skidajtange.com
.skidarmy.bestdeals.at
.skidgalleries.com
.skidgaygalleries.com
.skidka-ddddd90.bestdeals.at
.skidka-dvsem.lookin.at
.skidman.com
.skiefjeff.info
 
Old 05-30-2013, 02:11 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,245

Rep: Reputation: 2684
Well, I must say I am not sure how the awk would process much faster if the remove file is also large; however:
Code:
grep -vf 2remove.txt targetfile.txt

# or

awk 'NR==FNR{_[$0];next}{for(i in _)if($0 ~ i)next}1' 2remove.txt targetfile.txt
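If the entries in 2remove.txt should match whole lines exactly (as they do in the sample data), the per-line loop over every pattern can be replaced by a single array lookup. A sketch, using tiny stand-in files built from the thread's sample entries:

```shell
# Load every line of 2remove.txt as an array key (first file: NR==FNR),
# then print a target line only if it is not a key -- one O(1) lookup
# per line instead of a loop over all patterns.
printf '.skibble.info\n.bilji.org\n.skidman.com\n' > targetfile.txt
printf '.bilji.org\n.bestdeals.at\n' > 2remove.txt
awk 'NR==FNR { bad[$0]; next } !($0 in bad)' 2remove.txt targetfile.txt
```

This prints `.skibble.info` and `.skidman.com`. The trade-off: `$0 in bad` removes only exact whole-line matches, while the `$0 ~ i` loop above also removes lines that merely contain a pattern somewhere inside them.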
 
2 members found this post helpful.
Old 05-30-2013, 04:30 AM   #4
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 25

Original Poster
Rep: Reputation: Disabled
Yeah grail, that's what I wanted. Ygrex, that only deletes the matching object itself, which can be done several different ways (which I already have); I need the entire line containing a match to be deleted, as grail's example shows.

But it's still slow as hell. I guess I'm going to need to figure out a way to fork this, or maybe use GNU Parallel.
 
Old 05-30-2013, 05:48 AM   #5
Ygrex
Member
 
Registered: Nov 2004
Location: Russia (St.Petersburg)
Distribution: Debian
Posts: 666

Rep: Reputation: 68
the only sensible difference I see is == vs ~
 
Old 05-30-2013, 07:02 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,245

Rep: Reputation: 2684
Actually, a really big difference is that you are re-reading 2remove.txt for every input line, which I would think has the bigger impact.
 
1 members found this post helpful.
Old 05-30-2013, 07:04 AM   #7
Ygrex
Member
 
Registered: Nov 2004
Location: Russia (St.Petersburg)
Distribution: Debian
Posts: 666

Rep: Reputation: 68
it does not affect the results; that is what I meant by «sensible difference»
 
Old 05-30-2013, 08:13 AM   #8
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 25

Original Poster
Rep: Reputation: Disabled
Any suggestions for a better, more efficient way to do this? I mean, this works fine with smaller files.

But the target file has millions of lines, and the 2remove file has thousands, sometimes tens of thousands.

Thanks for your time. I really, really appreciate it.
 
Old 05-30-2013, 08:51 AM   #9
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Quote:
Originally Posted by CaptainDerp View Post
Any suggestions for a better more efficient way to do this? I mean this works fine with smaller files.

But the target file has millions of lines, and the 2remove file is thousands, sometimes 10s of thousands.
You might have reached a point where the size of the files and the resources available are starting to become problematic.

Running in parallel won't solve anything (the resources are still the same); it might even make things slower.

Looking at the answers given, I would suggest using grail's awk solution (post #3). Practical experience has shown me that a well-written awk script is pretty efficient. Just be patient and wait for the results.

Not sure if this applies to you, but do be careful when running this on a production server. It will definitely have an impact and might even slow things down to a point where things become unusable/unresponsive.

You might want to run this job at a slow time (night?) and/or on a dedicated, non-production server.
 
Old 05-30-2013, 09:03 AM   #10
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 25

Original Poster
Rep: Reputation: Disabled
good advice, but

Good words of wisdom; unfortunately these lists must be updated frequently, and waiting days for them to complete is simply not an option (although at the moment it seems to be the only option).

However, I am considering simply dividing up the work: chopping the target file into pieces and distributing them to workstations to speed up the process. While crude and primitive, that method leads me to believe GNU Parallel could be leveraged to achieve the same thing in a more elegant fashion.

I just gotta figure out the commands.
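The chop-it-into-pieces idea can be sketched with plain coreutils and background jobs, no GNU Parallel needed. File names and the two-way split are illustrative, and `split -n l/2` is a GNU extension:

```shell
# Split the target into line-aligned chunks, filter each chunk in a
# background job with fixed-string grep (-F skips regex compilation,
# much faster for literal patterns), then reassemble in order.
printf '.skibble.info\n.bilji.org\n.skidman.com\n.bestdeals.at\n' > targetfile.txt
printf '.bilji.org\n.bestdeals.at\n' > 2remove.txt
split -n l/2 targetfile.txt chunk.         # 2 chunks, lines kept whole (GNU split)
for c in chunk.??; do
  grep -vFf 2remove.txt "$c" > "$c.keep" & # -v invert match, -F fixed strings
done
wait                                       # let all background filters finish
cat chunk.??.keep > cleaned.txt            # glob sorts, so line order is preserved
```

GNU Parallel's --pipepart option automates this kind of chunked splitting over one big input file; the sketch above just does it by hand.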
 
Old 05-30-2013, 09:09 AM   #11
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,602

Rep: Reputation: 1241
Use perl.

First, read the to-remove file into a hash (disk-based, to allow for really large lists).

Then, for each record, check the hash for a match (much faster than pattern matching) and only output the record if it doesn't exist in the hash.

Perl hash files are fast, and when disk-based, the frequently hit entries end up cached in memory.
 
1 members found this post helpful.
Old 05-30-2013, 09:34 AM   #12
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 25

Original Poster
Rep: Reputation: Disabled
Thumbs up

Quote:
Originally Posted by jpollard View Post
Use perl.

First, read the to-remove file into a hash (disk-based, to allow for really large lists).

Then, for each record, check the hash for a match (much faster than pattern matching) and only output the record if it doesn't exist in the hash.

Perl hash files are fast, and when disk-based, the frequently hit entries end up cached in memory.
Now there is a fresh idea (as far as my brain is concerned). Thanks!

I think I'm going to go down that yellow Perl/brick road. I know first-hand how advantageous rainbow tables can be for cracking passwords, and this sounds familiar in that regard. A-googling I go! I'm going to close this thread as solved, though, and go bug the piss out of some Perl monks. Thanks again, everyone!
 
Old 05-30-2013, 09:39 AM   #13
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,602

Rep: Reputation: 1241
It is the same technique.

The rainbow files tend to be rather large (56-bit keys plus salt runs around 2 GB). The larger the key range, the larger the file; the point of a large keyspace is to make the tables so large that storing them becomes impractical.

Of course, having rainbow files for only the most common passwords/phrases would reduce that size, but it also introduces the likelihood of missing a password.
 
1 members found this post helpful.
Tags
awk

