[SOLVED] AWK: Remove lines matching a supplied list of objects?

CaptainDerp · 05-29-2013, 07:39 PM

Does anyone here know how to remove any line that contains a match, from a supplied list of objects? I have a list of 1 million lines.

I need to remove each line that contains a match from a supplied list. I do not want to remove just the matching objects, but the entire line.

I did not want to do this with grep or sed. I've got scripts for grep and sed that do it, but they are so much slower than AWK. AWK seems to be lightning fast.

OK, so here's a sample of my dilemma:

targetfile.txt

Code:

.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.skibob.info
.skibone.com
.bilji.org
.skibrooklyn.bestdeals.at
.skicero.bilji.org
.skidajtange.com
.skidarmy.bestdeals.at
.skidgalleries.com
.skidgaygalleries.com
.skidka-ddddd90.bestdeals.at
.skidka-dvsem.lookin.at
.skidman.com
.skiefjeff.info
.bestdeals.at

2remove.txt

Code:

.bber.info
.bestdeals.at
.bestfor.ru
.bfrazeredit.com
.bilji.org
.bitat.com
.bitches-xui.info
.bitcomet.com
.bitreactor.to
.blackvalt.com
.blissdisplays.com

Ygrex · 05-29-2013, 11:29 PM

If the 2remove.txt file is huge:

Code:

$ cat targetfile.txt | awk '{ m=0 ; while ((getline row < "2remove.txt") == 1) { if (row == $0) { m=1 ; break } } ; close("2remove.txt") ; if (m == 0) { print $0 }}'
.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.skibob.info
.skibone.com
.skibrooklyn.bestdeals.at
.skicero.bilji.org
.skidajtange.com
.skidarmy.bestdeals.at
.skidgalleries.com
.skidgaygalleries.com
.skidka-ddddd90.bestdeals.at
.skidka-dvsem.lookin.at
.skidman.com
.skiefjeff.info

grail · 05-30-2013, 01:11 AM

Well, I must say I am not sure how awk would be much faster if the remove file is also large. However:

Code:

grep -vf 2remove.txt targetfile.txt

# or

awk 'NR==FNR{_[$0];next}{for(i in _)if($0 ~ i)next}1' 2remove.txt targetfile.txt
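One caveat with both commands above: the entries from 2remove.txt are treated as regular expressions, so every "." matches any character, and regex matching against thousands of patterns is a lot of work. Since the list holds plain strings, fixed-string matching is likely a better fit and is often much faster. A minimal sketch, assuming a cut-down copy of the sample data written to a temporary directory (the kept_grep.txt and kept_awk.txt names are made up for the illustration):

```shell
#!/bin/sh
# Sketch only: a small sample stands in for the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > targetfile.txt <<'EOF'
.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.bilji.org
.skibrooklyn.bestdeals.at
.skidman.com
.bestdeals.at
EOF

cat > 2remove.txt <<'EOF'
.bber.info
.bestdeals.at
.bilji.org
EOF

# -F: patterns are fixed strings, -v: print non-matching lines,
# -f: read the patterns from a file
grep -vFf 2remove.txt targetfile.txt > kept_grep.txt

# Same idea in awk: index() is a plain substring search, no regex involved.
# Empty list entries are skipped because they would match every line.
awk 'NR==FNR { if ($0 != "") pat[$0]; next }
     { for (p in pat) if (index($0, p)) next; print }' \
    2remove.txt targetfile.txt > kept_awk.txt

cat kept_grep.txt
```

The awk version still loops over the whole list for every target line, so with tens of thousands of entries the grep -F form is the one more likely to scale.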

CaptainDerp · 05-30-2013, 03:30 AM

Yeah grail, that's what I wanted. Ygrex, that only deletes the matching object itself, which can be done in several different ways that I already have. I need the entire line containing a match to be deleted, as grail's example shows.

But it's still slow as hell. I guess I'm gonna need to figure out a way to fork this, or maybe use GNU Parallel.

Ygrex · 05-30-2013, 04:48 AM

the only sensible difference I see is == vs ~

grail · 05-30-2013, 06:02 AM

Actually, a really big difference is that you are re-reading 2remove.txt for every line of the target file, which I would think has a bigger impact.

Ygrex · 05-30-2013, 06:04 AM

it does not affect results, that is what I mean as «sensible difference»

CaptainDerp · 05-30-2013, 07:13 AM

Any suggestions for a better, more efficient way to do this? I mean, this works fine with smaller files.

But the target file has millions of lines, and the 2remove file has thousands, sometimes tens of thousands.

Thanks for your time. I really, really appreciate it.

druuna · 05-30-2013, 07:51 AM

Quote:

Originally Posted by CaptainDerp

Any suggestions for a better, more efficient way to do this? I mean, this works fine with smaller files.

But the target file has millions of lines, and the 2remove file has thousands, sometimes tens of thousands.

You might have reached a point where the size of the files and the resources available are starting to become problematic.

Running in parallel won't solve anything (resources are still the same); it might even make things slower.

Looking at the answers given, I would suggest using grail's awk solution (post #3). Practical experience has shown me that a well-written awk script is pretty efficient. Just be patient and wait for the results.

Not sure if this applies to you, but do be careful when running this on a production server. It will definitely have an impact and might even slow things down to a point where things become unusable/unresponsive.

You might want to run this job at a slow time (night?) and/or on a dedicated, non-production server.

CaptainDerp · 05-30-2013, 08:03 AM

Good words of wisdom. Unfortunately, these lists must be updated frequently, and waiting days for them to complete is simply not an option, although at the moment it would seem to be the only option.

However, I am considering simply dividing up the work: chopping up the target file into pieces and distributing them to workstations to speed up the process. While that is a crude and primitive method, it supports my belief that GNU Parallel can be leveraged to achieve this in a more elegant fashion.

I just gotta figure out the commands.
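For reference, a rough sketch of the "chop it up" idea on a single machine, using only split and shell background jobs rather than GNU Parallel. The chunk size of 3 lines and the chunk. / filtered. file names are made up for the illustration, and the sample data is a cut-down copy of the files from the first post. Whether this gains anything depends on having idle CPU cores, as druuna points out.

```shell
#!/bin/sh
# Sketch only: a small sample stands in for the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > targetfile.txt <<'EOF'
.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.bilji.org
.skibrooklyn.bestdeals.at
.skidman.com
.bestdeals.at
EOF

cat > 2remove.txt <<'EOF'
.bber.info
.bestdeals.at
.bilji.org
EOF

# Cut the target into pieces of 3 lines each: chunk.aa, chunk.ab, ...
# (a real run would use far larger pieces)
split -l 3 targetfile.txt chunk.

# Filter every piece in the background, then wait for all jobs to finish
for piece in chunk.*; do
    grep -vFf 2remove.txt "$piece" > "filtered.$piece" &
done
wait

# The glob expands in sorted order, so the original line order is kept
cat filtered.chunk.* > cleaned.txt
cat cleaned.txt
```

This starts one job per piece all at once, so the number of pieces should stay close to the number of cores available.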

jpollard · 05-30-2013, 08:09 AM

Use perl.

Now read the to-remove file into a hash (disk based to allow for really large lists).

Then for each record, check the to-remove file for a hash match (much faster than pattern matching) and only output if the record doesn't exist.

Perl hash files are fast, and when they are disk based, the frequently accessed entries are cached in memory.
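The same hash-lookup idea can be sketched in awk instead of Perl, with an in-memory array rather than a disk-based hash. This is only an illustration, and it rests on an assumption that goes beyond the original question: every list entry is a whole domain suffix starting at a dot, as in the sample data. Under that assumption, each target line needs only a few exact lookups (one per suffix) instead of a pass over the whole list. The kept_hash.txt name and the cut-down sample files are made up for the example.

```shell
#!/bin/sh
# Sketch only: a small sample stands in for the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > targetfile.txt <<'EOF'
.ski-acrobatique.net
.skiangpa.bilji.org
.skibble.info
.bilji.org
.skibrooklyn.bestdeals.at
.skidman.com
.bestdeals.at
EOF

cat > 2remove.txt <<'EOF'
.bber.info
.bestdeals.at
.bilji.org
EOF

awk 'NR==FNR { if ($0 != "") remove[$0]; next }   # load the list once
     {
         s = $0
         while (s != "") {
             if (s in remove) next        # exact hash lookup, no pattern matching
             rest = substr(s, 2)          # skip the first character of this suffix
             dot = index(rest, ".")       # find the next dot
             if (dot == 0) break          # no shorter suffix left to try
             s = substr(rest, dot)        # e.g. .a.b.org becomes .b.org
         }
         print
     }' 2remove.txt targetfile.txt > kept_hash.txt

cat kept_hash.txt
```

Note that this is stricter than substring matching: a list entry that appears in the middle of a line, rather than at its end, does not cause that line to be removed.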

CaptainDerp · 05-30-2013, 08:34 AM

Quote:

Originally Posted by jpollard

Use perl.

Now read the to-remove file into a hash (disk based to allow for really large lists).

Then for each record, check the to-remove file for a hash match (much faster than pattern matching) and only output if the record doesn't exist.

Perl hash files are fast, and when they are disk based, the frequently accessed entries are cached in memory.

Now there is a fresh idea (as far as my brain is concerned). Thanks!

I think I'm gonna go down that yellow perl/brick road. I know first hand how advantageous rainbow tables can be for cracking passwords, and this sounds familiar in that regard. A-googling I go! I'm gonna close this thread as solved though, and go bug the piss out of some Perl monks. Thanks again, everyone!

jpollard · 05-30-2013, 08:39 AM

It is the same technique.

The rainbow files tend to be rather large (56-bit keys + salt is around 2 GB in size). The larger the key range, the larger the file. The goal is to have a file so large that it is impractical to store.

Of course, having rainbow files for only the most common passwords/phrases would reduce that size, but it also introduces the likelihood of missing a password.