[SOLVED] AWK: Remove Lines matching a supplied List of Objects?
Does anyone here know how to remove any line that contains a match from a supplied list of objects? I have a file of a million lines.
I need to remove each line that contains a match from the supplied list; I don't need to merely remove the matching objects themselves, but the entire line.
I didn't want to do this with grep or sed. I've got scripts for grep and sed that do it, but they are so much slower than AWK. AWK seems to be lightning fast.
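For reference, the kind of awk filter the thread settles on looks something like the sketch below. It assumes one object per line in the removal list and uses the poster's file names (2remove, target); it is the general pattern being discussed, not necessarily the exact code from grail's post #3.
Code:
# filter.awk -- drop every target line that contains any object
# from the removal list (substring match anywhere on the line)
NR == FNR { remove[$0]; next }   # first file: load objects as array keys
{
    for (obj in remove)
        if (index($0, obj))      # object found somewhere on the line
            next                 # skip (delete) the whole line
    print                        # no object matched: keep the line
}
# usage: awk -f filter.awk 2remove target > cleaned
Note the nested scan: every line is tested against every object, which is why run time grows with the size of the removal list; the hash idea later in the thread attacks exactly that cost.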
Yeah grail, that's what I wanted. Ygrex, that only deletes the matching object itself, which can be done several different ways and which I already have; I need the entire line containing a match to be deleted, as grail's example shows.
But it's still slow as hell. I guess I'm gonna need to figure out a way to fork this, or maybe use GNU Parallel.
Any suggestions for a better, more efficient way to do this? I mean, this works fine with smaller files.
But the target file has millions of lines, and the 2remove file is thousands of lines, sometimes tens of thousands.
You might have reached a point where the size of the files and the resources available are starting to become problematic.
Running in parallel won't solve anything (the resources are still the same); it might even make things slower.
Looking at the answers given, I would suggest using grail's awk solution (post #3). Practical experience has shown me that a well-written awk script is pretty efficient. Just be patient and wait for the results.
Not sure if this applies to you, but do be careful when running this on a production server. It will definitely have an impact and might even slow things down to the point where things become unusable/unresponsive.
You might want to run this job at a slow time (night?) and/or on a dedicated, non-production server.
Good words of wisdom. Unfortunately these lists must be updated frequently, and waiting days for them to complete is simply not an option; well, at the moment it seems to be the only option.
However, I am considering simply dividing up the work: chopping the target file into pieces and distributing them to workstations to speed up the process. While that is a crude and primitive method, it reinforces my belief that GNU Parallel can be leveraged to achieve the same thing in a more elegant fashion.
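A rough shape for that divide-and-filter idea, assuming GNU split and GNU Parallel are installed and reusing the hypothetical filter.awk and file names from the sketch above (l/8 splits by lines, so no record is cut in half):
Code:
# split the target into 8 line-based chunks, filter each chunk in
# parallel, then reassemble the surviving lines in order
split -n l/8 target chunk_
parallel 'awk -f filter.awk 2remove {} > {}.out' ::: chunk_*
cat chunk_*.out > cleaned
rm chunk_*
As the previous post warns, on a single machine this only helps if the job is CPU-bound rather than disk-bound; to see a real win from splitting, the chunks would need to go to separate machines.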
Read the to-remove file into a hash (disk-based, to allow for really large lists).
Then, for each record of the big file, check the hash for a match (much faster than pattern matching) and only output the record if its key doesn't exist in the hash.
Perl hash files are fast, and when disk-based, the entries that are hit frequently end up cached in memory.
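A minimal Perl sketch of that idea. It assumes each record's lookup key is the whole line (a hash lookup is an exact match, not a substring search, so this only replaces the awk scan when the objects line up with a well-defined key in each record); DB_File supplies the disk-backed hash, and all file names are examples.
Code:
#!/usr/bin/perl
# filter.pl -- drop records whose key appears in a disk-backed hash
use strict;
use warnings;
use Fcntl;      # O_CREAT, O_RDWR
use DB_File;

# build the disk-based hash from the removal list
tie my %remove, 'DB_File', 'remove.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "tie remove.db: $!";

open my $list, '<', '2remove' or die "2remove: $!";
while (my $obj = <$list>) {
    chomp $obj;
    $remove{$obj} = 1;           # load the removal list once
}
close $list;

# stream the big file: one O(1) lookup per record, no pattern scan
open my $in, '<', 'target' or die "target: $!";
while (my $line = <$in>) {
    chomp(my $key = $line);
    print $line unless exists $remove{$key};
}
close $in;
untie %remove;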
Now there is a fresh idea (as far as my brain is concerned). Thanks!
I think I'm gonna go down that yellow Perl/brick road. I know first-hand how advantageous cracking passwords can be using rainbow tables, and this sounds familiar in that regard. A-googling I go! I'm going to close this thread as solved and go bug the piss out of some Perl monks. Thanks again, everyone!
Rainbow files tend to be rather large (a table for 56-bit keys plus salt is around 2 GB). The larger the key range, the larger the file; the point of salting and large key spaces is to make the table so large that it is impractical to store.
Of course, having rainbow files for only the most common passwords/phrases would reduce that size, but it also introduces the likelihood of missing a password.