I am looking at writing a script that will search the hard drive for certain sequences, to identify files containing sensitive data that need to be scrubbed.
I started with the line below but it just runs on and on. I am not sure what I am doing wrong.
Any guidance would be appreciated.
Quote:
sudo find . -type f -exec grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}' {} \;
I don't see any need to use find here. grep -R can recurse directories.
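As a rough illustration of the grep -R suggestion, here is a minimal sketch that runs the same pattern over a throwaway directory tree. The directory layout and file names are made up for the example, and the -l and -I options (list only file names, skip binary files) are my additions, assuming GNU grep.

```shell
#!/bin/sh
# Build a small throwaway tree so the example is safe to run anywhere.
# (Hypothetical file names, for illustration only.)
demo=$(mktemp -d)
mkdir -p "$demo/docs/old"
printf 'name: test user\nssn: 123-45-6789\n' > "$demo/docs/old/record.txt"
printf 'nothing sensitive here\n' > "$demo/docs/notes.txt"

# Same pattern as in the original post (basic regular expression syntax).
pattern='[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'

# -R recurses into directories, -l prints only the names of matching files,
# -I treats binary files as non-matching.
matches=$(grep -RlI "$pattern" "$demo")
printf '%s\n' "$matches"

# Clean up the throwaway tree.
rm -rf "$demo"
```

Printing file names rather than matching lines also keeps the output short enough to review by hand.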
I think the repetition operator you are trying to use doesn't work the way you think it does. My lazy fix is to invoke egrep which I believe works the way you want it to. It would be great if someone more experienced with the grep family can expand on this a bit.
You could also explicitly state your character classes the appropriate number of times and leave out the repetition operators.
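For what it's worth, a quick check suggests the three spellings discussed here pick out the same text, at least with GNU grep: backslashed braces in a basic regular expression, plain braces with grep -E (which is what egrep runs), and the spelled-out character classes. If so, the long run time is more likely down to the amount of data being scanned than to the pattern itself. A minimal sketch with a made-up sample line; -o is a GNU/BSD extension that prints only the matched text.

```shell
#!/bin/sh
# Made-up sample line for illustration.
sample='id 123-45-6789 on file'

# Three ways to write "3 digits, dash, 2 digits, dash, 4 digits".
bre='[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'      # basic regex: braces are backslashed
ere='[0-9]{3}-[0-9]{2}-[0-9]{4}'            # extended regex: plain braces
explicit='[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'  # no repetition at all

out_bre=$(printf '%s\n' "$sample" | grep -o "$bre")
out_ere=$(printf '%s\n' "$sample" | grep -oE "$ere")
out_explicit=$(printf '%s\n' "$sample" | grep -o "$explicit")

printf 'BRE:      %s\n' "$out_bre"
printf 'ERE:      %s\n' "$out_ere"
printf 'explicit: %s\n' "$out_explicit"
```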
Any suggestions on how to speed this up? Once I have tested this, I was thinking of placing it in a script with additional common sequences (e.g. credit card numbers), then scheduling it in cron to generate a list of files I should review / scrub.
Thoughts?
Running on:
Asus 100OE netbook (Atom processor with 2 cores, 2 GB RAM)
Don't invoke any more processes than you absolutely must to get the job done. One example is eliminating the unneeded find command you were using.
Make full use of the capabilities of each program you invoke. grep -R consumes fewer resources than a pipeline with another command. Wasteful constructs to avoid include things like cat somefile | grep something and grep something < somefile.
Whenever possible, use smaller/faster programs to get the job done. For example, cut can be much faster than awk; if cut will do the job, then use it. The same applies to the grep, egrep, fgrep family; each is optimized to perform better under various conditions.
Consider making a C program instead of a Bash script. Languages which compile to native code can be many times faster than shell scripts.
Consider upgrading your computer hardware. More RAM, a faster processor, and a faster hard disk will improve performance system wide.
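To put a number on the first tip: the find command in the original post ends its -exec with \; which starts one new process per file. A minimal sketch of the difference, using a throwaway directory with five made-up files and a tiny sh -c 'echo run' stand-in instead of grep, so that each printed line represents one process that find had to start.

```shell
#!/bin/sh
# Throwaway directory with five small files (made-up contents).
demo=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf 'ssn: 123-45-678%s\n' "$i" > "$demo/file$i.txt"
done

# -exec ... {} \;  starts the command once for every file found.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | grep -c run)

# -exec ... {} +   passes many file names to a single invocation.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | grep -c run)

printf 'one command per file: %s processes\n' "$per_file"
printf 'batched with plus:    %s process(es)\n' "$batched"

# Clean up the throwaway directory.
rm -rf "$demo"
```

On a whole disk the per-file form means one grep start-up for every file, which adds up quickly on a netbook. grep -R avoids the extra find process altogether.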
Quote:
Once I have tested this, I was thinking of placing it in a script with additional common sequences (e.g. credit card numbers), then scheduling it in cron to generate a list of files I should review / scrub.
You can use the alternation operator | (pipe character) to separate multiple regular expressions. Keep in mind the order of precedence when mixing operators.
Quote:
Originally Posted by man grep
Precedence
Repetition takes precedence over concatenation, which in turn takes
precedence over alternation. A whole expression may be enclosed in
parentheses to override these precedence rules and form a
subexpression.
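A minimal sketch of alternation and the precedence rule quoted above, assuming grep -E (extended syntax, where | needs no backslash). The card pattern is a deliberately simplified, hypothetical format (four dash-separated groups of four digits), not a real card validator, and the sample lines are made up.

```shell
#!/bin/sh
# Two patterns joined with alternation. Parentheses keep each one intact.
ssn='[0-9]{3}-[0-9]{2}-[0-9]{4}'
card='[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}'   # assumption: simplified card format
combined="($ssn)|($card)"

# Made-up sample data: only the first two lines should be reported.
hits=$(printf '%s\n' 'ssn 123-45-6789' 'card 1111-2222-3333-4444' 'phone 555-1234' \
    | grep -E "$combined")
printf '%s\n' "$hits"

# Precedence: concatenation binds tighter than alternation, so
# 'ab|cd' means "ab" or "cd", while 'a(b|c)d' means "abd" or "acd".
# -x requires the whole line to match.
loose=$(printf '%s\n' ab cd abd acd | grep -xE 'ab|cd')
grouped=$(printf '%s\n' ab cd abd acd | grep -xE 'a(b|c)d')
printf 'ab|cd matches:\n%s\n' "$loose"
printf 'a(b|c)d matches:\n%s\n' "$grouped"
```

For the cron part, a script built around a pattern like this could be run from a crontab entry such as 0 2 * * 0 /usr/local/bin/scan-sensitive.sh (a hypothetical path and schedule), with its output redirected to a file to review.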
The problem is that there could be a lot more protected personal data on there, as well as SSN data that is not in your format. Depending on the apps or file formats involved, the numbers could be almost anywhere.
I'd wipe the drive.
I thought of those things too, but in the spirit of being helpful I decided to go along with OP's premise anyway.