LinuxQuestions.org

jv2112 04-01-2011 06:14 PM

Social Security # Search
 
I am looking at writing a script that will search the hard drive for certain sequences, to identify files containing sensitive data that should be scrubbed.

I started with the line below but it just runs on and on. I am not sure what I am doing wrong. :(

Any guidance would be appreciated.:hattip:


Quote:

sudo find . -type f -exec grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}' {} \;


Telengard 04-01-2011 07:01 PM

I don't see any need to use find here. grep -R can recurse directories.

I don't think the repetition operator itself is the problem. In plain grep (basic regular expressions) an interval is written \{3\}, which is what you have. If you invoke egrep (extended regular expressions) instead, the same interval is written {3}, without the backslashes. It would be great if someone more experienced with the grep family could expand on this a bit.

You could also explicitly state your character classes the appropriate number of times and leave out the repetition operators.
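
A quick way to compare the three spellings is to run them side by side on a few test lines. This is only a sketch, assuming GNU grep; the sample text is made up and 123-45-6789 is not a real number.

```shell
#!/bin/sh
# Made-up sample data for illustration only.
sample='id 123-45-6789 ok
phone 555-1234
date 2011-04-01'

# Basic regular expression (plain grep): interval braces are backslash-escaped.
bre=$(printf '%s\n' "$sample" | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')

# Extended regular expression (grep -E, same as egrep): braces are bare.
ere=$(printf '%s\n' "$sample" | grep -E '[0-9]{3}-[0-9]{2}-[0-9]{4}')

# No repetition operator at all: spell out each character class.
explicit=$(printf '%s\n' "$sample" | grep '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')

printf '%s\n' "$bre" "$ere" "$explicit"
```

If all three print the same line, the pattern is fine and the long run time comes from how much data is being searched.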

slimm609 04-01-2011 10:59 PM

grep -R '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}' *

This could take a very long time to run depending on what the system specs are.

jv2112 04-02-2011 04:27 AM

Thanks for all the replies :hattip:


Any suggestions on how to speed it up? Once I have tested this, I was thinking of placing it in a script with additional common sequences (e.g. credit card numbers), then scheduling it in cron to generate a list of files I should review / scrub.


Thoughts :study:


Running on ->

Netbook Asus 100OE ( Atom processor(2 cores) 2 gig ram)

Desktop Custom build ( AMD 1090T (6 Cores) / 4 Gig RAM / Agility SSD Drive + 2 GBTS SATA drives)

Telengard 04-02-2011 01:08 PM

Quote:

Originally Posted by jv2112 (Post 4311409)
Any suggestions on how to speed it up?

Yes.
  • Don't invoke any more processes than you absolutely must to get the job done. One example is eliminating the unneeded find command you were using.
  • Make full use of the capabilities of each program you invoke. grep -R uses fewer resources than a pipeline with another command. A wasteful construct to avoid is cat somefile | grep something, which starts an extra process; grep something somefile does the same job.
  • Whenever possible, use smaller/faster programs to get the job done. For example, cut can be much faster than awk; if cut will do the job then use it. Same applies to the grep, egrep, fgrep family; each is optimized to perform better under various conditions.
  • Consider making a C program instead of a Bash script. Languages which compile to native code can be many times faster than shell scripts.
  • Consider upgrading your computer hardware. More RAM, a faster processor, and a faster hard disk will improve performance system wide.
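
The first two points can be sketched roughly as follows. This assumes GNU grep and find; the directory layout, file names and numbers are hypothetical test data, not anything from your system.

```shell
#!/bin/sh
# Hypothetical test tree so the commands have something to search.
dir=$(mktemp -d)
mkdir "$dir/docs" "$dir/cache"
printf 'name: x\nssn: 123-45-6789\n' > "$dir/docs/a.txt"
printf 'nothing here\n' > "$dir/docs/b.txt"
printf '987-65-4321\n' > "$dir/cache/c.txt"

pattern='[0-9]{3}-[0-9]{2}-[0-9]{4}'

# One grep process for the whole tree:
#   -r  recurse into directories
#   -l  print only file names and stop reading a file at its first match
#   -I  skip binary files
#   --exclude-dir  skip directories you do not care about
# LC_ALL=C avoids multibyte locale handling, which often makes grep faster.
hits=$(LC_ALL=C grep -rlIE --exclude-dir=cache "$pattern" "$dir" | sort)

# If find is needed for its other tests, end -exec with + so that grep
# receives many files per invocation instead of one process per file.
hits_find=$(LC_ALL=C find "$dir" -type f -name '*.txt' \
    -exec grep -lIE "$pattern" {} + | sort)

printf 'grep -r:\n%s\n' "$hits"
printf 'find + grep:\n%s\n' "$hits_find"
rm -rf "$dir"
```

On a real disk it would probably also help to start from the directories that hold user data rather than from /, but that depends on where the files actually live.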

Quote:

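Once I have tested this, I was thinking of placing it in a script with additional common sequences (e.g. credit card numbers), then scheduling it in cron to generate a list of files I should review / scrub.

You can use the alternation operator to separate multiple regular expressions. In plain grep it is written \| (a GNU extension to basic regular expressions), as in the example further down; with egrep it is a bare | (pipe character). Keep in mind the order of precedence when mixing operators.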

Quote:

Originally Posted by man grep
Precedence
Repetition takes precedence over concatenation, which in turn takes
precedence over alternation. A whole expression may be enclosed in
parentheses to override these precedence rules and form a
subexpression.

Code:

foo$ echo -e 'feel\nfoal\ntool\nteal\n' | grep 'ee\|oo'
feel
tool
foo$
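
Applied to the sequences being discussed, the same idea might look like the sketch below. It assumes GNU grep; the credit card pattern is only a rough, hypothetical shape (four groups of four digits), and the sample lines are made up.

```shell
#!/bin/sh
# Made-up sample lines; none of these are real numbers.
sample='ssn 123-45-6789
card 4111-1111-1111-1111
phone 555-1234'

ssn='[0-9]{3}-[0-9]{2}-[0-9]{4}'
# Rough, hypothetical credit card shape: four groups of four digits,
# separated by a dash or a space.
card='[0-9]{4}([- ][0-9]{4}){3}'

# Alternation has the lowest precedence, so each side of | is a whole pattern.
alt=$(printf '%s\n' "$sample" | grep -E "$ssn|$card")

# Equivalent: give each pattern its own -e option.
multi=$(printf '%s\n' "$sample" | grep -E -e "$ssn" -e "$card")

printf '%s\n' "$alt"
```

Using one -e per pattern avoids having to think about precedence at all, and new sequences can be added without touching the existing ones.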

HTH

Edit
My knowledge of the grep family is far from complete. It would be nice if someone with more knowledge would add more here.

jefro 04-02-2011 02:31 PM

The problem is that there could be a lot more protected personal data on there, and also SSN data that may not be in your format. Depending on the apps or file formats, the numbers could be almost anywhere.

I'd wipe the drive.

Telengard 04-02-2011 03:36 PM

Quote:

Originally Posted by jefro (Post 4311719)
The problem is that there could be a lot more protected personal data on there, and also SSN data that may not be in your format. Depending on the apps or file formats, the numbers could be almost anywhere.

I'd wipe the drive.

I thought of those things too, but in the spirit of being helpful I decided to go along with OP's premise anyway.

On the other hand, there is always dban.

jv2112 04-02-2011 04:02 PM

Thanks for all the input.


All times are GMT -5.