[SOLVED] Pattern match/replace over multiple lines

jlinkels · 08-09-2018, 07:48 AM

My client has 20 GB of e-mails, hundreds or thousands of them containing credit card numbers which their customers sent to them during the past few years. My client wants to get rid of the credit card numbers because it is a security risk in case the mail system is compromised. And after this cleaning up action I should regularly run a script to keep everything clean.

The e-mails should be preserved in case the agents want to look back in the e-mails to see the customer history, like prices, offers, discounts, etc.

Currently I have an awk script which is called on each mail file. The script looks for viable number sequences (like 1234 5678 9012 2344 or 1234567890123456) and does a credit card checksum check. If it is a valid credit card number, 12 digits are replaced by asterisks.

This works for the plain text part of the e-mails. But I have overlooked that in the HTML part of the mails the number does not appear on one single line. It looks like this:

Code:

style=3D"font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&=
quot;;color:black">Credit card number: 1234 5678 9012 =
3456</span><o:p></o:p></p></td></tr><tr style=3D"height:13.1pt"><td =
width=3D"515" nowrap=3D"" valign=3D"bottom" =

I think I cannot make any assumptions about how many digits are on one line and how many on the next.

Further notes to the solution:

Replacing the number with asterisks would be nice, but wiping out the number entirely is OK
The HTML formatting does not have to be perfectly preserved. Wiping out the complete table row is OK
The rest of the contents must remain readable
It cannot be assumed the card number is always preceded by "Credit card" or any other text
In the complete collection of mails, mail size can be very large. I have seen attachments of 32 MB, stored as a UUEncoded part of the mail file.
Usually the credit card number is somewhere in the plain text and recognized as such. I am not sure that knowledge can be used. Neither do I know if the plain text part always precedes the HTML
I don't speak Perl, but I am not against using it
File processing is one at a time. A Bash script takes care of calling the credit card wipe-out script with the file name as parameter.
File processing can be in-place or the output can be written to another file

I would be happy with something which can find a viable number by regexp. But it should be taken into account that for each viable number a checksum check (Luhn's check) must be performed. That seems to rule out sed and require a real scripting language.
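
For reference, the Luhn check itself is short in any scripting language. A minimal Python sketch (the function name `luhn_valid` is just an illustration, not the code from my awk script):

```python
def luhn_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # True  (a widely used test number)
print(luhn_valid("1234567890123456"))  # False (the dummy number used in this post)
```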

jlinkels

pan64 · 08-09-2018, 08:03 AM

I suggest you prepare a sample text (with modified, invalid data) and/or use an online regexp tester to test your regexp.
You can hardly parse html/xml with sed/grep; you ought to use perl/python or something similar.
What you specified is not really enough to catch credit card numbers, so it is hard to give any solution.

jlinkels · 08-09-2018, 09:07 AM

I am not parsing HTML or XML. I just want to be able to recognize any viable number anywhere in the file. There are no constraints on where the number appears or between which tags it appears. Even in a large block of UUEncoded data I would find tens or hundreds of number sequences. Most will not pass the checksum test.

These are the regexps I use in awk for numbers separated by spaces or hyphens. Once I have found a number I do the checksum check.

Code:

    # Try the longest pattern first: a shorter pattern also matches the
    # start of a longer number, so the longer ones would never be reached.
    nstart=match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{3}/ );   # 19 digits
    if (nstart == 0) {
    nstart=match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{4}/ );   # 16 digits
    }
    if (nstart == 0) {
    nstart=match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{3}/ );   # 15 digits
    }
    if (nstart == 0) {
    nstart=match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{2}/ );   # 14 digits
    }
    if (nstart != 0) {
       # The candidate is substr($0, nstart, RLENGTH); check its checksum here
    }

The only problem I have is that when the number is part of the HTML code, it can contain a line break, preceded by a '=' character. And I cannot make assumptions about where the line break appears.

jlinkels

syg00 · 08-09-2018, 09:07 AM

There are a bunch of grep derivatives that handle multi-line - go look for one that suits. Typically PCRE is the answer - and again grep will do the job with the "-P" flag. But you can't do anything in any "language" until you can construct the regex.
For small files (say less than 80% of RAM), simply slurp it into memory, and handle it immediately.

syg00 · 08-09-2018, 09:12 AM

Concurrent posts crossed "in flight" ...

awk is stream orientated - the code doesn't see the newline. The only way I know of is to fudge EOR with a null record - I've been known to insert a null line so awk can do the job, but it is a kludge.
perl is starting to look the best bet.

jlinkels · 08-09-2018, 11:01 AM

Quote:

Originally Posted by syg00

awk is stream orientated - the code doesn't see the newline. The only way I know of is to fudge EOR with a null record - I've been known to insert a null line so awk can do the job, but it is a kludge.
perl is starting to look the best bet.

Yes, it never occurred to me that the HTML part had line breaks in parts which are displayed on the same line.

And I wondered why an e-mail client still showed the full number. Let me start by checking what a hex dump shows as the exact EOL sequence and take it from there.
I am also wondering if the HTML display is damaged when I change or delete the EOL marks.

jlinkels

pan64 · 08-09-2018, 11:22 AM

In awk (for example) you can try setting the record separator to everything except digits, spaces and whatever else you want to allow.
After that you can remove anything but digits from the "line" (including space, =, newline) and do a simple check (whether there are enough digits ...). Print the results and pipe them into a python/perl/java/whatever script where you can make a stricter check.
If that works you can reimplement this awk in a single perl/python script, do the check inside the script and also make the replacement if required.
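
The same idea can be tried out in a few lines of Python before committing to anything. This is only a rough sketch of the approach described above; the function name `candidate_digit_runs` and the 14 to 19 digit limits are assumptions for illustration:

```python
import re


def candidate_digit_runs(data, min_len=14, max_len=19):
    """Split on everything except digits, space, '=', newline and '-',
    then keep the chunks that contain a plausible number of digits."""
    chunks = re.split(r"[^0-9 =\n-]+", data)
    runs = []
    for chunk in chunks:
        digits = re.sub(r"\D", "", chunk)  # drop space, '=', newline, '-'
        if min_len <= len(digits) <= max_len:
            runs.append(digits)
    return runs


sample = 'color:black">Credit card number: 1234 5678 9012 =\n3456</span><o:p>'
print(candidate_digit_runs(sample))  # ['1234567890123456']
```

Each string in the result would still have to go through the stricter check.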

Obviously you can start immediately in perl, but if you are not familiar with it probably better to start with something else.

Turbocapitalist · 08-09-2018, 02:42 PM

Given the irregularity, I'd guess that perl would be the easiest way to convert from MIME quoted-printable to HTML and then use an XPath parser to extract the data from the HTML. That would probably mean the modules MIME::QuotedPrint and HTML::TreeBuilder::XPath. For the latter, without a larger excerpt, I'd have to guess that the XPath would be something like this: //tr/td/p/span[contains(text(),"Credit")]
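
If Python turns out to be more comfortable than perl, its standard library module `quopri` covers the same ground as MIME::QuotedPrint. A minimal sketch of just the decoding step (no XPath part), using a made-up fragment modelled on the excerpt in the first post:

```python
import quopri

# Hypothetical fragment of a quoted-printable HTML part; "=3D" encodes "="
# and the "=" at the end of the first line is a soft line break.
raw = (
    b'<span style=3D"color:black">Credit card number: 1234 5678 9012 =\n'
    b'3456</span>'
)

decoded = quopri.decodestring(raw).decode("ascii")
print(decoded)
```

After decoding, the number sits on one line again. Note that this only makes the number findable; writing a cleaned message back would need the part to be re-encoded, which this sketch does not attempt.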

jlinkels · 08-10-2018, 04:54 PM

I wrote a python script which reads all data into one string, newlines included. Then I searched for viable number sequences:

Code:

re.finditer('[0-9 =\n]{14,25}', data)

This matches anything from 14 consecutive digits (the shortest credit card number) up to 19 digits with 4 spaces in between plus a "=\n" sequence, 25 characters in total. The latter is a soft line break in MIME quoted-printable.
The regexp is not perfect, but it has to be this loose because the "=\n" sequence can occur anywhere in the number.
After the matches are found, all non-digit characters are deleted from each match and the remaining digits are tested against the Luhn test.
Writing all data back to a file, with the digits of each matched and validated sequence replaced by "*", is trivial.
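
Roughly, the steps above look like this. It is a simplified sketch rather than the actual script: the names `scrub`, `mask_match` and `luhn_valid` are made up for this post, and keeping the last four digits visible is an assumption (my awk script replaced 12 digits).

```python
import re

CANDIDATE = re.compile(r"[0-9 =\n]{14,25}")  # the pattern shown above


def luhn_valid(digits):
    """Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def mask_match(m):
    """Mask a candidate if it is a valid number, else return it unchanged."""
    text = m.group(0)
    digits = re.sub(r"\D", "", text)
    if not (14 <= len(digits) <= 19 and luhn_valid(digits)):
        return text
    to_mask = len(digits) - 4  # assumption: leave the last four digits
    out = []
    for ch in text:
        if ch.isdigit() and to_mask > 0:
            out.append("*")
            to_mask -= 1
        else:
            out.append(ch)  # spaces and "=\n" stay where they were
    return "".join(out)


def scrub(data):
    return CANDIDATE.sub(mask_match, data)


print(repr(scrub("Credit card number: 4111 1111 1111 =\n1111</span>")))
```

Because only digits are replaced, the soft line break stays in place and the quoted-printable structure of the mail is not disturbed. A number with other digits right next to it (a phone number, an expiry date) ends up in one match that fails the checksum, which is one of the imperfections of the regexp.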

jlinkels

scasey · 08-10-2018, 05:59 PM

Quote:

Originally Posted by jlinkels

Yes, it never occurred to me that the HTML part had line breaks in parts which are displayed on the same line.

And I wondered why an e-mail client still showed the full number. Let me start by checking what a hex dump shows as the exact EOL sequence and take it from there.
I am also wondering if the HTML display is damaged when I change or delete the EOL marks.

jlinkels

HTML (well, more accurately, the browser) doesn't see line breaks at all, just as it doesn't see more than one consecutive space. So breaking up a number onto two or more lines doesn't change how it's displayed.

I see you've found a way...

jlinkels · 08-11-2018, 06:13 AM

Quote:

Originally Posted by scasey

HTML (well, more accurately, the browser) doesn't see line breaks at all, just as it doesn't see more than one consecutive space. So breaking up a number onto two or more lines doesn't change how it's displayed.

MIME quoted-printable format has a maximum line length of 76 characters. Longer lines in the raw e-mail file are wrapped with a soft line break: a "=" character at the end of the line, followed by the line break. That makes pattern matching using grep and awk difficult or impossible.
I don't care about display, HTML, XML or whatever, I needed to find patterns in a file.
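
To see the wrapping in action, Python's standard `quopri` module can encode a long line. A small sketch with a made-up line, padded so that the number lands around column 76:

```python
import quopri

# Hypothetical long line: 45 filler characters push the number towards
# the 76-character limit.
line = b"x" * 45 + b" Credit card number: 1234 5678 9012 3456 end"

encoded = quopri.encodestring(line)
for part in encoded.split(b"\n"):
    print(len(part), part)

# Decoding removes the soft line break and restores the original line.
restored = quopri.decodestring(encoded)
```

With this padding the soft line break should fall inside the number, so a line-by-line search for the number in the encoded text comes up empty even though the decoded text contains it.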

jlinkels

syg00 · 08-11-2018, 06:35 AM

Quote:

Originally Posted by jlinkels

That makes pattern matching using grep and awk difficult or impossible.

??? ... hmmm. For you maybe.

jlinkels · 08-11-2018, 06:50 AM

Quote:

Originally Posted by syg00

awk is stream orientated - the code doesn't see the newline. The only way I know of is to fudge EOR with a null record - I've been known to insert a null line so awk can do the job, but it is a kludge.
perl is starting to look the best bet.

It seems that is what you said as well.