LinuxQuestions.org
Old 08-09-2018, 07:48 AM   #1
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1043
Pattern match/replace over multiple lines


My client has 20 GB of e-mails, hundreds or thousands of them containing credit card numbers which their customers sent to them during the past few years. My client wants to get rid of the credit card numbers because it is a security risk in case the mail system is compromised. And after this cleaning up action I should regularly run a script to keep everything clean.

The e-mails should be preserved in case the agents want to look back in the e-mails to see the customer history, like prices, offers, discounts, etc.

Currently I have an awk script which is called on each mail file. The script looks for viable number sequences (like 1234 5678 9012 2344 or 1234567890123456) and does a credit card checksum check. If it is a valid credit card number, 12 digits are replaced by asterisks.

This works for the plain text part of the e-mails. But I have overlooked that in the HTML part of the mails the number does not appear on one single line. It looks like this:
Code:
style=3D"font-size:10.0pt;font-family:"Arial","sans-serif&=
quot;;color:black">Credit card number: 1234 5678 9012 =
3456</span><o:p></o:p></p></td></tr><tr style=3D"height:13.1pt"><td =
width=3D"515" nowrap=3D"" valign=3D"bottom" =
I think I cannot make any assumptions about how many digits are on one line and how many on the next.

Further notes to the solution:
  • Replacing the number with asterisks would be nice, but wiping out the number entirely is OK
  • The HTML formatting does not have to be perfectly preserved. Wiping out the complete table row is OK
  • The rest of the contents must remain readable
  • It cannot be assumed the card number is always preceded by "Credit card" or any other text
  • Across the complete collection, individual mails can be very large. I have seen attachments of 32 MB, stored as a UUEncoded part of the mail file.
  • Usually the credit card number is somewhere in the plain text and recognized as such. I am not sure that knowledge can be used. Neither do I know if the plain text part always precedes the HTML
  • I don't speak Perl, but I am not against using it
  • File processing is one at a time. A Bash script takes care of calling the credit card wipe-out script with the file name as parameter.
  • File processing can be in-place or the output can be written to another file
I would be happy with something which can find a viable number by regexp. But it should be taken into account that for each viable number a checksum check (Luhn's check) must be performed. That seems to rule out sed and require a real script language.
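
For reference, the Luhn check mentioned above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not the awk code actually in use; the function name `luhn_ok` is made up for the example.

```python
def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as spaces or hyphens; keep ASCII digits only.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit and
    # subtract 9 when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

The widely published test number `4111 1111 1111 1111` passes this check, while the placeholder `1234 5678 9012 3456` from the excerpt above does not.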

jlinkels
 
Old 08-09-2018, 08:03 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,803

Rep: Reputation: 7306
I suggest you prepare a sample text (with modified, invalid data) and/or use an online regexp tester to test your regexp.
You can hardly parse HTML/XML with sed/grep; you ought to use perl/python or something similar.
What you specified is not really enough to catch credit card numbers, so it is hard to give a solution.
 
Old 08-09-2018, 09:07 AM   #3
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Original Poster
Rep: Reputation: 1043

I am not parsing HTML or XML. I just want to be able to recognize any viable number anywhere in the file. There are no constraints on where the numbers appear or between which tags they appear. Even in a large block of UUEncoded data I would find tens or hundreds of number sequences. Most will not pass the checksum test.

These are the regexps I use in awk for numbers separated by spaces or hyphens. Once I have found a number I do the checksum check.
Code:
    # Try the longest layout first so that RSTART/RLENGTH cover the whole number.
    # Groups may be separated by a space or a hyphen.
    nstart = match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{3}/);
    if (nstart == 0) {
        nstart = match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{4}/);
    }
    if (nstart == 0) {
        nstart = match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{3}/);
    }
    if (nstart == 0) {
        nstart = match($0, /[0-9]{4}[ -][0-9]{4}[ -][0-9]{4}[ -][0-9]{2}/);
    }
    if (nstart != 0) {
        # Luhn checksum check on substr($0, RSTART, RLENGTH) goes here
    }
The only problem I have is that when the number is part of the HTML code, it can contain a line break, preceded by a '=' character. And I cannot make assumptions about where in the number the line break appears.

jlinkels
 
Old 08-09-2018, 09:07 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
There are a bunch of grep derivatives that handle multi-line - go look for one that suits. Typically PCRE is the answer - and again grep will do the job with the "-P" flag. But you can't do anything in any "language" until you can construct the regex.
For small files (say less than 80% of RAM), simply slurp it into memory, and handle it immediately.
 
Old 08-09-2018, 09:12 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Concurrent posts crossed "in flight" ...
awk is stream orientated - the code doesn't see the newline. The only way I know of is to fudge EOR with a null record - I've been known to insert a null line so awk can do the job, but it is a kludge.
perl is starting to look the best bet.
 
Old 08-09-2018, 11:01 AM   #6
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Original Poster
Rep: Reputation: 1043
Quote:
Originally Posted by syg00
awk is stream orientated - the code doesn't see the newline. The only way I know of is to fudge EOR with a null record - I've been known to insert a null line so awk can do the job, but it is a kludge.
perl is starting to look the best bet.
Yes, it never occurred to me that the HTML part had line breaks in parts which are displayed on the same line, and I wondered why an e-mail client still showed the full number. Let me start by checking in a hex dump exactly what the EOL sequence is, and take it from there.
I am also wondering if the HTML display is damaged when I change or delete the EOL marks.

jlinkels
 
Old 08-09-2018, 11:22 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,803

Rep: Reputation: 7306
With awk (for example) you can try setting the record separator to everything except digits, spaces and whatever else you want to keep.
After that you can remove anything but digits from each "line" (including space, =, newline) and do a simple check (whether there are enough digits ...). Print the results and pipe them into a python/perl/java/whatever script where you can make a stricter check.
If that works you can reimplement this awk step in a single perl/python script, do the check inside it and also make the replacement if required.

Obviously you can start in perl right away, but if you are not familiar with it, it is probably better to start with something else.
 
Old 08-09-2018, 02:42 PM   #8
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,294
Blog Entries: 3

Rep: Reputation: 3719
Given the irregularity, I'd guess that perl would be the easiest way to convert from MIME quoted-printable to HTML and then use an XPath parser to extract the data from the HTML. That would probably mean the modules MIME::QuotedPrint and HTML::TreeBuilder::XPath. For the latter, without a larger excerpt, I'd have to guess that the XPath would be something like this: //tr/td/p/span[contains(text(),"Credit")]
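
For readers who prefer Python over Perl, the standard library's `quopri` module does a comparable quoted-printable decoding job to MIME::QuotedPrint. The following is only a sketch of the decoding step; the `raw` fragment is invented to resemble the excerpt in post #1 and does not contain a real card number.

```python
import quopri

# Made-up fragment shaped like the excerpt in post #1 (not a real card number).
raw = (
    b'color:black">Credit card number: 1234 5678 9012 =\n'
    b'3456</span><td =\n'
    b'width=3D"515">'
)

# decodestring() removes soft line breaks ("=" at end of line) and turns
# escapes such as =3D back into the byte they stand for.
decoded = quopri.decodestring(raw).decode("ascii")
print(decoded)
```

After decoding, the digits sit on one line again, so a single-line pattern can find them. If the result has to be written back into the mail file, the part would need to be re-encoded, which this sketch does not attempt.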
 
Old 08-10-2018, 04:54 PM   #9
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Original Poster
Rep: Reputation: 1043
I wrote a python script which reads all data in a string, so including newlines etc. Then I searched for viable number sequences:
Code:
re.finditer('[0-9 =\n]{14,25}', data)
This matches runs from 14 consecutive digits (the shortest credit card number) up to 19 digits with 4 spaces in between plus one "=\n" sequence (25 characters in total). The latter is a soft line break in MIME quoted-printable.
It is not a perfect regexp, but the problem is that the "=\n" sequence can occur anywhere in the string.
After the matches are found, all non-digit characters are deleted from the string and each string is tested against the Luhn test.
Writing all data to a file, replacing the matched and tested sequences with "*" instead of digits is trivial.
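
The steps described above might look roughly like this when put together. It is a sketch under assumptions, not the actual script: the names `CANDIDATE`, `luhn_ok`, `mask_match` and `scrub` are invented here, and masking the first 12 digits follows what post #1 says the awk version did.

```python
import re

# Pattern quoted in the post: digits, spaces, "=" and newlines, 14 to 25 long.
CANDIDATE = re.compile(r"[0-9 =\n]{14,25}")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum over a string consisting of ASCII digits only."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

def mask_match(m):
    """Replace the first 12 digits of a valid number, keep everything else."""
    text = m.group(0)
    digits = "".join(ch for ch in text if ch in "0123456789")
    if not 14 <= len(digits) <= 19 or not luhn_ok(digits):
        return text             # not a plausible card number: leave untouched
    out, masked = [], 0
    for ch in text:
        if ch in "0123456789" and masked < 12:
            out.append("*")
            masked += 1
        else:
            out.append(ch)      # spaces and "=\n" stay where they were
    return "".join(out)

def scrub(data: str) -> str:
    """Return `data` with Luhn-valid candidate numbers masked."""
    return CANDIDATE.sub(mask_match, data)
```

Because only digits are overwritten, the soft line breaks and the line lengths of the raw mail stay as they were. One known weakness of this sketch: a candidate run longer than 25 characters is cut into pieces by the quantifier, so a number inside such a run can be missed.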

jlinkels
 
Old 08-10-2018, 05:59 PM   #10
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,725

Rep: Reputation: 2211
Quote:
Originally Posted by jlinkels
Yes, it never occurred to me that the HTML part had line breaks in parts which are displayed on the same line, and I wondered why an e-mail client still showed the full number. Let me start by checking in a hex dump exactly what the EOL sequence is, and take it from there.
I am also wondering if the HTML display is damaged when I change or delete the EOL marks.

jlinkels
HTML (well, more accurately, the browser) doesn't see line breaks at all, just as it doesn't see more than one consecutive space. So breaking up a number onto two or more lines doesn't change how it's displayed.

I see you've found a way...
 
Old 08-11-2018, 06:13 AM   #11
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Original Poster
Rep: Reputation: 1043
Quote:
Originally Posted by scasey
HTML (well, more accurately, the browser) doesn't see line breaks at all, just as it doesn't see more than one consecutive space. So breaking up a number onto two or more lines doesn't change how it's displayed.
MIME quoted-printable has a maximum line length of 76 characters. Longer lines in the raw e-mail file are wrapped at or before column 76 with a soft line break, that is, a line break preceded by a "=" character. That makes pattern matching using grep and awk difficult or impossible.
I don't care about display, HTML, XML or whatever, I needed to find patterns in a file.
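
To illustrate why the soft line break is the crux, here is one possible Python pattern that tolerates a "=" plus line break between any two digits. It is a sketch only: `SOFT_BREAK`, `CARD_LIKE` and `candidates` are invented names, and allowing CRLF as well as LF line ends is an assumption about the mail files.

```python
import re

# An optional quoted-printable soft line break: "=" followed by LF or CRLF.
SOFT_BREAK = r"(?:=\r?\n)?"

# 14 to 19 digits; between two digits there may be one space or hyphen,
# with an optional soft line break on either side of it.
CARD_LIKE = re.compile(
    r"[0-9](?:" + SOFT_BREAK + r"[ -]?" + SOFT_BREAK + r"[0-9]){13,18}"
)

def candidates(raw: str):
    """Yield (start, end, digits) for each card-like run in raw mail text."""
    for m in CARD_LIKE.finditer(raw):
        digits = re.sub(r"[^0-9]", "", m.group(0))
        yield m.start(), m.end(), digits
```

Each candidate still has to pass the Luhn test before anything is replaced. Since the quantifier is greedy, a 16-digit number directly followed by further digits is picked up as one longer run and will usually fail the checksum, so this does not catch every case either.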

jlinkels
 
Old 08-11-2018, 06:35 AM   #12
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Quote:
Originally Posted by jlinkels
That makes pattern matching using grep and awk difficult or impossible.
??? ... hmmm. For you maybe.
 
Old 08-11-2018, 06:50 AM   #13
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Original Poster
Rep: Reputation: 1043
Quote:
Originally Posted by syg00
awk is stream orientated - the code doesn't see the newline. The only way I know of is to fudge EOR with a null record - I've been known to insert a null line so awk can do the job, but it is a kludge.
perl is starting to look the best bet.
It seems that is what you said as well
 
  

