[Nokogiri/Ruby] Extract captions and URLs from hyperlinks
2 Attachment(s)
Good morning.
I seek your advice to solve a programming problem with Nokogiri in a Ruby program. This will be tricky, not in code but in language. I may be unable to express my simple question in plain English and ask you to have patience; maybe I have to clarify things in a follow-up, below. So do not hesitate to demand details if this is not clear at first... (sigh).

Background. My mail user agent is Mutt (you knew it). And thus, those mails I get, mostly from offices of all kinds of administration, NEVER conform to any standard. I write “thus” because Mutt gives me a chance to notice.

Problem. This does not always work as it should.

Example. ... Wait a moment. This mail does not convey any private data. I attach it below. A good example which may spare me more words. The file is in eml format and contains all parts, meaning that you can compare the HTML version to the plaintext version.

Log-output. In the debug log I find part of the answer:

Edit: This shows that the URL is chopped into pieces, the first time around the sequence

This lets me assume a bug in Nokogiri.

Code. The code which produces the above log-output is quite straightforward. First I create, somewhere, a node-list with all the hyperlinks:

That's it. The utility is called “maillinks” and I appear to have published a gem version in the past. As the version number is the same as that of my local gem, the available gem should be identical to the program I used above...

Question. What shall I do? |
Forgot, and because it is so nicely formatted:
Code:
user@machine:~$ nokogiri --version |
Okay, I got confused, because my routine failed only “sometimes”.
Reason: What I called a chopped URL, above, is actually already cut into pieces in the original mail! This has nothing to do with Quoted-Printable or Nokogiri. I must find a trailing '=' right before a line-break inside the URL (href value), then eliminate it before joining the current string with the following line; and all before I let Nokogiri parse any HTML-mail. Who has invented this... |
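The joining step described above could be sketched roughly like this. It is only an illustration, not the code from maillinks: the helper name `join_soft_breaks` and the sample string are made up, and the sketch only removes a '=' that sits directly before a line break. Other encoded sequences such as `=3D` are left untouched.

```ruby
# Hypothetical helper (not part of maillinks): remove every '=' that sits
# directly before a line ending (CRLF or LF) and join the two lines.
def join_soft_breaks(raw)
  raw.gsub(/=\r?\n/, '')
end

# Made-up sample: a long href that the sender folded in the middle of the URL.
RAW_SAMPLE = "<a href=3D\"https://example.org/very/long/pa=\r\nth?id=3D42\">Read more</a>\r\n"

# The URL is in one piece again; '=3D' is still encoded at this point.
JOINED_SAMPLE = join_soft_breaks(RAW_SAMPLE)
```

The joined string would then be what gets handed to the HTML parser, instead of the raw body.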
Or simpler, still:
Code:
gem install mail
Code:
require 'mail'
Code:
user@machine~$ ri Mail::Encodings::QuotedPrintable.decode
Edit: Hallelujah. Now even the HTML body of the original mail is formatted correctly... and looks much nicer than I am ready to admit. |
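For comparison, here is a minimal sketch of the same kind of decoding with nothing but Ruby's standard library, using the `M` (quoted-printable) directive of `String#unpack1`. This is not the mail gem's code. The helper name `decode_qp` and the sample body are made up, and UTF-8 is an assumption; in a real mail the charset comes from the Content-Type header of the part.

```ruby
# Hypothetical helper: decode a quoted-printable body with the stdlib only.
# unpack1('M') drops soft line breaks ('=' before a line ending) and turns
# '=XX' hex sequences back into bytes. The result is binary, so it is
# relabelled here as UTF-8 (assumption, see the Content-Type header).
def decode_qp(body)
  body.unpack1('M').force_encoding(Encoding::UTF_8)
end

# Made-up sample: folded href plus two encoded non-ASCII characters.
QP_SAMPLE = "<a href=3D\"https://example.org/very/long/pa=\r\nth?id=3D42\">Gr=C3=BC=C3=9Fe</a>\r\n"

# After decoding, the href is complete and '=3D' has become '='.
DECODED_SAMPLE = decode_qp(QP_SAMPLE)
```

The decoded string, rather than the raw body, is what an HTML parser should see.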
I see this often in "raw" email...that is, unprocessed by any MUA. I have always presumed the "=" at the end of each line was somehow the result of MIME en/decoding. Here's a snippet of a "multi-part message in MIME format"...a spam message I'll be reporting to the sending ISP.
Code:
Content-Type: text/html; charset=utf-8
When I've had occasion to (manually) tweak this, I've just replaced a = at end of line with nothing and deleted the EOL:
Code:
sed 's/=$//'
Maybe you already know all that...just wanted to share that it's not a URL specific thing. |
Quote:
... Now I receive “Email” which is basically empty, with just one link to an attached PNG containing all of the message, or just the result of some designer playing with her/his tools. I would not object to a PDF containing a hand-written letter and have done that myself already. But you should be warned in the message body. The Internet is very sick. |