[SOLVED] [Nokogiri/Ruby] Extract captions and URLs from hyperlinks
Good morning.
I seek your advice to solve a programming problem with Nokogiri in a Ruby-program.
This will be tricky, not in code but in language. I may be unable to express my simple question in plain English and ask you to have patience; maybe I have to clarify things in a followup, below. So do not hesitate to demand details, if this is not clear, at first... (sigh).
Background.
My mail user agent is Mutt (you knew it). And thus, the mails I get, mostly from administrative offices of all kinds, NEVER conform to any standard. I write “thus” because Mutt gives me a chance to notice.
One of those problems with HTML-only mail I solved with a script that extracts hyperlinks and lists them below the mail text, together with the actual caption of each link in the mail. I detect hyperlinks where I had not even suspected them before. The script is called from a Mutt macro, so I do not have to leave the mail client.
Problem.
This does not always work as it should.
Example.
... Wait a moment. This mail does not convey any private data, so I attach it below. It is a good example which may spare me more words. The file is in eml format and contains all parts, meaning that you can compare the HTML version to the plaintext version.
If a mail has no plaintext part, all I see is its HTML version, and in Mutt none of the links would be visible.
When I let my script munge this very mail (its HTML part), I get a list of defective links like this: Screen Shot.
Log-output.
In the debug-log I find part of the answer:
The link is badly handled. What was a URL like
http://eye.sbc09.com/m2?r=pTE1NDA5xBAUdld90NDQjEhD0J7Qsevh0NNwZ9DfxBDQzNCIPEbQzgfQqEbQkNCg0Jb_0NZV0KnQxrxtaWNoYWVsLnVwbGF3c2tpQHVwbGF3c2tpLmV1oJLEEFMIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQgbWFyYcOuY2hhZ2XEEAns0KzQsdDeSE9O0LTQgfln0KlL0NsIr0FORUZBIE5PUk1BTkRJRQ==
with the caption « Afficher le message dans mon navigateur » has become
Code:
a is <a href="3D%22http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=" hd0j7qsevh0nnwz9dfxbdqzncipebqzgfqqebqkncg0jb_0nzv0knqxrxtawnoywvslnvwbgf3c="2tpQHVwbGF3c2tpLmV1oJLEEFMIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQg=" bwfyycouy2hhz2xeeans0kzqsddese9o0ltqgfln0kll0nsir0foruzbie5puk1btkrjrq="3D=" target='3D"_blank"' style='3D"text-decoration:' none color:>Aff=
icher le message dans mon navigateur</a>
MailLinks: DEBUG 8-41-54: href is: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=
MailLinks: DEBUG 8-41-54: link is ex href.content: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=
Edit:
This shows that the URL is chopped into pieces, the first time around the sequence “NDQjEhD0J”, where it becomes “NDQjE=”. The remainder appears in the a-node after the href attribute is already closed, as if it were an additional attribute...
This makes me suspect a bug in Nokogiri.
Code
The code which produces the above log output is quite straightforward. First I create, somewhere, a node list with all the hyperlinks:
Code:
a_nodes = html_mail.xpath('.//a')
A few lines further down, a new HTML-structure is created, which will be displayed in w3m, in the end:
Code:
a_nodes.each_with_index do |a, i|
  @log.debug('a is ' << a.to_s)
  # Visible text of the link as shown in the mail
  caption = a.inner_text.to_s.strip
  href = a.attribute('href')
  # Skip anchors without a href attribute
  if href
    @log.debug('href is: ' << href.to_s)
    link = href.content
    @log.debug('link is ex href.content: ' << link.to_s)
    # Mark the link in the text with its number, then list caption and URL
    a.add_next_sibling("<span>[" << (i.next).to_s << "]</span>")
    dl.add_child("<dt>%i) %s</dt>" % [i.next, caption])
    dl.add_child("<dd style='white-space:nowrap;'><a href='%s'>%s</a></dd>" % [link, link])
  end
end
That's it. The utility is called “maillinks” and I appear to have published a gem-version in the past. As the version number is the same as my local gem, the available gem should be identical to the program I used above...
Question
What shall I do?
Last edited by Michael Uplawski; 04-03-2018 at 07:49 AM.
Okay, I got confused, because my routine failed only “sometimes”.
Reason: What I called a chopped URL, above, is actually already cut into pieces in the original mail! This has nothing to do with Quoted-Printable or Nokogiri.
I must find a trailing '=' right before a line break inside the URL (href value), then eliminate it and join the current line with the following one; all of this before I let Nokogiri parse any HTML mail.
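The repair described above could be sketched roughly as follows. This is only a minimal sketch in plain Ruby: the method name `unfold_soft_breaks` is made up for illustration, the sample string is shortened from the URL in the first post, and the code only joins the lines; it does not decode `=3D` and similar sequences.

```ruby
# Hypothetical helper (not part of maillinks): join lines that were split by
# a trailing '=' directly before a line break.
def unfold_soft_breaks(text)
  # Drop the '=' together with the following LF or CRLF, so that both halves
  # end up on one line again.
  text.gsub(/=\r?\n/, '')
end

# Sample shortened from the URL quoted in the first post (illustrative only).
raw = "<a href=3D\"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=\nhD0J7Qsevh0NN\">Aff=\nicher</a>"

puts unfold_soft_breaks(raw)
```

The joined string would then be handed to the HTML parser instead of the raw text.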
user@machine~$ ri Mail::Encodings::QuotedPrintable.decode
(from gem mail-2.7.0)
=== Implementation from QuotedPrintable
------------------------------------------------------------------------
decode(str)
------------------------------------------------------------------------
Decode the string from Quoted-Printable. Cope with hard line breaks that
were incorrectly encoded as hex instead of literal CRLF.
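For comparison, core Ruby can decode Quoted-Printable on its own through the `M` directive of `String#unpack1`. The following is a sketch of that stand-in, not the Mail gem's implementation. The method name `decode_qp` and the sample input are invented, and UTF-8 is assumed as the charset of the mail part.

```ruby
# Hypothetical helper: decode a Quoted-Printable body with core Ruby only.
def decode_qp(body)
  # The 'M' directive removes soft line breaks ('=' before LF or CRLF) and
  # turns =XX escapes back into bytes. The result is a binary string.
  decoded = body.unpack1('M')
  # Assumption: the MIME part declares charset=utf-8.
  decoded.force_encoding(Encoding::UTF_8)
end

# Invented sample with two soft line breaks and a few =XX escapes.
sample = "<a href=3D\"http://example.org/m2?r=3Dabc=\r\ndef\">Aff=\r\nicher, caf=C3=A9</a>"

puts decode_qp(sample)
```

Unlike a plain join of the broken lines, this also restores `=3D` to `=`, so the href value reaches the parser in its original form.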
Here is the html-version of the man-page for maillinks.
Edit: Hallelujah. Now even the HTML body of the original mail is formatted correctly... and looks much nicer than I am ready to admit.
Last edited by Michael Uplawski; 04-04-2018 at 12:44 AM.
Reason: confusing the confusion will not make it go away
I see this often in "raw" email...that is, unprocessed by any MUA. I have always presumed the "=" at the end of each line was somehow the result of MIME en/decoding. Here's a snippet of a "multi-part message in MIME format"...a spam message I'll be reporting to the sending ISP.
Code:
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
<div style=3D"background:#328EDF;border: 1px solid #ABC8E2; padding:45px 0 =
0; max-width:700px;font-family:helvetica; line-height:26px;">
<table cellspacing=3D"0" cellpadding=3D"0" style=3D"margin:0 auto;width=
: 100%;line-height:initial;">
<tr background=3D"" style=3D"text-align:center;">
<td style=3D"vertical-align:top;">
</td>
</tr>
<tr>
<td style=3D"text-align:left;font-size:15px;">
<div style=3D"background:#ffffff;padding: 14px 77px; margin:0 auto; col=
or:#333;line-height: 28px;min-height: 150px;">
=09
<div>Hi <span style=3D"font-weight: bold;">Mailer-Daemon,</span></div>H=
i,Wished to check if you would be interested in achieving Six Sigma Gr=
een Belt (SSGB) Training and Certification in 3 Days at your loca=
tion.<span><span>Batch 1: April 09th To April 11th 2018<br>=
[snip]
Note that the trailing = appears to be a function of line length, or something. It's not specific to URLs, but I can see how it's messing up your process.
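The line-length dependence can be reproduced with core Ruby. Quoted-Printable (RFC 2045) limits encoded lines to 76 characters, so an encoder breaks longer lines and marks each break with a trailing `=`. The sketch below uses Ruby's `Array#pack('M')`; its exact wrap column is an implementation detail and may differ from what a given mailer does, and the sample text is made up.

```ruby
# Made-up sample: one long line of HTML without any line break.
long_line = '<div style="padding:45px 0 0;max-width:700px;line-height:26px;">' * 3

# 'M' encodes as Quoted-Printable: '=' becomes =3D and long lines are split
# with a soft line break, i.e. a trailing '=' followed by a newline.
encoded = [long_line].pack('M')
puts encoded

# Decoding removes the soft breaks again and restores the original line.
restored = encoded.unpack1('M')
puts restored == long_line
```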
When I've had occasion to (manually) tweak this, I've just replaced a = at end of line with nothing and deleted the EOL:
Code:
sed 's/=$//'
(but I don't know how to program the join to the next line...sorry)
Maybe you already know all that...just wanted to share that it's not a URL specific thing.
Quote:
just wanted to share that it's not a URL specific thing.
Yes, it finally became obvious when I used the Mail gem to “decode & repair” the whole message body prior to analyzing it. There is a new version of my gem available on rubygems.org (link dysfunctional at 05:52 UTC).
... Now I receive “Email” which is basically empty, with just one link to an attached PNG containing the whole message, or just the result of some designer playing with her/his tools. I would not object to a PDF containing a hand-written letter and have done that myself already. But you should be warned in the message body.
The Internet is very sick.
Last edited by Michael Uplawski; 04-04-2018 at 12:53 AM.