LinuxQuestions.org - [SOLVED] [Nokogiri/Ruby] Extract captions and URLs from hyperlinks

[Nokogiri/Ruby] Extract captions and URLs from hyperlinks

Good morning.

I seek your advice on a programming problem with Nokogiri in a Ruby program.

This will be tricky, not in code but in language. I may be unable to express my simple question in plain English, so I ask for your patience; I may have to clarify things in a follow-up below. Do not hesitate to ask for details if this is not clear at first... (sigh).

Background.

My mail user agent is Mutt (you knew it). And thus the mails I get, mostly from administrative offices of all kinds, NEVER conform to any standard. I write “thus” because Mutt gives me a chance to notice.

One of the problems with HTML-only mail I solved with a script which extracts the hyperlinks and lists them below the mail text, together with the actual caption of each link in the mail. I now detect hyperlinks where I had not even suspected them before. The script is called from a Mutt macro, so I do not have to leave the mail client.

Problem.

This does not always work as it should.

Example.

... Wait a moment. This mail does not convey any private data, so I attach it below. It is a good example which may spare me more words. The file is in eml format and contains all parts, meaning that you can compare the HTML version to the plaintext version.

If a mail has no plaintext part, all I see is its HTML version, and in Mutt none of the links would be visible.

When I let my script munge this very mail (its HTML part), I get a list of defective links like this:
Screen Shot.

Log-output.

In the debug-log I find part of the answer:
The link is badly handled, and what was a URL like
http://eye.sbc09.com/m2?r=pTE1NDA5xBAUdld90NDQjEhD0J7Qsevh0NNwZ9DfxBDQzNCIPEbQzgfQqEbQkNCg0Jb_0NZV0KnQxrxtaWNoYWVsLnVwbGF3c2tpQHVwbGF3c2tpLmV1oJLEEFMIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQgbWFyYcOuY2hhZ2XEEAns0KzQsdDeSE9O0LTQgfln0KlL0NsIr0FORUZBIE5PUk1BTkRJRQ==
with a caption « Afficher le message dans mon navigateur » has become

Code:

a is <a href="3D%22http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=" hd0j7qsevh0nnwz9dfxbdqzncipebqzgfqqebqkncg0jb_0nzv0knqxrxtawnoywvslnvwbgf3c="2tpQHVwbGF3c2tpLmV1oJLEEFMIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQg=" bwfyycouy2hhz2xeeans0kzqsddese9o0ltqgfln0kll0nsir0foruzbie5puk1btkrjrq="3D=" target='3D"_blank"' style='3D"text-decoration:' none color:>Aff= icher le message dans mon navigateur</a>
MailLinks: DEBUG 8-41-54: href is: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=
MailLinks: DEBUG 8-41-54: link is ex href.content: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=

Edit:

This shows that the URL is chopped into pieces, for the first time around the sequence “NDQjEhD0J”, where it becomes “NDQjE=”. The remainder appears in the a-node after the href-attribute is already closed, in the form of additional attributes...

This makes me suspect a bug in Nokogiri.

Code

The code which produces the above log-output is quite straightforward. First I create, somewhere, a node-list with all the hyperlinks:

Code:

a_nodes = html_mail.xpath('.//a')

A few lines further down, a new HTML-structure is created, which will be displayed in w3m, in the end:

Code:

a_nodes.each_with_index do |a, i|
  @log.debug('a is ' << a.to_s)
  caption = a.inner_text.to_s.strip
  href = a.attribute('href')
  if(href)
    @log.debug('href is: ' << href.to_s)
    link = href.content
    @log.debug('link is ex href.content: ' << link.to_s)
    # mark the link in the mail text with its running number
    a.add_next_sibling("<span>[" << (i.next).to_s << "]</span>")
    # list caption and URL below the mail text
    dl.add_child("<dt>%i) %s</dt>"%[i.next, caption])
    dl.add_child("<dd style='white-space:nowrap;'><a href='%s'>%s</a></dd>"%[link,link])
  end
end

That's it. The utility is called “maillinks” and I appear to have published a gem-version in the past. As the version number is the same as my local gem, the available gem should be identical to the program I used above...

Question

What shall I do?

I forgot this, and because it is so nicely formatted:

Code:

user@machine:~$ nokogiri --version
# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.6.0
      platform: x86_64-linux
      description: ruby 2.6.0dev (2018-03-29 trunk 63037) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/var/lib/gems/2.6.0/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.7"
      libxslt_path: "/var/lib/gems/2.6.0/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.32"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.7
      loaded: 2.9.7

Okay, I got confused, because my routine failed only “sometimes”.

Reason: What I called a chopped URL, above, is actually already cut into pieces in the original mail! This has nothing to do with Quoted-Printable or Nokogiri.

I must find a trailing '=' right before a line-break inside the URL (href value), then eliminate it before joining the current string with the following line; and all before I let Nokogiri parse any HTML-mail.

Who has invented this...
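The manual repair described above could look roughly like the following sketch. The helper name join_soft_breaks and the sample string are made up for illustration and are not taken from maillinks. The sketch only removes a trailing '=' together with the line break that follows it; sequences such as =3D are left untouched and would still need proper decoding.

```ruby
# Hypothetical helper (not from maillinks): remove quoted-printable
# soft line breaks, i.e. an "=" immediately followed by a line end.
def join_soft_breaks(text)
  text.gsub(/=\r?\n/, '')
end

# Made-up sample in the style of the mail above; example.org is a placeholder.
raw = "<a href=3D\"http://example.org/m2?r=3DpTE1NDQjE=\nhD0J7Q\">Aff=\r\nicher</a>"

joined = join_soft_breaks(raw)
puts joined
```

The regular expression accepts both LF and CRLF line ends, since raw mail may contain either depending on how it was stored.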

Or simpler, still:

Code:

gem install mail

then

Code:

require 'mail'

mail_body = Mail::Encodings::QuotedPrintable.decode(mail_body)

Code:

user@machine:~$ ri Mail::Encodings::QuotedPrintable.decode

(from gem mail-2.7.0)
=== Implementation from QuotedPrintable
------------------------------------------------------------------------
  decode(str)

------------------------------------------------------------------------

Decode the string from Quoted-Printable. Cope with hard line breaks that
were incorrectly encoded as hex instead of literal CRLF.
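If adding a gem is not wanted, core Ruby can decode quoted-printable text as well, through String#unpack1 with the 'M' directive. This is a swapped-in alternative, not what maillinks does. The helper name decode_qp and the sample string are assumptions for illustration, and the result is labeled as UTF-8 only because this sample is known to be UTF-8; a real mail states its charset in the Content-Type header.

```ruby
# Hypothetical helper: decode quoted-printable with core Ruby only.
# unpack1('M') joins soft line breaks and turns =XX escapes into bytes.
def decode_qp(text, charset = 'UTF-8')
  # The decoded string is binary; label it with the charset taken from
  # the mail's Content-Type header (assumed to be UTF-8 in this sketch).
  text.unpack1('M').force_encoding(charset)
end

# Made-up sample; example.org is a placeholder.
sample = "<a href=3D\"http://example.org/m2?r=3DpTE1NDQjE=\nhD0J7Q\" target=3D\"_blank\">Aff=\nicher</a>"

decoded = decode_qp(sample)
puts decoded
```

The decoded string can then be handed to the HTML parser, so that the href value arrives in one piece.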

Here is the html-version of the man-page for maillinks.

Edit: Hallelujah. Now even the HTML-body of the original mail is formatted correctly... and looks much nicer than I am ready to admit.

I see this often in "raw" email...that is, unprocessed by any MUA. I have always presumed the "=" at the end of each line was somehow the result of MIME en/decoding. Here's a snippet of a "multi-part message in MIME format"...a spam message I'll be reporting to the sending ISP.

Code:

Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<div style=3D"background:#328EDF;border: 1px solid #ABC8E2; padding:45px 0 =
0; max-width:700px;font-family:helvetica; line-height:26px;">
                                <table cellspacing=3D"0" cellpadding=3D"0" style=3D"margin:0 auto;width=
: 100%;line-height:initial;">
                                <tr background=3D"" style=3D"text-align:center;">
                                <td style=3D"vertical-align:top;">
                                </td>
                                </tr>
                                <tr>
                                <td style=3D"text-align:left;font-size:15px;">
                                <div style=3D"background:#ffffff;padding: 14px 77px; margin:0 auto; col=
or:#333;line-height: 28px;min-height: 150px;">
                                                                        =09
                                <div>Hi <span style=3D"font-weight: bold;">Mailer-Daemon,</span></div>H=
i,Wished to check if you would be interested in&nbsp;achieving Six Sigma Gr=
een Belt (SSGB) Training and Certification&nbsp;in 3 Days&nbsp;at your loca=
tion.<span><span>Batch 1:&nbsp; April 09th&nbsp;To April 11th&nbsp;2018<br>=
[snip]

Note that the trailing = appears to be a function of line length, or something. It's not specific to URLs, but I can see how it's messing up your process.
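The dependence on line length can be reproduced with core Ruby, whose Array#pack directive 'M' encodes to quoted-printable. The quoted-printable encoding (RFC 2045) limits encoded lines to 76 characters, so an encoder has to insert soft breaks ('=' plus line end) into anything longer. The sample text below is made up.

```ruby
# Made-up long line without any line break of its own.
long_line = 'padding:45px 0 0; max-width:700px; font-family:helvetica; ' * 3

# Array#pack('M') encodes to quoted-printable and inserts "=\n"
# soft breaks so that no encoded line grows too long.
encoded = [long_line].pack('M')
puts encoded

# Decoding removes the soft breaks again and restores the original text.
restored = encoded.unpack1('M')
puts restored == long_line
```

Where exactly the breaks land depends on the encoder, which is why they can fall in the middle of a URL, an attribute name or an ordinary word.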

When I've had occasion to (manually) tweak this, I've just replaced a = at end of line with nothing and deleted the EOL:

Code:

sed 's/=$//'

(but I don't know how to program the join to the next line...sorry)

Maybe you already know all that...just wanted to share that it's not a URL specific thing.

Quote:

Originally Posted by scasey (Post 5839047)

just wanted to share that it's not a URL specific thing.

Yes, that finally became obvious when I used the Mail-gem to “decode & repair” the whole message-body prior to analyzing it. There is a new version of my gem available on rubygems.org (link dysfunctional at 05:52 UTC).

... Now I receive “Email” which is basically empty, with just 1 link to an attached PNG containing the whole message, or just the result of some designer playing with her/his tools. I would not object to a PDF containing a hand-written letter and have done that myself already. But you should be warned in the message-body.

The Internet is very sick.