LinuxQuestions.org


Michael Uplawski 04-03-2018 01:58 AM

[Nokogiri/Ruby] Extract captions and URLs from hyperlinks
 
2 Attachment(s)
Good morning.

I seek your advice to solve a programming problem with Nokogiri in a Ruby-program.

This will be tricky, not in code but in language. I may be unable to express my simple question in plain English, so I ask for your patience; I may have to clarify things in a follow-up below. Do not hesitate to ask for details if anything is unclear at first... (sigh).

Background.
My mail user agent is Mutt (you knew it). And thus, the mails I get, mostly from administrative offices of all kinds, NEVER conform to any standard. I write “thus” because Mutt gives me a chance to notice.

One of those problems with HTML-only mail I solved with a script which extracts hyperlinks and lists them below the mail text, together with the actual caption of each link in the mail. I detect hyperlinks where I had not even suspected them before. The script is called from a Mutt macro, so I do not have to leave the mail client.
Problem.
This does not always work as it should.
Example.
... Wait a moment. This mail does not convey any private data, so I attach it below. It is a good example which may spare me more words. The file is in eml format and contains all parts, meaning that you can compare the HTML version to the plaintext version.

If a mail has no plaintext part, all I see is its HTML version, and in Mutt none of the links would be visible.

When I let my script munge this very mail (its HTML part), I get a list of defective links like this:
Screen Shot.
Log-output.
In the debug log I find part of the answer:
The link is badly handled, and what was a URL like
http://eye.sbc09.com/m2?r=pTE1NDA5xBAUdld90NDQjEhD0J7Qsevh0NNwZ9DfxBDQzNCIPEbQzgfQqEbQkNCg0Jb_0NZV0KnQxrxtaWNoYWVsLnVwbGF 3c2tpQHVwbGF3c2tpLmV1oJLEEF
+MIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQgbWFyYcOuY2hhZ2XEEAns0KzQsdDeSE9O0LTQgfln0KlL0NsIr0FORUZBIE5P Uk1BTkRJRQ==
with a caption « Afficher le message dans mon navigateur » has become
Code:

a is <a href="3D%22http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=" hd0j7qsevh0nnwz9dfxbdqzncipebqzgfqqebqkncg0jb_0nzv0knqxrxtawnoywvslnvwbgf3c="2tpQHVwbGF3c2tpLmV1oJLEEFMIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQg=" bwfyycouy2hhz2xeeans0kzqsddese9o0ltqgfln0kll0nsir0foruzbie5puk1btkrjrq="3D=" target='3D"_blank"' style='3D"text-decoration:' none color:>Aff=
icher le message dans mon navigateur</a>
MailLinks: DEBUG 8-41-54: href is: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=
MailLinks: DEBUG 8-41-54: link is ex href.content: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=

Edit:
This shows that the URL is chopped into pieces, the first time around the sequence
“NDQjEhD0J”, where it becomes “NDQjE=”; the remainder appears in the a-node after the href attribute is already closed, as if it were an additional attribute...
This lets me assume a bug in Nokogiri.

Code
The code which produces the above log output is quite straightforward. First I create, somewhere, a node list with all the hyperlinks:
Code:

a_nodes = html_mail.xpath('.//a')
A few lines further down, a new HTML-structure is created, which will be displayed in w3m, in the end:
Code:

      # Walk over all hyperlinks; i is zero-based, so i.next is the number shown.
      a_nodes.each_with_index do |a, i|
        @log.debug('a is ' << a.to_s)
        caption = a.inner_text.to_s.strip
        href = a.attribute('href')
        if(href)
          @log.debug('href is: ' << href.to_s)
          link = href.content
          @log.debug('link is ex href.content: ' << link.to_s)
          # Mark the link in the mail text with its number: [1], [2], ...
          a.add_next_sibling("<span>[" << (i.next).to_s << "]</span>")
          # Append caption and URL as <dt>/<dd> entries to dl.
          dl.add_child("<dt>%i) %s</dt>"%[i.next, caption])
          dl.add_child("<dd style='white-space:nowrap;'><a href='%s'>%s</a></dd>"%[link,link])
        end
      end

That's it. The utility is called “maillinks” and I appear to have published a gem-version in the past. As the version number is the same as my local gem, the available gem should be identical to the program I used above...

Question
What shall I do?

Michael Uplawski 04-03-2018 04:06 AM

Forgot, and because it is so nicely formatted:
Code:

user@machine:~$ nokogiri --version
# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.6.0
      platform: x86_64-linux
      description: ruby 2.6.0dev (2018-03-29 trunk 63037) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/var/lib/gems/2.6.0/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.7"
      libxslt_path: "/var/lib/gems/2.6.0/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.32"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.7
      loaded: 2.9.7


Michael Uplawski 04-03-2018 07:48 AM

Okay, I got confused, because my routine failed only “sometimes”.

Reason: What I called a chopped URL above is actually already cut into pieces in the original mail! So this is not a bug in Nokogiri; the trailing '=' signs are the soft line breaks of the Quoted-Printable transfer encoding.

I must find a trailing '=' right before a line break inside the URL (href value), then eliminate it and join the current line with the following one, all before I let Nokogiri parse any HTML mail.
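
The repair described above can be sketched in a few lines of Ruby. This is only an illustration under assumptions: `join_soft_breaks` is a made-up helper name and not part of maillinks, the sample string is invented, and the helper deals only with the trailing '=' line breaks. Escapes such as `=3D` are left untouched, so it is not a complete decoder.

```ruby
# Hypothetical helper (not part of maillinks): remove every "=" that sits
# directly before a line break, together with the line break itself, so
# that the two halves of a folded line are joined again.
def join_soft_breaks(text)
  text.gsub(/=\r?\n/, '')
end

# Invented sample in the style of the log output above.
raw = "<a href=3D\"http://example.org/m2?r=3DpTE1NDA5=\r\nxBAUdld90\">Aff=\r\nicher</a>"
puts join_soft_breaks(raw)
```

The URL and the caption come out in one piece again, but the `=3D` sequences are still there and would have to be decoded in a further step.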

Who has invented this...

Michael Uplawski 04-03-2018 11:12 AM

Or simpler, still:

Code:

gem install mail
then

Code:

require 'mail'
mail_body = Mail::Encodings::QuotedPrintable.decode(mail_body)

Code:

user@machine~$ ri Mail::Encodings::QuotedPrintable.decode

(from gem mail-2.7.0)
=== Implementation from QuotedPrintable
------------------------------------------------------------------------
  decode(str)

------------------------------------------------------------------------

Decode the string from Quoted-Printable. Cope with hard line breaks that
were incorrectly encoded
as hex instead of literal CRLF.
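
For comparison, Ruby can also decode Quoted-Printable without an additional gem: `String#unpack1` with the `M` directive belongs to the core library. The following is a minimal sketch, assuming that the string passed in is only the quoted-printable body part (headers and other MIME parts already removed) and that the part is declared as UTF-8. The method name `decode_qp` and the sample string are invented for illustration. Unlike the Mail gem, it makes no attempt to repair incorrectly encoded line breaks.

```ruby
# Decode a quoted-printable string with Ruby's core library only.
# Assumption: `body` is just the encoded part and its charset is UTF-8.
def decode_qp(body)
  # Convert CRLF line ends to LF first; this also turns hard line breaks
  # into plain LF. Then let the "M" directive undo the encoding.
  body.gsub("\r\n", "\n").unpack1('M').force_encoding(Encoding::UTF_8)
end

# Invented sample: a folded hyperlink as it appears in a raw mail.
sample = "<a href=3D\"http://example.org/?r=3Dabc=\r\ndef\" target=3D\"_blank\">Aff=\r\nicher le message</a>"
puts decode_qp(sample)
```

Both the soft line breaks and the `=3D` escapes disappear in one step, so the result can be handed to an HTML parser.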

Here is the html-version of the man-page for maillinks.

Edit: Hallelujah. Now even the HTML body of the original mail is formatted correctly... and looks much nicer than I am ready to admit.

scasey 04-03-2018 06:07 PM

I see this often in "raw" email...that is, unprocessed by any MUA. I have always presumed the "=" at the end of each line was somehow the result of MIME en/decoding. Here's a snippet of a "multi-part message in MIME format"...a spam message I'll be reporting to the sending ISP.
Code:

Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<div style=3D"background:#328EDF;border: 1px solid #ABC8E2; padding:45px 0 =
0; max-width:700px;font-family:helvetica; line-height:26px;">
                                <table cellspacing=3D"0" cellpadding=3D"0" style=3D"margin:0 auto;width=
: 100%;line-height:initial;">
                                <tr background=3D"" style=3D"text-align:center;">
                                <td style=3D"vertical-align:top;">
                                </td>
                                </tr>
                                <tr>
                                <td style=3D"text-align:left;font-size:15px;">
                                <div style=3D"background:#ffffff;padding: 14px 77px; margin:0 auto; col=
or:#333;line-height: 28px;min-height: 150px;">
                                                                        =09
                                <div>Hi <span style=3D"font-weight: bold;">Mailer-Daemon,</span></div>H=
i,Wished to check if you would be interested in&nbsp;achieving Six Sigma Gr=
een Belt (SSGB) Training and Certification&nbsp;in 3 Days&nbsp;at your loca=
tion.<span><span>Batch 1:&nbsp; April 09th&nbsp;To April 11th&nbsp;2018<br>=
[snip]

Note that the trailing = appears to be a function of line length, or something. It's not specific to URLs, but I can see how it's messing up your process.
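
The line-length observation can be checked with Ruby's core `Array#pack`, whose `M` directive produces Quoted-Printable. A small sketch with an invented sample line that contains no hyperlink at all: the encoder folds the long line with a trailing `=` once a length limit is reached and writes the literal `=` as `=3D`. Real mailers may fold at slightly different columns; RFC 2045 limits encoded lines to 76 characters.

```ruby
# One long line of HTML without any hyperlink (invented sample data).
line = '<div style="background:#ffffff;padding:14px 77px;margin:0 auto;' \
       'color:#333;line-height:28px;min-height:150px;">Hi</div>' + "\n"

encoded = [line].pack('M')   # encode as quoted-printable
puts encoded

# Decoding removes the soft line breaks and restores the original line.
puts encoded.unpack1('M') == line
```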

When I've had occasion to (manually) tweak this, I've just replaced a = at end of line with nothing and deleted the EOL:
Code:

sed 's/=$//'
(but I don't know how to program the join to the next line...sorry)

Maybe you already know all that...just wanted to share that it's not a URL specific thing.

Michael Uplawski 04-04-2018 12:50 AM

Quote:

Originally Posted by scasey (Post 5839047)
just wanted to share that it's not a URL specific thing.

Yes, that finally became obvious when I used the Mail gem to “decode & repair” the whole message body prior to analyzing it. There is a new version of my gem available on rubygems.org (link dysfunctional at 05:52 UTC).

... Now I receive “Email” which is basically empty, with just one link to an attached PNG containing the whole message, or just the result of some designer playing with her/his tools. I would not object to a PDF containing a hand-written letter and have done that myself already. But you should be warned in the message body.

The Internet is very sick.


All times are GMT -5.