LinuxQuestions.org
04-03-2018, 01:58 AM   #1
Michael Uplawski
Member
Registered: Dec 2015
Location: Normandy, France
Distribution: Debian buster/sid
Posts: 687
Blog Entries: 22
Rep: Reputation: 420
[Nokogiri/Ruby] Extract captions and URLs from hyperlinks


Good morning.

I am looking for advice on a programming problem with Nokogiri in a Ruby program.

This will be tricky, not in code but in language. I may not manage to express my simple question in plain English, so please be patient; I may have to clarify things in a follow-up below. Do not hesitate to ask for details if anything is unclear at first... (sigh).

Background.
My mail user agent is Mutt (you knew it). And thus, the mails I get, mostly from administrative offices of all kinds, NEVER conform to any standard. I write “thus” because Mutt gives me a chance to notice.

I solved one of the problems with HTML-only mail with a script that extracts the hyperlinks and lists them below the mail text, together with the actual caption of each link in the mail. I now detect hyperlinks where I had not even suspected them before. The script is called from a Mutt macro, so I do not have to leave the mail client.
Problem.
This does not always work as it should.
Example.
... Wait a moment. This mail does not contain any private data, so I attach it below. It is a good example which may spare me more words. The file is in eml format and contains all parts, meaning that you can compare the HTML version to the plaintext version.

If a mail has no plaintext part, all I see is its HTML version, and in Mutt none of the links would be visible.

When I let my script munge this very mail (its HTML part), I get a list of defective links like this:
Screen Shot.
Log-output.
In the debug-log I find part of the answer:
The link is badly handled, and what was a URL like
http://eye.sbc09.com/m2?r=pTE1NDA5xBAUdld90NDQjEhD0J7Qsevh0NNwZ9DfxBDQzNCIPEbQzgfQqEbQkNCg0Jb_0NZV0KnQxrxtaWNoYWVsLnVwbGF 3c2tpQHVwbGF3c2tpLmV1oJLEEF
+MIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQgbWFyYcOuY2hhZ2XEEAns0KzQsdDeSE9O0LTQgfln0KlL0NsIr0FORUZBIE5P Uk1BTkRJRQ==
with the caption “Afficher le message dans mon navigateur” has become
Code:
 a is <a href="3D%22http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=" hd0j7qsevh0nnwz9dfxbdqzncipebqzgfqqebqkncg0jb_0nzv0knqxrxtawnoywvslnvwbgf3c="2tpQHVwbGF3c2tpLmV1oJLEEFMIEefQ2uj-TNCEYDvQzvnQy9CUDLtIb3J0aWN1bHR1cmUgZXQg=" bwfyycouy2hhz2xeeans0kzqsddese9o0ltqgfln0kll0nsir0foruzbie5puk1btkrjrq="3D=" target='3D"_blank"' style='3D"text-decoration:' none color:>Aff=
icher le message dans mon navigateur</a>
MailLinks: DEBUG 8-41-54: href is: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=
MailLinks: DEBUG 8-41-54: link is ex href.content: 3D"http://eye.sbc09.com/m2?r=3DpTE1NDA5xBAUdld90NDQjE=
Edit:
This shows that the URL is chopped into pieces, the first time around the sequence “NDQjEhD0J”, where it becomes “NDQjE=”, and the remainder appears in the a-node after the href attribute is already closed, like additional attributes...
This leads me to suspect a bug in Nokogiri.

Code
The code which produces the above log output is quite straightforward. First I create, somewhere, a node list with all the hyperlinks:
Code:
a_nodes = html_mail.xpath('.//a')
A few lines further down, a new HTML-structure is created, which will be displayed in w3m, in the end:
Code:
# Walk over all hyperlinks; i is zero-based, the list is numbered from 1.
a_nodes.each_with_index do |a, i|
  @log.debug('a is ' << a.to_s)
  caption = a.inner_text.to_s.strip
  href = a.attribute('href')
  if(href)
    @log.debug('href is: ' << href.to_s)
    link = href.content
    @log.debug('link is ex href.content: ' << link.to_s)
    # Mark the link in the mail text with its number ...
    a.add_next_sibling("<span>[" << (i.next).to_s << "]</span>")
    # ... and list caption and URL in the definition list dl.
    dl.add_child("<dt>%i) %s</dt>"%[i.next, caption])
    dl.add_child("<dd style='white-space:nowrap;'><a href='%s'>%s</a></dd>"%[link,link])
  end
end
That's it. The utility is called “maillinks”, and I appear to have published a gem version of it in the past. As its version number is the same as that of my local gem, the published gem should be identical to the program I used above...

Question
What shall I do?
Attached Thumbnails: sc_mutt_links.png (55.4 KB)
Attached Files: mail_ex.eml.txt (32.0 KB)

Last edited by Michael Uplawski; 04-03-2018 at 07:49 AM.
 
04-03-2018, 04:06 AM   #2
Michael Uplawski
Original Poster
I forgot this, and I add it because it is so nicely formatted:
Code:
user@machine:~$ nokogiri --version
# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.6.0
      platform: x86_64-linux
      description: ruby 2.6.0dev (2018-03-29 trunk 63037) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/var/lib/gems/2.6.0/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.7"
      libxslt_path: "/var/lib/gems/2.6.0/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.32"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.7
      loaded: 2.9.7
 
04-03-2018, 07:48 AM   #3
Michael Uplawski
Original Poster
Okay, I got confused because my routine failed only “sometimes”.

Reason: What I called a chopped URL above is actually already cut into pieces in the original mail! This has nothing to do with Quoted-Printable or Nokogiri.

I must find a trailing '=' right before a line break inside the URL (href value), then remove it and join the current line with the following one, all before I let Nokogiri parse any HTML mail.
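
A minimal sketch of that step in Ruby, assuming the HTML part is already available as a string. `join_soft_breaks` is a hypothetical helper name, not part of maillinks, and the sample URL is made up. The sketch only joins the lines; sequences such as `=3D` are left untouched.

```ruby
# Hypothetical helper (not part of maillinks): remove a trailing '='
# together with the line break that follows it, so that the two
# pieces of a wrapped line are joined again.
def join_soft_breaks(text)
  # Handles both LF and CRLF line ends.
  text.gsub(/=\r?\n/, '')
end

# Made-up sample in the style of the mail shown above.
raw = "<a href=3D\"http://example.org/m2?r=3DpTE1NDA5xBAU=\r\nhD0J7Qsevh\">Aff=\nicher</a>"

joined = join_soft_breaks(raw)
puts joined
```

The join is applied to the whole text rather than to href values only, because the caption (“Aff=” / “icher”) is split in the same way.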

Who invented this...
 
04-03-2018, 11:12 AM   #4
Michael Uplawski
Original Poster
Or, simpler still:

Code:
gem install mail
then

Code:
require 'mail'
mail_body = Mail::Encodings::QuotedPrintable.decode(mail_body)
Code:
user@machine~$ ri Mail::Encodings::QuotedPrintable.decode

(from gem mail-2.7.0)
=== Implementation from QuotedPrintable
------------------------------------------------------------------------
  decode(str)

------------------------------------------------------------------------

Decode the string from Quoted-Printable. Cope with hard line breaks that
were incorrectly encoded as hex instead of literal CRLF.
Here is the HTML version of the man page for maillinks.

Edit: Hallelujah. Now even the HTML body of the original mail is formatted correctly... and looks much nicer than I am ready to admit.
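
Where installing the gem is not an option, Ruby's core `String#unpack1` with the `M` directive also decodes Quoted-Printable. What follows is only a minimal sketch, assuming the HTML part is UTF-8 encoded; `decode_qp` is a hypothetical helper name and the sample string is made up. It makes no claim to behave like the Mail gem on malformed input.

```ruby
# Hypothetical helper: decode a Quoted-Printable string with Ruby's
# core String#unpack1 ('M' directive), no gem needed.
def decode_qp(text, charset = 'UTF-8')
  # unpack1('M') returns a binary string; tag it with the charset
  # announced in the part's Content-Type header (assumed UTF-8 here).
  text.unpack1('M').force_encoding(charset)
end

# Made-up sample: '=3D' is an encoded '=', '=C3=AE' is an encoded 'î',
# and '=' at the end of a line is a soft line break.
raw = "<a href=3D\"http://example.org/m2?r=3Dabc=\ndef\" target=3D\"_blank\">mara=C3=AEchage</a>\n"

html = decode_qp(raw)
puts html
```

As with the gem, the decoded string is what should be handed to Nokogiri, so that href values arrive in one piece.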

Last edited by Michael Uplawski; 04-04-2018 at 12:44 AM. Reason: confusing the confusion will not make it go away
 
04-03-2018, 06:07 PM   #5
scasey
Senior Member
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.5
Posts: 1,612
Rep: Reputation: 527
I see this often in "raw" email...that is, unprocessed by any MUA. I have always presumed the "=" at the end of each line was somehow the result of MIME en/decoding. Here's a snippet of a "multi-part message in MIME format"...a spam message I'll be reporting to the sending ISP.
Code:
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<div style=3D"background:#328EDF;border: 1px solid #ABC8E2; padding:45px 0 =
0; max-width:700px;font-family:helvetica; line-height:26px;">
                                <table cellspacing=3D"0" cellpadding=3D"0" style=3D"margin:0 auto;width=
: 100%;line-height:initial;">
                                <tr background=3D"" style=3D"text-align:center;">
                                <td style=3D"vertical-align:top;">
                                </td>
                                </tr>
                                <tr>
                                <td style=3D"text-align:left;font-size:15px;">
                                <div style=3D"background:#ffffff;padding: 14px 77px; margin:0 auto; col=
or:#333;line-height: 28px;min-height: 150px;">
                                                                        =09
                                <div>Hi <span style=3D"font-weight: bold;">Mailer-Daemon,</span></div>H=
i,Wished to check if you would be interested in&nbsp;achieving Six Sigma Gr=
een Belt (SSGB) Training and Certification&nbsp;in 3 Days&nbsp;at your loca=
tion.<span><span>Batch 1:&nbsp; April 09th&nbsp;To April 11th&nbsp;2018<br>=
[snip]
Note that the trailing = appears to be a function of line length, or something. It's not specific to URLs, but I can see how it's messing up your process.
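
The line-length guess matches the Quoted-Printable definition in RFC 2045: encoded lines may be at most 76 characters long, and every break the encoder inserts is marked with a trailing `=`, a so-called soft line break. A minimal sketch with Ruby's core `Array#pack` illustrates this; the input text is made up.

```ruby
# A made-up line that is far longer than 76 characters.
long_line = 'Six Sigma Green Belt training at your location. ' * 5

# Array#pack with the 'M' directive encodes as Quoted-Printable and
# wraps long lines, ending each wrapped line with '=' (soft line break).
encoded = [long_line].pack('M')
puts encoded

# Decoding removes the soft line breaks and restores the original line.
decoded = encoded.unpack1('M')
puts decoded == long_line
```

Since the breaks depend only on the length of the encoded line, they can fall anywhere: inside a URL, an attribute or a word.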

When I've had occasion to (manually) tweak this, I've just replaced a = at end of line with nothing and deleted the EOL:
Code:
sed 's/=$//'
(but I don't know how to program the join to the next line...sorry)

Maybe you already know all that...just wanted to share that it's not a URL specific thing.
 
04-04-2018, 12:50 AM   #6
Michael Uplawski
Original Poster
Quote:
Originally Posted by scasey View Post
just wanted to share that it's not a URL specific thing.
Yes, that finally became obvious when I used the Mail gem to “decode & repair” the whole message body prior to analyzing it. There is a new version of my gem available on rubygems.org (link dysfunctional at 05:52 UTC).

... Now I receive “Email” which is basically empty, with just one link to an attached PNG containing the whole message, or just the result of some designer playing with her/his tools. I would not object to a PDF containing a hand-written letter and have done that myself already. But you should be warned in the message body.

The Internet is very sick.

Last edited by Michael Uplawski; 04-04-2018 at 12:53 AM.
 
  



Tags
html-mail, mutt, nokogiri, ruby

