LinuxQuestions.org

Generate a glossary from HTML

Posted 09-09-2017 at 06:34 AM by Michael Uplawski
Updated 08-11-2023 at 02:28 AM by Michael Uplawski (list format, formatting and outlook)

HTML2INDEX

Install as a ruby-gem:
Code:
:~$ gem install html2index
Read the RDoc: http://www.rubydoc.info/gems/html2index/1.1

This program creates an index or glossary of marked expressions in an HTML-file.

The current man-page is here:

--------------------------

HTML2Index


Creates an index or glossary of marked expressions in an HTML-file


SYNOPSIS
html2index -s input.html [-o output.html] [-t template.html] [-c config] [-d]
html2index -v
html2index -h

DESCRIPTION
The program identifies all marked expressions in an HTML source, copies them to the generated index and searches the given dictionaries on the Web for an explanation of each expression.
The resulting glossary is written to a new HTML-file, or its HTML-code is printed to STDOUT.
NOTE: The default dictionaries are French: Larousse and JargonF. Non-French speakers MUST define their own dictionaries by editing the configuration file ~/.config/HTML2Index/config as described under Configuration below.

OPTIONS
-d, --debug:
Be verbose
-s, --source=SOURCE:
Source is the original HTML-file which contains marked expressions (see Preparations, below).
-o, --out=GLOSSARY:
Glossary is the generated file in HTML-format.
-t, --template=TEMPLATE:
An HTML-file containing placeholders for the references to the dictionaries used and for the generated glossary. With the default field-delimiter, the placeholders are written as %-dict_list-% and %-glossary-%. You can set different field-delimiters and names in the configuration-file. See below under EXAMPLE-Template for a rudimentary example.
-c, --config=CONFIG:
Configuration-file. Command-line arguments override the settings in this file. After the first program-execution, you will find a functional configuration in ~/.config/HTML2Index. The file is commented and can immediately be adapted to your needs.

Common Options
-h, --help Show this message
-v, --version Show program version
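The rule that command-line arguments override the configuration-file can be sketched in Ruby with the standard OptionParser library. This is only an illustration and not the code of html2index: the method name effective_settings and the CONFIG_DEFAULTS hash are hypothetical stand-ins for whatever the program really reads from its configuration.

```ruby
require 'optparse'

# Hypothetical defaults, standing in for values read from the config-file.
CONFIG_DEFAULTS = { debug: false, template: nil, out: nil }.freeze

# Parses the arguments and lets them win over the configured values.
def effective_settings(argv, config = CONFIG_DEFAULTS)
  cli = {}
  parser = OptionParser.new do |opts|
    opts.on('-d', '--debug', 'Be verbose') { cli[:debug] = true }
    opts.on('-s', '--source=SOURCE', 'Source document (html)') { |v| cli[:source] = v }
    opts.on('-o', '--out=GLOSSARY', 'Glossary file (html)') { |v| cli[:out] = v }
    opts.on('-t', '--template=TEMPLATE', 'Template (html)') { |v| cli[:template] = v }
  end
  parser.parse(argv)
  # Keys given on the command-line replace those from the configuration.
  config.merge(cli)
end

p effective_settings(%w[-s input.html -o glossary.html -d])
```

Settings which are not named on the command-line keep the value from the configuration, which is the behaviour described for the -c option.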

EXAMPLE Usage
The example uses an HTML-page in the French language, touchpad_fr.html, which contains instructions on how to enable and disable a touchpad using the xinput command. Any other HTML-file in any other language will do as well.

Execution
Executing HTML2Index with the -s argument and the HTML-file as its value, like this:
Code:
:~$ html2index -s /[path]/touchpad_fr.html
will produce output like the following, with expressions from the HTML-file explained in the French language:
Code:
vi
  (JargonF): 1.  *[Unix]. «*Visual Interface*» (littéralement, «*interface
  visuelle*», ça ne s'invente pas*!) éditeur de texte du pléistocène codé par
  Bill Joy, aussi fondateur de Sun.  Des aficionados d'Unix s'en servent
  encore, même s'il est très concurrencé par Emacs. Son principal avantage
  est que quel que soit l'état de votre système (par exemple complètement
  déglingué ou allégé) il a de fortes chances de fonctionner encore
  correctement.
  La version la plus répandue est vim.

  2.  *[nom de domaine]. Nom de domaine de premier niveau des îles Vierges étasuniennes.
------------------------------------------------------------
xinput
 (JargonF): commande. *[X11] Utilitaire facilitant la gestion des
 périphériques X Window d'entrée.  Il peut en fournir la liste, détailler
 leurs propriétés et modifier celles qui peuvent l'être.
 http://www.souris-libre.fr/savoir_faire/touchpad/touchpad_fr.html Exemple
 d'utilisation: désactivation et activation rapide du pavé tactile.
If you name an output file with the -o option, html2index will direct its output in HTML-format to this file.

Preparations


Mark catchwords
In the source-code of the original HTML-page, expressions for the future glossary are marked by means of
  • a tag
  • an attribute of this tag
  • the value of the attribute.
By default, the span-tag with the attribute lang="fy" is used. 'fy' means Frisian, a language which is rarely used on the Web, I venture.
Example:
Code:
<span lang="fy" xml:lang="fy">pavé tactile</span>
You can, though, define your own tag, attribute and attribute-value if you prefer to mark expressions in your original HTML-file differently.
Example:
Code:
<em class="expression">Tripane</em>
Remember that you can combine css classes and thus economize on html-elements, if you use them anyway to style your html-content. This would complicate the task for html2index only a little bit, as we will see further below.
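How marked expressions could be collected may be sketched as follows. This is a minimal illustration and not the implementation of html2index: the helper marked_expressions is hypothetical, and the sketch assumes well-formed markup because it uses REXML, which ships with Ruby. Real-world HTML would probably call for a more lenient parser. An attribute is split into its space-separated tokens, so an element which combines several css classes is still found.

```ruby
require 'rexml/document'

# Hypothetical helper, not part of html2index: returns the text of every
# element which carries the configured tag, attribute and attribute-value.
def marked_expressions(html, tag: 'span', attribute: 'lang', value: 'fy')
  doc = REXML::Document.new(html)
  marked = REXML::XPath.match(doc, "//#{tag}").select do |element|
    # An attribute like class may hold several space-separated tokens.
    element.attributes[attribute].to_s.split.include?(value)
  end
  marked.map { |element| element.text.to_s.strip }.uniq
end

page = <<~HTML
  <html><body>
    <p>Le <span lang="fy">pavé tactile</span> se règle avec
    <span lang="fy">xinput</span>.</p>
    <p><em class="note expression">Tripane</em></p>
  </body></html>
HTML

puts marked_expressions(page)
puts marked_expressions(page, tag: 'em', attribute: 'class', value: 'expression')
```

The first call uses the default marker and finds the two span-elements. The second call uses a custom marker and finds the em-element, although its class attribute names two classes.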

Configuration
Apart from the way that expressions are marked in the original html, you can prepare a few settings for HTML2Index, which influence its behaviour. Command-line options override the values stored in the configuration-file.
A default configuration will be stored in the file ~/.config/HTML2Index/config the first time that you run html2index. It should be sufficiently commented to allow you to comprehend and alter any values in the file.
However, an explanation of each one of the available variables follows:
debug:
Does the same as the command-line options '-d' or '--debug'. Accepts the values false or true, or can be left empty. If set to true, this setting causes html2index to be very verbose. Usually, you do not need to change the default value of this variable, which is false.
dictionaries:
Here, you HAVE to define the online-dictionaries to consult if you do not want to stick with the defaults. These are Larousse and JargonF, two French sites which provide explanations in the French language only.
Each dictionary is defined with four variables: name, url, xpath, color. Each dictionary-definition must start with a dash, followed by a white-space, then the first variable. Each variable-name must be enclosed by colons (see comments in the config-file).
name:
The name of the dictionary, as it will be referred to in the glossary. An example could be 'Merriam-Webster'.
url:
The part of the url to a search-result in the chosen dictionary which precedes the searched expression. You determine this string by doing a search in the online-dictionary and copying the url as it is displayed in your browser. Rearrange possible request-parameters (following '?') to ensure that the searched word or expression is the very last item in the url. Then remove only the searched expression and note the remainder as the value of the variable url.
xpath:
This is the xpath which identifies the HTML-element in a search-result which contains the explanation of an expression. Many resources on the Web explain how to compose an xpath. Be as specific as possible to avoid a misinterpretation of the xpath-expression, and use HTML-attributes which may be applied to an HTML container-tag. Especially id, if present, but also css-classes can help to identify a tag unambiguously.
color:
A hexadecimal rgb color value in single quotes is attributed to each dictionary, so that you can tell in the glossary which dictionary provides a specific explanation. Example colors are '800000' or '500050'. Take care to choose colors which harmonize with the background in your template-file, if you use one.
template:
An HTML-file which contains placeholders. Two placeholders are needed at the time of this writing: one to name the dictionaries which are used to look up definitions, another to mark the spot where the new glossary will be written. See below under EXAMPLE-Template for a rudimentary example. The default template is internally defined.
fdelim:
A character sequence which is used to mark placeholders in the HTML-template file. The default is '%-', meaning that a percent-symbol followed by a dash marks the beginning and a dash followed by a percent-symbol marks the end of a placeholder, as in %-dict_list-% for the placeholder named 'dict_list'.
placeholders:
A list of placeholder names. Currently, html2index recognizes only two placeholders: dict_list and glossary. As the value of each of these two variables, note the name that you chose for the placeholder in your HTML-template. The defaults are dict_list for dict_list and index for glossary.
html_tag:
This is the tag which encloses marked expressions in the original HTML-page (the source-file). Default is span.
html_attribute:
An attribute of the html_tag which encloses marked expressions in the original HTML-page (the source-file). Default is lang.
html_value:
The value of an attribute of the html_tag which encloses marked expressions in the original HTML-page (the source-file). Default is fy.
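The interplay of the url and xpath variables of a dictionary can be illustrated with a short Ruby sketch. Everything in it is an assumption made for the example: the Dictionary struct, the two helper methods, the site dictionary.example and the canned result-page are hypothetical and do not reproduce the code of html2index. The sketch appends the percent-encoded expression to the url-prefix and then applies the xpath to a result-page to extract the explanation.

```ruby
require 'erb'
require 'rexml/document'

# Hypothetical container for the four variables of one dictionary.
Dictionary = Struct.new(:name, :url, :xpath, :color, keyword_init: true)

# The searched expression is appended as the very last item of the url.
def lookup_url(dictionary, expression)
  dictionary.url + ERB::Util.url_encode(expression)
end

# Returns the text of the first element which the xpath identifies.
def explanation(result_html, xpath)
  node = REXML::XPath.first(REXML::Document.new(result_html), xpath)
  node && node.text.to_s.strip
end

example = Dictionary.new(
  name:  'Example Dictionary',                   # placeholder name
  url:   'https://dictionary.example/search?q=', # placeholder url-prefix
  xpath: "//div[@id='definition']",              # placeholder xpath
  color: '800000'
)

puts lookup_url(example, 'pavé tactile')
# → https://dictionary.example/search?q=pav%C3%A9%20tactile

# A canned, well-formed result-page replaces the real Web request here.
result_page = '<html><body><div id="definition">A pointing device.</div></body></html>'
puts explanation(result_page, example.xpath)
```

An xpath which selects the element by its id, as in the sketch, is less likely to match an unrelated part of the page than one which only names the tag.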

EXAMPLE-Template

Assuming that the defaults are used, the following could be a working HTML-template to use with HTML2Index:
Code:
<html>
  <head><title>Glossary</title></head>
  <body>
    <h1>Glossary</h1>
    <h2>Dictionaries used to produce this glossary</h2>
    <!-- will be replaced by an unnumbered list <ul><li> ... </li></ul> -->
    %-dict_list-%
    <h2>Definitions</h2>
    <!-- will be replaced by a definition list <dl><dt>...</dt><dd>...</dd></dl> -->
    %-glossary-%
  </body>
</html>
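How placeholders in such a template might be filled can be sketched in Ruby as follows. The method fill_template is hypothetical and only illustrates the principle. It assumes, in line with the description of fdelim, that the closing delimiter is the opening delimiter in reverse order.

```ruby
# Hypothetical sketch: replaces each delimited placeholder by generated HTML.
def fill_template(template, values, fdelim: '%-')
  opening = Regexp.escape(fdelim)
  closing = Regexp.escape(fdelim.reverse)
  template.gsub(/#{opening}(\w+)#{closing}/) do
    # Unknown placeholders are left untouched.
    values.fetch(Regexp.last_match(1), Regexp.last_match(0))
  end
end

template = <<~HTML
  <h2>Dictionaries used to produce this glossary</h2>
  %-dict_list-%
  <h2>Definitions</h2>
  %-glossary-%
HTML

puts fill_template(template, {
  'dict_list' => '<ul><li>JargonF</li></ul>',
  'glossary'  => '<dl><dt>vi</dt><dd>...</dd></dl>'
})
```

With a different fdelim, for example '%=', the same method would look for %=dict_list=% and %=glossary=% instead.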

ERRORS and WARNINGS
html2index warns you if the output-file exists and asks you if you want to replace it with a new version.
The program also tries to determine the file-type of the input (HTML) file and issues a warning if the file is considered unsuitable.
Each time an expression cannot be found in one of the targeted dictionaries, a warning is given. All these problematic expressions are listed in a temporary file, the name of which is announced after html2index has terminated.
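The handling of expressions without an explanation could look roughly like the following Ruby sketch. The method report_missing, the wording of the warning and the file-name prefix are assumptions made for the illustration. The actual program may proceed differently.

```ruby
require 'tempfile'

# Hypothetical sketch: warns about each missing expression, lists all of
# them in a temporary file and returns the path to this file.
def report_missing(missing)
  return nil if missing.empty?

  # Without a block, Tempfile.create returns a file which is not removed
  # automatically, so it can still be read after the program has ended.
  file = Tempfile.create(['html2index_missing', '.txt'])
  missing.each do |expression|
    warn "WARNING: no explanation found for '#{expression}'"
    file.puts(expression)
  end
  file.close
  file.path
end

path = report_missing(%w[Tripane xinput])
puts "Expressions without explanation are listed in #{path}"
```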

SOURCE CODE and DEVELOPMENT
html2index is developed in Ruby and can be installed as a Ruby-Gem. As Ruby is an interpreted language, the source-code of the installed version is always accessible. You can also decompress the gem-file to take a look at the code.
AUTHOR:
Michael Uplawski <michael[dot]uplawski[at]uplawski[dot]eu>

Comments

  1. Old Comment
    I am ready to release a new version of the Html2Index gem, but want to provide updated documentation. There is a lot of work for me from next week on until the end of January. But here is at least the usage-message of the new version, as a kind of spoiler:

    Code:
    :~/prog/html2index$ bin/html2index -h
    
    	Usage: html2index -s input.html [-o output.html] [-c config-file] [-t template.html] [-d]
    
    	* Will print to stdout, if the output-file is not provided.
    	* Adapt ~/.config/HTML2Index/config to your needs.
    
        -d, --debug                      Be verbose
        -s, --source=SOURCE              Source document (html)
        -o, --out=GLOSSAR                Glossar-file (html)
        -t, --template=TEMPLATE          Template (html)
        -c, --config=CONFIG              Configuration-file
        -h, --help                       Show this message
        -v, --version                    Show program version
    Posted 10-07-2017 at 05:56 AM by Michael Uplawski
 

  


