Validate OOXML

Posted 12-11-2020 at 10:27 AM by Michael Uplawski
Updated 12-15-2020 at 12:05 AM by Michael Uplawski (moved one paragraph, cosmetics, threads linked.)




Ensure the standard-conformance of OOXML-documents created from scratch.
See also the two threads concerning this topic.
A styled version of this document: http://www.uplawski.eu/articles/Linu...ate_ooxml.html

Contents

  • Disclaimer
  • Introduction
  • Motivation
  • XML Schema validation with xmllint
    ↠ Set-up step by step
    ↠ Invoking xmllint

Disclaimer

I have invented none of this; I have neither discovered nor developed the procedures described on this page. All of the knowledge that I merely reproduce here was communicated to me by NevemTeve in a discussion on LinuxQuestions.org.

Introduction

Modern wordprocessors offer many functions which make it easier to write complex text-documents. Even so, some recurring tasks, which are performed often and in the same way within the same document, cannot be completely automated with the commands that the program offers. As documents are produced for specific purposes and the needs of individual users cannot be anticipated in every detail, software-companies integrate scripting interfaces into their office-software.
Where such a scripting interface is not present, you can still automate the generation and manipulation of office-documents.
Modern wordprocessors read from and write to compressed XML-files. Both the Microsoft® file-format OOXML (used, for example, in docx-files) and ODF are based on XML. To read and manipulate the content and formatting of such documents, you only need to edit the XML-files which you find after unzipping an ODT- or DOCX-file:
Code:
user@machine:/tmp/docx$ unzip ../rudi.docx
Archive:  ../rudi.docx
  inflating: _rels/.rels             
  inflating: docProps/core.xml       
  inflating: docProps/app.xml        
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml       
  inflating: word/styles.xml         
  inflating: word/fontTable.xml      
  inflating: word/settings.xml       
  inflating: [Content_Types].xml
Specialised web-sites explain the meaning of each of the XML-tags, all the possible XML-attributes, and the rules for their use in specific contexts. Here, I want to concentrate on OOXML only: http://officeopenxml.com/index.php
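The round trip of unpacking, editing and re-packing can be sketched in shell. This is only a minimal sketch, assuming that the utilities unzip and zip are installed; the function names docx_unpack and docx_pack, as well as all paths, are hypothetical and chosen for illustration.

```shell
#!/bin/sh
# docx_unpack and docx_pack are hypothetical helper names, not part of any tool.

# Unpack a document container into a directory for editing.
# $1: archive (e.g. rudi.docx), $2: target directory
docx_unpack() {
    mkdir -p "$2" && unzip -q -o "$1" -d "$2"
}

# Pack an edited directory tree back into a single archive.
# $1: directory with the unpacked parts, $2: archive to create
docx_pack() {
    # Resolve the target to an absolute path, because zip is run
    # from inside the source directory.
    out_dir=$(cd "$(dirname "$2")" && pwd) || return 1
    out="$out_dir/$(basename "$2")"
    rm -f "$out"
    # Running zip inside the directory keeps the stored paths relative
    # (word/document.xml, not docx/word/document.xml).
    (cd "$1" && zip -q -r "$out" .)
}
```

A call could look like `docx_unpack rudi.docx docx`, followed by your edits and `docx_pack docx rudi_new.docx`. Note that this sketch does not control the order or compression of the archive entries. ODF, for instance, expects an uncompressed mimetype entry at the start of the archive, so the sketch is aimed at DOCX-files only.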

Motivation

Writing OOXML from scratch can be complicated. As long as you only modify text-nodes, little can go wrong. But as soon as you manipulate XML-tags or introduce more tags and more complex tag-structures into your document, you must take care to obey the rules of the OOXML standard strictly. Where programmed routines are responsible for those manipulations, they can rapidly and profoundly alter the file-structures together with the actual content.
Even if all looks fine and just as you want it after opening the resulting document in your wordprocessor, other programs can run into trouble if your OOXML code is not what they expect. Interoperability, comparability and comprehensibility, however, are what standards are meant to achieve in the first place. You should, therefore, routinely validate your own OOXML-documents against the OOXML standard, to be sure that the routines which generate or modify OOXML files work reliably in all situations.
This document describes a way to validate OOXML wordprocessor files against the pertinent OOXML schemas, in order to locate and identify potential errors.

XML Schema validation with xmllint

I prefer to first present the command-line which you will execute to validate a wordprocessor-file, and to explain its components. The objective is then to ensure that the conditions for a successful execution of the command are met (read on below).
One last remark: a surprising number of file-manipulations are needed before you can validate OOXML with the procedure that I chose to present on this page. I consider this unsatisfactory and am still looking for a simplification. Note, however, that once the preparations are completed, repeated validations are as easy as launching xmllint with the few arguments included in the command shown here:
Code:
xmllint -noout -debugent -schema ooxml_xsd/wml.xsd document.xml
xmllint
xmllint is an XML-parser for many purposes. Consult the xmllint man-page for the complete description of its many options. On a Linux system, xmllint is part of libxml2.
-noout
Suppresses the output of the parsed document, so that xmllint only prints messages such as errors and warnings.
-debugent
Prints debugging information about the entities which are defined in the source-document.
-schema
The location of the initial schema-file, against which the source-document is validated.
document.xml
The XML-document which is validated. document.xml is also the main component of an OOXML wordprocessor file. This is where the textual content and the structure of the enclosing tags are found, as in this example of a file document.xml:
Code:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
mc:Ignorable="w14 wp14">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1" />
        <w:bidi w:val="0" />
        <w:spacing w:before="240" w:after="120" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Validate OOXML</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Ensure the standard-conformance of OOXML-documents
        created from scratch.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Contents</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:bookmarkStart w:id="0" w:name="intro" />
      <w:bookmarkEnd w:id="0" />
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Introduction</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Even if modern wordprocessors are enriched with many
        functions which facilitate the redaction of complex
        text-documents, some recurring tasks, which are performed
        often and in the same way within the same document, cannot
        be completely automated with the commands that the program
        offers. As documents are produced for specific purposes and
        the needs of individual users cannot be anticipated in all
        detail, software-companies integrate scripting interfaces
        to their office-software.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Where such a scripting interface is not present, you
        can still automate the generation and manipulation of
        office-documents.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Modern wordprocessors read from and write to
        compressed XML-files. The Microsoft® file-format OOXML –
        e.g in docx-files – as well as ODF base on XML. To read and
        manipulate the content and formatting of such documents you
        only need to edit the XML-files which you discover after
        unzipping an ODT- or DOCX-file.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>You can find the meaning of each of the XML-tags all
        the possible XML-attributes, as well as the rules for their
        deployment in specific contexts on specialised web-sites.
        Here, I want to concentrate on OOXML only: [ OOXML -
        reference goes here ]</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:bookmarkStart w:id="1" w:name="motivation" />
      <w:bookmarkEnd w:id="1" />
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Motivation</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Writing OOXML from scratch can be complicated. As long
        as you do only modify text-nodes, nothing can happen. But
        as soon as you manipulate XML-nodes or introduce more tags
        and complexer tag-structures to your document, you have to
        be careful to obey st</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Normal" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
      </w:r>
    </w:p>
    <w:sectPr>
      <w:type w:val="nextPage" />
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:left="1134" w:right="1134" w:header="0"
      w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0" />
      <w:pgNumType w:fmt="decimal" />
      <w:formProt w:val="false" />
      <w:textDirection w:val="lrTb" />
    </w:sectPr>
  </w:body>
</w:document>
Before you can validate anything, you must ensure that all the necessary schemas, in the form of *.xsd files, can be accessed by an XML-parser.
I will show you the steps to establish this “validating-environment”.

Set-up step by step

I. Provide the schema catalog
Ensure that the file /usr/local/etc/xml/catalog exists. If it does not, create it as root:
Code:
# mkdir -p /usr/local/etc/xml
# cat >/usr/local/etc/xml/catalog <<'DONE'
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
  <nextCatalog catalog="file:///etc/xml/catalog"/>
</catalog>
DONE
II. Provide xml.xsd
Ensure that /usr/local/etc/xml/catalog contains the line
Code:
  <uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
If it is missing, insert the line before the tag <nextCatalog>, just as it is shown above.
You must also download the actual file xml.xsd, again as root:
Code:
wget -O /usr/local/etc/xml/xml_2009_01.xsd http://www.w3.org/2009/01/xml.xsd
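Whether the catalog already contains the required entry can be checked with grep. The following is only a sketch; the function name catalog_has_xml_ns is hypothetical, and the check is purely textual, so an entry with different spacing or attribute order would not be recognised.

```shell
#!/bin/sh
# catalog_has_xml_ns is a hypothetical helper name.
# Succeeds if the catalog contains a <uri> entry for the XML namespace.
# $1: catalog file (defaults to the path used on this page)
catalog_has_xml_ns() {
    catalog=${1:-/usr/local/etc/xml/catalog}
    # -F: match a fixed string, -q: report through the exit status only
    grep -qF '<uri name="http://www.w3.org/XML/1998/namespace"' "$catalog"
}
```

It could be used like this: `catalog_has_xml_ns && echo "entry present" || echo "entry missing"`.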
III. Provide the OOXML-xsd files
The schema files can be downloaded from https://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.4/
Choose the file ooxml-schemas-1.4.jar and download it.
Unzip the file, e.g. to your temporary directory, and locate the xsd-files in the sub-directory /schemaorg_apache_xmlbeans/src. Move all the xsd-files to a directory that will be accessible later, when calling the XML-parser, e.g. a sub-directory of your working-directory:
Code:
user@machine:~/project$ mkdir ooxml_xsd
user@machine:~/project$ cd ooxml_xsd
user@machine:~/project/ooxml_xsd$ mv /tmp/schemaorg_apache_xmlbeans/src/*.xsd ./
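Download and extraction can also be combined in one step. This is a sketch under assumptions: wget and unzip are installed, the jar follows the usual Maven naming below the URL given above, and the xsd-files lie in schemaorg_apache_xmlbeans/src inside the archive, as described. The function name fetch_ooxml_xsd is hypothetical.

```shell
#!/bin/sh
# fetch_ooxml_xsd is a hypothetical helper name.
# $1: directory that receives the xsd-files (default: ooxml_xsd)
# $2: local jar file; it is downloaded only if it does not exist yet
fetch_ooxml_xsd() {
    dest=${1:-ooxml_xsd}
    jar=${2:-ooxml-schemas-1.4.jar}
    base=https://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.4
    if [ ! -f "$jar" ]; then
        # Remove a partial download, so that a later call tries again.
        wget -O "$jar" "$base/ooxml-schemas-1.4.jar" || { rm -f "$jar"; return 1; }
    fi
    mkdir -p "$dest" &&
    # -j drops the directory part, so the xsd-files land directly in $dest.
    unzip -q -o -j "$jar" 'schemaorg_apache_xmlbeans/src/*.xsd' -d "$dest"
}
```

Called without arguments in ~/project, the function would fill ~/project/ooxml_xsd without the detour through /tmp.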
IV. Complete wml.xsd
Open the schema file wml.xsd and find the tag <xsd:import> with the id xml (the fourth at the time of this writing). Complete this line with the schemaLocation attribute or replace it, so that it is identical to the following:
Code:
  <xsd:import id="xml" namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/XML/1998/namespace"/>
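If you prefer not to edit the file by hand, the change can be applied with sed. This sketch assumes that the <xsd:import> tag with the id xml stands on a single line and that id is its first attribute; the function name patch_wml_xsd is hypothetical. The original file is kept with the suffix .orig.

```shell
#!/bin/sh
# patch_wml_xsd is a hypothetical helper name.
# $1: path to wml.xsd (default: ooxml_xsd/wml.xsd)
patch_wml_xsd() {
    f=${1:-ooxml_xsd/wml.xsd}
    ns=http://www.w3.org/XML/1998/namespace
    # Keep a copy of the original file.
    cp "$f" "$f.orig" || return 1
    # Replace the complete import tag (everything up to the first ">")
    # by the line given above. "|" is the sed delimiter, because the
    # namespace contains slashes.
    sed "s|<xsd:import id=\"xml\"[^>]*>|<xsd:import id=\"xml\" namespace=\"$ns\" schemaLocation=\"$ns\"/>|" \
        "$f.orig" > "$f"
}
```

If the tag in your copy of wml.xsd is spread over several lines, the pattern will not match and the file remains unchanged.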
V. Consolidate duplicated import of the same namespace in dml-wordprocessingDrawing.xsd
Create an xsd-file dml-wordprocessingDrawing_import.xsd in the same directory as the other schema-files, with the following content:
Code:
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/drawingml/2006/main"
   elementFormDefault="qualified"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="dml-graphicalObject.xsd"/>
  <xsd:include schemaLocation="dml-documentProperties.xsd"/>
</xsd:schema>
Now open the schema file dml-wordprocessingDrawing.xsd. Replace the two <xsd:import> tags whose schemaLocation attributes name dml-graphicalObject.xsd and dml-documentProperties.xsd with one single line which imports only the newly created schema-file:
Code:
  <xsd:import schemaLocation="dml-wordprocessingDrawing_import.xsd" namespace="http://schemas.openxmlformats.org/drawingml/2006/main" />
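This replacement can be scripted, too. The sketch assumes that the two imports in question are those whose schemaLocation names one of the two files included by dml-wordprocessingDrawing_import.xsd (dml-graphicalObject.xsd and dml-documentProperties.xsd), and that each <xsd:import> tag stands on a single line. The function name consolidate_dml_imports is hypothetical.

```shell
#!/bin/sh
# consolidate_dml_imports is a hypothetical helper name.
# $1: path to dml-wordprocessingDrawing.xsd
consolidate_dml_imports() {
    f=${1:-ooxml_xsd/dml-wordprocessingDrawing.xsd}
    # Keep a copy of the original file.
    cp "$f" "$f.orig" || return 1
    awk '
        # An import of one of the two schema-files: print the new
        # import in place of the first one, drop the second one.
        /<xsd:import/ && /schemaLocation="dml-(graphicalObject|documentProperties)\.xsd"/ {
            if (!done) {
                print "  <xsd:import schemaLocation=\"dml-wordprocessingDrawing_import.xsd\" namespace=\"http://schemas.openxmlformats.org/drawingml/2006/main\" />"
                done = 1
            }
            next
        }
        # All other lines are copied unchanged.
        { print }
    ' "$f.orig" > "$f"
}
```

All other <xsd:import> tags of the file are left as they are.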

Invoking xmllint

The call to xmllint is already shown above, but before executing the command you must remember to set the environment variable XML_CATALOG_FILES to the location of the schema catalog; otherwise, the standard path /etc/xml/catalog would be read. This is an example of a successful validation with xmllint after completing the preparatory tasks listed above:
Code:
user@machine:/tmp$ export XML_CATALOG_FILES=/usr/local/etc/xml/catalog 
user@machine:/tmp$ xmllint -noout -debugent -schema ~/prog/ooxml_xsd/wml.xsd ./docx/word/document.xml 
new input from file: /prog/ooxml_xsd/wml.xsd
new input from file: /prog/ooxml_xsd/shared-customXmlSchemaProperties.xsd
new input from file: /prog/ooxml_xsd/shared-math.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing_import.xsd
new input from file: /prog/ooxml_xsd/dml-graphicalObject.xsd
new input from file: /prog/ooxml_xsd/dml-documentProperties.xsd
new input from file: /prog/ooxml_xsd/dml-baseTypes.xsd
new input from file: /prog/ooxml_xsd/shared-relationshipReference.xsd
new input from file: /prog/ooxml_xsd/dml-shapeGeometry.xsd
new input from file: /prog/xml.xsd
new input from file: docx/word/document.xml
docx/word/document.xml validates
DOCUMENT
No entities in internal subset
No entities in external subset
Now please just believe me: This is cool.
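For repeated validations, the two commands can be wrapped in a small function which reads document.xml directly from the DOCX-file. This is only a sketch: the function name validate_docx is hypothetical, the catalog path is the one used on this page, and the option -debugent is left out to keep the output short.

```shell
#!/bin/sh
# validate_docx is a hypothetical helper name.
# $1: DOCX-file, $2: directory with the prepared xsd-files
validate_docx() {
    docx=$1
    xsd_dir=${2:-ooxml_xsd}
    # Use the catalog from the set-up, unless the variable is already set.
    XML_CATALOG_FILES=${XML_CATALOG_FILES:-/usr/local/etc/xml/catalog}
    export XML_CATALOG_FILES
    # unzip -p writes the archive member to standard output;
    # the file name "-" makes xmllint read from standard input.
    unzip -p "$docx" word/document.xml |
        xmllint -noout -schema "$xsd_dir/wml.xsd" -
}
```

A call like `validate_docx rudi.docx ~/prog/ooxml_xsd` returns the exit status of xmllint, so it can be used in scripts which generate or modify documents.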