Translating XML documents with XLIFF

Some time ago I found this nice article about XML in localisation. Now I want to experiment with XLIFF a bit, because I could use it to translate an XML document I am writing.

I know that XLIFF is not the only way to translate an XML document, and that its use is not limited to XML documents either; it's just that using an XML document format to translate an XML document seems so natural!

The translation problem

Some very basic nomenclature for the translation problem (applicable to computer-assisted translation, CAT, as well):

original document
the document to translate
original format
the format of the original document
intermediate format
a document format to submit to the translator; it holds the contents of the original document but it is independent from the original format
destination document
the translated document
source language
the language of the original document
target language
the language of the destination document
translation-unit
the element of the content meant to be translated independently from the others (TU for short). For practical reasons the sentence or the paragraph is typically chosen, but in theory words and letters, the whole document, or any other element of a specific document format can be taken as the TU.
segmentation
the process of splitting the original document into translation-units
translation
more formally, the equivalence between the source language and the target language for a translation-unit
translation-memory
a database of translations

Using an intermediate format makes the document's content independent from the document's format. This way the translator does not need to care about the actual document format, and he/she can use specialized tools to ease the translation process.

An XLIFF workflow to solve the problem is basically as follows:

  1. Create a representation of the original document using XLIFF as the intermediate format; the segmentation process takes place at this stage as well. The structure of the original format can be stored in a separate skeleton file without actual content, or it can be embedded directly in the XLIFF file.
  2. Translate the content into the target language using an XLIFF editor (or even a plain text editor). Now that the content is split into TUs, it is easy to see how translating them one by one makes the contents of the original document converge to the contents of the destination document. This is an obvious concept, but I wanted to stress it.
  3. Create the destination document by converting the XLIFF representation back to the original format.
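
The three steps above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not how the XLIFF tools work: the "format" here is a made-up one where every translatable string sits inside a `<p>` element, and the names `segment`, `translate` and `merge` are hypothetical helpers invented for this sketch. The dictionary plays the role of a tiny translation-memory.

```python
import re

# Toy "original document": translatable text lives inside <p>...</p>.
ORIGINAL = "<p>Test page for XLIFF</p><p>End of page</p>"


def segment(document):
    """Step 1: split the document into a skeleton and a list of translation units."""
    units = []

    def stash(match):
        units.append(match.group(1))
        # Leave a numbered placeholder where the content used to be.
        return "<p>{%d}</p>" % (len(units) - 1)

    skeleton = re.sub(r"<p>(.*?)</p>", stash, document)
    return skeleton, units


def translate(units, memory):
    """Step 2: translate each unit on its own; unknown units are left unchanged."""
    return [memory.get(unit, unit) for unit in units]


def merge(skeleton, units):
    """Step 3: put the (translated) units back into the skeleton."""
    return re.sub(r"\{(\d+)\}", lambda m: units[int(m.group(1))], skeleton)


skeleton, units = segment(ORIGINAL)
memory = {
    "Test page for XLIFF": "Pagina di prova per XLIFF",
    "End of page": "Fine della pagina",
}
destination = merge(skeleton, translate(units, memory))
print(destination)
```

The skeleton never reaches the translator, which is the whole point of the intermediate format: only the list of units does.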

An example for XML documents

The tools we are going to use:

  • xsltproc, which should be packaged for most Unix-like systems.
  • xml2xliff.xsl and xliff2xml.xsl from xliffRoundTrip Tool
  • Transolution has a nice XLIFF editor called xliffeditor :).
    The latest version can be found in their SVN repository.

    I've also tried Virtaal, but it does not handle nested elements in XLIFF very well.

Our original document is test-xliff-en.html:

<?xml version="1.0" encoding="utf-8"?>
<!--
<!DOCTYPE
 html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
 -->
 <html xmlns="http://www.w3.org/1999/xhtml"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"
       xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <title>Test page for XLIFF</title>
    <style type="text/css" media="screen, projection">
	    .styled { font-size: .8em; color: red; font-face: sans-serif; }
    </style>
  </head>
  <body>
    <p>A paragraph with some <em>styles</em> just see how <strong>nested elements</strong> are handled.</p>
    <ul>
      <li>Item with a <a title="test link" href="#">link</a> in it.</li>
      <li>Item with an <span class="styled">CSS style</span>.</li>
    </ul>
    <p>End of page</p>
  </body>
</html>

Please note the commented-out DOCTYPE declaration: this is needed in order not to lose it, as XSLT doesn't know how to handle it properly. It has to be uncommented when we are done.

The DOCTYPE declaration is not part of the XSLT and XPath data model (this article outlines a possible solution), and neither are CDATA sections. If you need to preserve these constructs you may want to rely on some preprocessing before converting to XLIFF, and some postprocessing after converting back to the original format.
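
As a rough idea of what such pre- and postprocessing could look like for the DOCTYPE case, here is a minimal Python sketch. The function names `hide_doctype` and `restore_doctype` are made up for this example, and the regular expressions assume a simple DOCTYPE like the one above: no internal subset (no `[...]` part) and no `>` character inside the quoted identifiers.

```python
import re

# A DOCTYPE declaration without an internal subset.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>")
# A comment whose only content is such a DOCTYPE declaration.
COMMENTED_RE = re.compile(r"<!--\s*(<!DOCTYPE[^>\[]*>)\s*-->")


def hide_doctype(text):
    """Preprocessing: wrap the DOCTYPE in a comment so it survives the XSLT step."""
    if COMMENTED_RE.search(text):
        return text  # already commented out, nothing to do
    return DOCTYPE_RE.sub(lambda m: "<!--\n" + m.group(0) + "\n-->", text, count=1)


def restore_doctype(text):
    """Postprocessing: unwrap the commented-out DOCTYPE again."""
    return COMMENTED_RE.sub(lambda m: m.group(1), text, count=1)


doc = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
    ' "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    "<html/>"
)
hidden = hide_doctype(doc)
print(hidden)
print(restore_doctype(hidden) == doc)
```

CDATA sections would need a different trick, since their content has to stay visible to the stylesheet; this sketch does not cover them.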

Also, xml2xliff.xsl does not support elements with a namespace yet. That would be great to have.

Convert the document to XLIFF: xsltproc xml2xliff.xsl test-xliff-en.html > english-english.xlf

The XLIFF specification supports only bilingual translation for now. This is not a big limitation, as usually the translation is from an official language into the others.

The result of the conversion to XLIFF before the translation:

<?xml version="1.0" encoding="utf-8"?>
<!--
<!DOCTYPE
 html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
 -->
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xmrk="urn:xmarker" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2                   xliff-core-1.2-strict.xsd         urn:xmarker           xmarker.xsd" version="1.2">
  <file datatype="plaintext" source-language="en" original="html">
    <header>
      <xmrk:nest>
        <xmrk:html xmarker_idref="html-0" xsi:schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd" xml:lang="en">
          <xmrk:head xmarker_idref="head-1">
            <xmrk:meta xmarker_idref="meta-2" http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
            <xmrk:title xmarker_idref="title-3"/>
            <xmrk:style xmarker_idref="style-4" type="text/css" media="screen, projection"/>
          </xmrk:head>
          <xmrk:body xmarker_idref="body-5">
            <xmrk:p xmarker_idref="p-6">
              <xmrk:em xmarker_idref="em-7"/>
              <xmrk:strong xmarker_idref="strong-8"/>
            </xmrk:p>
            <xmrk:ul xmarker_idref="ul-9">
              <xmrk:li xmarker_idref="li-10">
                <xmrk:a xmarker_idref="a-11" title="test link" href="#"/>
              </xmrk:li>
              <xmrk:li xmarker_idref="li-12">
                <xmrk:span xmarker_idref="span-13" class="styled"/>
              </xmrk:li>
            </xmrk:ul>
            <xmrk:p xmarker_idref="p-14"/>
          </xmrk:body>
        </xmrk:html>
      </xmrk:nest>
    </header>
    <body>
      <group id="id2297750axmarkhtml-0">
        <group id="id2297759bxmarkhead-1">
          <group id="id2297761exmarkmeta-2"/>
          <trans-unit id="title-3">
            <source>Test page for XLIFF</source>
            <target>Test page for XLIFF</target>
          </trans-unit>
          <trans-unit id="style-4">
            <source>
	    .styled { font-size: .8em; color: red; font-face: sans-serif; }
    </source>
            <target>
	    .styled { font-size: .8em; color: red; font-face: sans-serif; }
    </target>
          </trans-unit>
        </group>
        <group id="id2298359bxmarkbody-5">
          <trans-unit id="p-6">
            <source>A paragraph with some <g id="em-7">styles</g> just see how <g id="strong-8">nested elements</g> are handled.</source>
            <target>A paragraph with some <g id="em-7">styles</g> just see how <g id="strong-8">nested elements</g> are handled.</target>
          </trans-unit>
          <group id="id2298372bxmarkul-9">
            <trans-unit id="li-10">
              <source>Item with a <g id="a-11">link</g> in it.</source>
              <target>Item with a <g id="a-11">link</g> in it.</target>
            </trans-unit>
            <trans-unit id="li-12">
              <source>Item with an <g id="span-13">CSS style</g>.</source>
              <target>Item with an <g id="span-13">CSS style</g>.</target>
            </trans-unit>
          </group>
          <trans-unit id="p-14">
            <source>End of page</source>
            <target>End of page</target>
          </trans-unit>
        </group>
      </group>
    </body>
  </file>
</xliff>
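
Since the converter pre-fills every target with a copy of its source, a quick way to see how much work is left is to list the units whose target is still identical to the source. The following Python sketch does that with the standard library only; `untranslated_units` and `flat_text` are names invented for this example, and the sample is a cut-down version of the file above with one unit already translated.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2, as declared on the <xliff> root element above.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file datatype="plaintext" source-language="en" original="html">
    <body>
      <group id="body-5">
        <trans-unit id="li-10">
          <source>Item with a <g id="a-11">link</g> in it.</source>
          <target>Item with a <g id="a-11">link</g> in it.</target>
        </trans-unit>
        <trans-unit id="p-14">
          <source>End of page</source>
          <target>Fine della pagina</target>
        </trans-unit>
      </group>
    </body>
  </file>
</xliff>"""


def flat_text(element):
    """Text of an element with inline markup such as <g> flattened away."""
    return "".join(element.itertext()) if element is not None else ""


def untranslated_units(xliff_text):
    """Return the ids of trans-units whose target is missing or equal to the source."""
    root = ET.fromstring(xliff_text)
    pending = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = flat_text(unit.find("x:source", XLIFF_NS))
        target = flat_text(unit.find("x:target", XLIFF_NS))
        if not target.strip() or target == source:
            pending.append(unit.get("id"))
    return pending


print(untranslated_units(SAMPLE))
```

Note that a unit such as style-4, which holds CSS and is legitimately left unchanged, would always show up in this list; the check is a hint, not a verdict.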

Translate the content using xliffeditor and save it to english-italian.xlf.

[Screenshot of xliffeditor, from its author]

Convert back to the original XML format: xsltproc xliff2xml.xsl english-italian.xlf > test-xliff-it.html

Our destination document:

<?xml version="1.0" encoding="utf-8"?>
<!--
<!DOCTYPE
 html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
 -->
<html xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en" xsi:schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd">
  <head>
    <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-Type"/>
    <title>Pagina di prova per XLIFF</title>
    <style media="screen, projection" type="text/css">
	    .styled { font-size: .8em; color: red; font-face: sans-serif; }
    </style>
  </head>
  <body>
    <p>Un paragrafo con degli <em>stili</em> solo per vedere come sono gestiti gli <strong>elementi innestati</strong>.</p>
    <ul>
      <li>Voce contenente un <a href="#" title="test link">collegamento</a>.</li>
      <li>Voce con uno <span class="styled">stile CSS</span>.</li>
    </ul>
    <p>Fine della pagina</p>
  </body>
</html>

The only thing still wrong is the value of the xml:lang attribute, which still says "en" instead of "it", but this can be handled in a postprocessing script as well.
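
A possible postprocessing step for the xml:lang attribute, again as a small Python sketch. The function name `set_root_lang` is made up, and the regular expression assumes that the attribute is written with double quotes (as xsltproc does in the output above) and that its first occurrence in the file is the one on the root element.

```python
import re


def set_root_lang(text, lang):
    """Rewrite the first xml:lang attribute found in the document text."""
    return re.sub(
        r'xml:lang="[^"]*"',
        lambda m: 'xml:lang="%s"' % lang,
        text,
        count=1,  # only the first occurrence, i.e. the root element here
    )


src = (
    '<html xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">'
    '<p xml:lang="en">x</p></html>'
)
out = set_root_lang(src, "it")
print(out)
```

Working on the text rather than on a parsed tree keeps the commented-out DOCTYPE in place, so it can still be uncommented afterwards.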


Comments

Bless wrote:

You could try this translation tool to help you out with XLIFF files: https://poeditor.com. It has a really easy-to-use interface.

An wrote:

Hi,
I came across your post while searching for information about the XLIFF file format.
A website translation tool delivers website content in this format, which is very useful for me as a translator. But I am trying to find out whether it is possible to convert this format to a Word document. If not, would it be useful to send the translated website content to the customer in XLIFF format, so that they can easily use it? Or does it need to be converted to XML before the translated website can be published?
Maybe you could help? Thank you.

ao2 wrote:

Hi An,

XLIFF is just a format to represent the original document split in translation units.

You can go to and from the original format in different ways: going XML -> XLIFF -> XML is easier because XSLT can be used; for other formats you have to find ad-hoc solutions.

Look specifically for a Word to XLIFF converter, or maybe use the Open Document Format (e.g. ODT) in LibreOffice and rely on odf2xliff: http://translate-toolkit.readthedocs.org/en/latest/commands/odf2xliff.html

Ciao, Antonio
