
How to Convert HTML to XML with Adobe InDesign

Posted by chco, Oct 22 2010 03:23 PM

In XSLT, changing simple XML or HTML to a more complex XML form is "upcasting," while changing XML to a simpler DTD is "downcasting." The following excerpt from XML Publishing with Adobe InDesign shows you how to accomplish this with Adobe InDesign.
It may be that you need to convert HTML content into some form of XML. The simplest way to do this is to save your HTML as XHTML, a form of HTML that conforms to XML rules. Once it has been cleaned up as XHTML, you need to change the file extension to .xml before you import it into InDesign.

Note: You can use an open-source utility called HTML Tidy to make valid XHTML from your web page content. It is also bundled with Adobe Dreamweaver and some other applications that you may be using. Check for the ability to save web content as XHTML or HTML 4.0, also known as "strict" HTML.

It is possible to import the XHTML file without any transformation, but we want to modify the incoming XML somewhat to make it more compatible with our InDesign layout.

We will model the XHTML elements that correspond to the InDesign tagging concepts. We don't necessarily need the <head> element, because it is used only for the HTML title bar and metadata, not for anything that we will print. So we can make a structure like this in InDesign:

[Figure: Placeholder for XHTML import]


Remember that you can use XSLT to change the order of elements (such as sorting alphabetically), but this is best done as a preprocessing step before importing XML, rather than using the Apply XSLT option.
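
As a minimal sketch of such a preprocessing step, the stylesheet below copies a document unchanged except that the <li> children of each <ul> are written out in alphabetical order. It is an illustration rather than part of the book's example: the file name sortLists.xsl is hypothetical, and it assumes the elements are in no namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sortLists.xsl (hypothetical name): sort list items before importing into InDesign -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8"/>
    <!-- identity template: copy everything unchanged by default -->
    <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
    <!-- for unordered lists, emit the li children sorted by their text -->
    <xsl:template match="ul">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="li">
                <xsl:sort select="normalize-space(.)" data-type="text" order="ascending"/>
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Run it with any XSLT 1.0 processor (a command-line tool such as xsltproc is one option) and import the sorted result, so that the Apply XSLT option is left free for the structural changes.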

Note: To make the import process format imported XML text automatically, create paragraph styles that match the names of the XHTML elements (h1, p, li, etc.) and use Map Tags to Styles with the Map By Name box checked to apply the paragraph styles to the placeholder elements before you import the XML.

To make the XHTML <table> element, which is always lowercase, match InDesign's internal <Table> tag, which is always capitalized, we will use the Tagging Preset Options dialog on the Tags panel. We also want the <td> element to map to InDesign's <Cell> tag.

[Figure: The Tagging Preset Options of the Tags panel, setting Tables and Table Cells to <table> and <td> elements]


In my tests, I used a simple XHTML file that looked like this:

XHTMLsample.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- <?xml-stylesheet type="text/xsl" href="xmlizeXHTML.xslt"?> -->
<html>
    <head>
        <title>XHTML example</title>
    </head>
    <body>
        <h1>An example of XHTML</h1>
        <p>Some general rules for XHTML are</p>
        <ul>
            <li>Every start tag must have a matching end tag</li>
            <li>All tag pairs must end without crossing over
 other end tags (to create properly nested structures)</li>
            <li>Tag names cannot start with a number, and they
 cannot include any spaces, or "illegal" characters, such as  ? and /,
 which can be confused with parts of the markup and processing instructions.</li>
        </ul>
        <table border="1">
            <tbody>
                <tr><th>a table header (th)</th></tr>
                <tr><td>a table cell (td)</td></tr>
            </tbody>
        </table>
    </body>
</html>


The XSLT that we will use to simplify the XHTML looks like this (but this example is not developed to handle all possible XHTML elements):

xmlizeXHTML.xsl

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <!-- start processing at the body -->
    <xsl:template match="/">
        <xsl:apply-templates select="html/body"/>
    </xsl:template>
    <xsl:template match="html/body">
        <xsl:element name="body"><xsl:apply-templates/></xsl:element>
    </xsl:template>
    <!-- copy some elements directly -->
    <xsl:template match="h1|h2|h3|h4|h5|h6|p|ul|ol">
        <xsl:copy-of select="."/>
    </xsl:template>
    <!-- simplify the table structure to what InDesign uses, no tbody or tr elements needed -->
    <xsl:template match="table">
        <xsl:element name="Table"><xsl:apply-templates/></xsl:element>
    </xsl:template>
    <xsl:template match="tbody"><xsl:apply-templates select="tr"/></xsl:template>
    <xsl:template match="tr"><xsl:apply-templates select="th|td"/></xsl:template>
    <!-- th becomes Cell here; td is copied as-is and mapped to Cell by the tagging preset -->
    <xsl:template match="th">
        <xsl:element name="Cell"><xsl:apply-templates/></xsl:element>
    </xsl:template>
    <xsl:template match="td"><xsl:copy-of select="."/></xsl:template>
    <!-- XHTML img elements carry the file path in src, not href -->
    <xsl:template match="img">
        <xsl:element name="Image">
            <xsl:attribute name="href"><xsl:value-of select="@src"/></xsl:attribute>
        </xsl:element>
    </xsl:template>
    <!-- exclude the head tag content -->
    <xsl:template match="html/head"/>
</xsl:stylesheet>
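
One caveat worth sketching: in XSLT 1.0, match patterns such as html/body and table only match elements that are in no namespace, as in the sample file. XHTML produced by a cleanup tool often declares xmlns="http://www.w3.org/1999/xhtml" on the <html> element, and those elements would then not match. A minimal sketch of a preprocessing stylesheet that strips namespaces follows. It assumes you run it before xmlizeXHTML.xsl, and the file name stripNamespaces.xsl is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- stripNamespaces.xsl (hypothetical name): re-create every element without its namespace -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8"/>
    <!-- rebuild each element under its local name only, keeping its attributes -->
    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates/>
        </xsl:element>
    </xsl:template>
    <!-- text is copied by the built-in rule; comments and processing instructions are dropped -->
</xsl:stylesheet>
```

If your XHTML has no xmlns declaration, as in XHTMLsample.xml, this step is unnecessary.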


We can import the XML and apply the XSLT to it as it comes in. This will strip off the unnecessary <head> element and simplify the <table> by removing the <tbody> and <tr> tags. Select the <body> placeholder element in the structure view and use File>Import XML, then select the XHTML file that you saved with a .xml extension.
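
To make the effect of the transformation concrete, this is the element structure that the stylesheet should produce from XHTMLsample.xml before InDesign applies its import options. It is worked out by hand from the templates rather than captured from InDesign, and the whitespace and indentation will differ from one XSLT processor to another.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<body>
    <h1>An example of XHTML</h1>
    <p>Some general rules for XHTML are</p>
    <ul>
        <li>Every start tag must have a matching end tag</li>
        <li>All tag pairs must end without crossing over
 other end tags (to create properly nested structures)</li>
        <li>Tag names cannot start with a number, and they
 cannot include any spaces, or "illegal" characters, such as  ? and /,
 which can be confused with parts of the markup and processing instructions.</li>
    </ul>
    <Table>
        <Cell>a table header (th)</Cell>
        <td>a table cell (td)</td>
    </Table>
</body>
```

The <th> element arrives as <Cell>, while <td> keeps its name and relies on the tagging preset to be treated as a cell. The border attribute of <table> is not carried over, because the table template creates a new <Table> element without copying attributes.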

Note: The import operation crashed sometimes when I selected an <html> element in the placeholder as the element to import into. Importing worked OK when I selected the <body> tag as the location to import the XML. This may be because we are creating an XML file that uses the <body> element as its root. At any rate, be forewarned that importing and applying XSLT can be fraught with peril—save a copy of the file with placeholders or make it an InDesign template before you start applying XSLT while importing XML.

The settings for the XML Import Options dialog will be:

  • Apply XSLT

  • Clone repeating text elements (for the ul/li structure)

  • Only import elements that match existing structure (it is important to check this)

  • Do not import contents of whitespace-only elements

[Figure: Settings for the XML Import Options dialog for the XHTML-as-xml import]


The results, after some tinkering with the Paragraph styles:

[Figure: XHTML imported into InDesign and formatted with matching paragraph and character styles]


Upcasting from HTML to XML for InDesign Import

You can extend this concept to all the tags in the official XHTML DTD if you wish. Generally, you would want to use XSLT:

  • To remove unnecessary structure that InDesign doesn't use (like the <head>, <tbody> and <tr> elements)

  • To wrap elements that you want to have treated as repeating blocks (like <ul> or <ol> elements that contain <li> elements)

  • To change names of elements to match InDesign's built-in names (such as Table and Cell)
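
The three uses above can be sketched in one standalone stylesheet. This is an illustrative variation rather than the book's stylesheet, and it has not been tested against InDesign's import. It assumes elements in no namespace, it also handles <thead> and <tfoot>, and it renames <td> to <Cell> in the XSLT instead of relying on the tagging preset.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <!-- drop whitespace-only text between block-level elements -->
    <xsl:strip-space elements="html body ul ol table thead tbody tfoot tr"/>

    <!-- 1. Remove structure: start at body, so head is never visited -->
    <xsl:template match="/">
        <xsl:apply-templates select="html/body"/>
    </xsl:template>
    <xsl:template match="body">
        <body><xsl:apply-templates/></body>
    </xsl:template>
    <!-- row groups and rows vanish; only their cells are processed -->
    <xsl:template match="thead|tbody|tfoot|tr">
        <xsl:apply-templates/>
    </xsl:template>

    <!-- 2. Wrap: keep each list as a container that holds only its items -->
    <xsl:template match="ul|ol">
        <xsl:copy><xsl:copy-of select="li"/></xsl:copy>
    </xsl:template>

    <!-- 3. Rename: use InDesign's built-in names -->
    <xsl:template match="table">
        <Table><xsl:apply-templates/></Table>
    </xsl:template>
    <xsl:template match="th|td">
        <Cell><xsl:apply-templates/></Cell>
    </xsl:template>

    <!-- headings and paragraphs pass through unchanged -->
    <xsl:template match="h1|h2|h3|h4|h5|h6|p">
        <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
```

Elements without a template of their own, such as inline markup inside a table cell, fall through to XSLT's built-in rules, which keep their text and drop their tags. Add templates for them as your content requires.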


Learn more about this topic from XML Publishing with Adobe InDesign. 

From Adobe InDesign CS2 to InDesign CS5, the ability to work with XML content has been built into every version of InDesign. Some of the useful applications are importing database content into InDesign to create catalog pages, exporting XML that will be useful for subsequent publishing processes, and building chunks of content that can be reused in multiple publications.

In this Short Cut, we’ll play with the contents of a college course catalog and see how we can use XML for course descriptions, tables, and other content. Underlying principles of XML structure, DTDs, and the InDesign namespace will help you develop your own XML processes. We’ll touch briefly on using InDesign to “skin” XML content, exporting as XHTML, InCopy, and the IDML package. The Advanced Topics section gives tips on using XSLT to manipulate XML in conjunction with InDesign.
