The XML Content Management System

For Document Centric XML

Jim Tivy - Bluestream

Nenad Furtula - Bluestream

Arron Ferguson

Abstract

Content management systems (CMS) are pervasive in today's public and private organizations, where they are used to author and publish document-centric content of many different types. A large number of content management technologies are available, ranging from free and open source solutions all the way to enterprise CMS products that can cost several hundred thousand dollars per license [CMSWatch05]. With the continued growth and acceptance of XML document formats, and the further evolution and availability of XML tools, a content management system oriented to XML is fast becoming an important tool. This paper explores some of the features of such an XML Content Management System (XML CMS) and some of the technologies that naturally enhance its power.

Why XML CMS

The main reason to have an XML CMS is that you have XML content. In this paper we focus on document-centric XML content, meaning XML content authored by humans for humans to read. The other category of XML formats, data-centric XML, is not covered here. The first step in deciding whether you need an XML CMS is therefore to decide why you would have XML content. Reasons to have XML content are:

XML's Data Model is for Documents
The tree-like structure of XML works well for documents. For example, a tree easily models sections within sections, a common structure of documents. As well, an XML tree maintains the order of its content, and order is important to human-readable documents. This tree-like structure underpins the XML data model, which is formally defined in the XML Information Set specification and its definition of information items. In fact, XML goes beyond a simple tree representation by allowing text to be interspersed with markup, followed by yet more text. This mixed content model is also necessary to represent documents.
XML is an open Set of Standards
XML is a set of open standards. Open standards allow many vendors to create tools that support the standard, and they encourage information workers to make the investment to learn it. The result is a diversity of readily available tools for accessing and manipulating XML content. Open standards compare well with proprietary solutions, where it is more difficult to access parts of your document, more difficult to re-purpose your documents, and you are often locked into a single vendor for document processing tools.
XML is extensible
XML is an extensible standard (the X in XML). It is fundamentally extensible in that XML is a set of rules about how trees of information items are constructed, but it does not dictate the structure or tag names of any particular kind of tree. This leaves domain experts such as E-Learning information architects and technical writers free to define which information items are important to them and to structure these items as they see fit. That said, many organizations find this open-endedness daunting: an organization-specific document information model is hard to design and to evolve. As well, many of the same structures, such as tables, paragraphs and lists, occur in many kinds of documents, so each organization would have to establish them from scratch, independently of other organizations. For these reasons, a standard information model developed outside the organization is most often a better choice. For instance, standard information models (or XML schemas, as they are often called) such as XHTML and DITA (Darwin Information Typing Architecture) define a core set of information items but allow extension from that core set. With the XML namespace mechanism, the schema extension mechanism, or both, standard schemas can be customized for special purposes within an organization.
XML easily supports Metadata
Another example of the extensibility of XML is that document-centric XML can be annotated with data-centric items such as date, author, publisher or factoryName. This extra data, often called metadata, makes searching for and categorizing documents easier.
XML is becoming all pervasive
Most new document formats are XML, and older formats either already have an XML form, such as Open Document for Open Office, or are gaining one: in the case of Microsoft, the binary .doc format will have an XML counterpart called .docx. Aside from a broad range of enterprise-specific XML formats used by large companies such as Boeing and Nortel, an array of standard XML formats has been defined, such as:

What Is An XML CMS

The XML CMS unleashes the value of having your content in XML. An XML CMS allows you to collect on the value proposition and promises of XML content. Below is a description of selected XML CMS features and use cases.

Fundamentally a CMS is used to satisfy two broad use cases:

  1. Authoring
  2. Publishing

Within these two broad cases there are a number of sub use cases and variations, described in the sections that follow.

Authoring

Authoring in a CMS has the following use cases:

Creating new Content
Authors will create new content for a publication.
Editing existing Content
Authors need to be able to call up old content and change large or small parts of it.
Team Authoring
Authors need to participate as part of a larger team to accomplish authoring of large publications in a collaborative fashion.
Navigation Link Authoring
Authors need to easily insert navigation links into their documents.
Document Component Authoring
Organizations will use a number of document component strategies to re-use content or for team authoring of content parts. An authoring environment must provide for working with content components.
Managing & Maintaining Existing Content
In some organizations many people author a multitude of documents. As the number of documents grows, the ability to search for content becomes more and more useful. The search may be performed by a new author or by an existing member of the authoring team. Authors typically search for existing content for a number of reasons, ranging from content re-use to a need to see what has been written on a given subject. As well as searching for documents, authors will expect a shared folder structure for content. This shared folder structure is much like a local disk folder tree; however, it is visible and shared across the organization according to user permissions.

One of the most important tools for authors is the WYSIWYG (What You See Is What You Get) XML editor. Users expect this because they have used Microsoft Word, WordPerfect or Open Office in the past. Below is a table of example WYSIWYG editors. An editor may be configurable and support many XML vocabularies, or it may support just one vocabulary. Its environment may be either the desktop or the web browser.

Table of WYSIWYG XML Editors

| Editor | Vocabulary | Environment | Description |
|---|---|---|---|
| Amaya | W3C XHTML | desktop | Amaya is a free XHTML editor, though it is still in early versions. |
| Open Office | OASIS Open Document 1.0 | desktop | Open Office is a free editor for the Open Document standard. |
| XMetal | many | desktop | XMetal is a Windows-based editor and is configurable for any XML vocabulary. |
| XMetal ActiveX | many | browser | XMetal ActiveX is an ActiveX control that is based on the core XMetal engine. |
| Epic | many | desktop | A Windows-based configurable editor. |
| JXHTML | W3C XHTML | browser | Runs in any browser. |

Publishing

There is not much use in authoring if there is no ability to publish content. Publishing is the process of preparing the publication for distribution to readers. The publish process may range in complexity from the simple act of exposing the author's XHTML on the web to a more complicated publishing pipeline. For example, a print publish pipeline could have the following steps:

assemble
Assemble the content using a map or aggregation across multiple XML files.
format
This is one or more steps that format the XML into a print-aware XML format such as formatting objects as defined in XSL/FO.
generate printable
Generate a binary format that can be printed. A common choice is PDF, which can be printed from a viewer such as Adobe Acrobat Reader.
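
The first two steps of such a pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the map and topic vocabulary (map, topicref, topic, title, p) and the file names are hypothetical, and the formatting objects produced are deliberately minimal (no page masters or page sequences), so they are not yet a complete XSL/FO document. The final step, generating PDF, would be handed to an FO processor and is not shown.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"

# Hypothetical topic store: file name -> XML source.
# In a CMS these would come from the repository.
TOPICS = {
    "intro.xml": "<topic><title>Introduction</title><p>Why XML.</p></topic>",
    "usage.xml": "<topic><title>Usage</title><p>How to publish.</p></topic>",
}

# Hypothetical map listing the topics in reading order.
MAP = '<map><topicref href="intro.xml"/><topicref href="usage.xml"/></map>'


def assemble(map_xml, topics):
    """Step 1: aggregate the topics named by the map into one document."""
    book = ET.Element("book")
    for ref in ET.fromstring(map_xml).iter("topicref"):
        book.append(ET.fromstring(topics[ref.get("href")]))
    return book


def format_to_fo(book):
    """Step 2: turn the assembled document into a much simplified FO tree."""
    ET.register_namespace("fo", FO_NS)
    root = ET.Element(f"{{{FO_NS}}}root")
    flow = ET.SubElement(root, f"{{{FO_NS}}}flow")
    for topic in book.iter("topic"):
        for child in topic:
            block = ET.SubElement(flow, f"{{{FO_NS}}}block")
            if child.tag == "title":
                block.set("font-weight", "bold")  # titles print in bold
            block.text = child.text
    return root


book = assemble(MAP, TOPICS)
fo_tree = format_to_fo(book)
fo_text = ET.tostring(fo_tree, encoding="unicode")
print(fo_text)
```

Keeping each step as a separate function mirrors the pipeline idea: any one step can be replaced, for example by an XSLT stylesheet, without touching the others.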

The nature of the pipeline and the publish process will vary greatly depending on the final published form. Possible final forms and their publish processes are:

publishing to the web.
Publishing to the web can be as simple as going to the website, adding a new page, authoring the page in XHTML, and saving it for immediate display. Or, publishing to the web could involve exporting from a component-based XML CMS to SCORM and loading the SCORM package into a separate Learning Management System. If the delivery is to the web or some other non-paper form, then the ability of website users to search for content is a needed feature of the CMS. In a web environment the CMS often provides many of the web features. In fact, in web portals it is difficult to say where the CMS ends and the portal begins.
publishing printed matter.
To print you will likely use a pipeline. This publish process may follow the example pipeline shown above with steps: assemble, format, generate.
publishing to a package.
Standards like SCORM define a package format for educational courses. This package is a zip file composed of text and binary content, all described by a package manifest. Many times it is necessary to export a subset of the content in the CMS into an external zip package such as a SCORM package. Another export format would be a software help package.
publishing to the file system.
Publishing to the file system may be necessary, for instance, if you ship a file directory structure of documentation with your software product. If your directory structure contains XHTML, however, it may be necessary to re-map links of the file directory if they do not match the link structure of the CMS.
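
As a rough illustration of publishing to a package, the sketch below writes a few content files and a manifest into a zip archive. The file name imsmanifest.xml is the one SCORM packages use for their manifest; everything else is an assumption made for the example. In particular, the content files are hypothetical and the manifest written here is a heavily simplified stand-in, not a valid SCORM manifest.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical content exported from the CMS: path inside the package -> bytes.
CONTENT = {
    "lesson1.html": b"<html><body><p>Lesson one</p></body></html>",
    "images/logo.png": b"placeholder bytes standing in for an image",
}


def build_manifest(paths):
    """Describe every packaged file in a simplified manifest document."""
    manifest = ET.Element("manifest", identifier="example-course")
    resources = ET.SubElement(manifest, "resources")
    for path in sorted(paths):
        ET.SubElement(resources, "file", href=path)
    return ET.tostring(manifest, encoding="utf-8")


def build_package(content):
    """Write the content and its manifest into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("imsmanifest.xml", build_manifest(content))
        for path, data in content.items():
            package.writestr(path, data)
    return buffer.getvalue()


package_bytes = build_package(CONTENT)

# Read the package back to confirm what it contains.
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    names = package.namelist()
    manifest_root = ET.fromstring(package.read("imsmanifest.xml"))
print(names)
```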

How XML Technologies Are Used

Some will argue that the underlying technology of a CMS is not important and that what matters are its features. This statement is both true and false. Of course the features are important; without them certain use cases cannot be satisfied. But without a sound technical foundation, any computer software will be prone to cost overruns, bugs and maintenance problems. In fact, in many cases features are not possible in a system without a sound technical foundation. Poor technology eventually 'leaks' through to the user. For this reason, it is always prudent to ensure that the technology 'under the hood' has a sound footing. Fortunately, XML is a family of sound technologies based on a set of consistent, interlinked standards and tools. In an XML CMS these standards and tools can be exploited to deliver software that is not only full featured but also robust and maintainable. The key XML technologies of a CMS are:

Fundamental XML
Fundamental XML includes XML 1.1, XML Namespaces and XML Schema.
XML Database
A CMS requires a database. An XML database lends special value to an XML CMS.
XSLT
XSLT is a rule based transformation language. An XSLT processor executes programs written in the rule based XSLT language.
XSL/FO
XSL/FO is a formatting objects language that addresses the layout of content onto pages.

Fundamental XML

Fundamental XML refers to:

XML Data Model
XML's data model (i.e. the InfoSet) is well suited to document-centric data.
XML as Text
The fact that XML has a text representation makes it more accessible. For example, if your WYSIWYG XML editor fails to support an XML feature, you can usually open any text editor and make your changes. XML is also more than mere text: it supports character encodings that make it well suited to storing documents in several languages.
XML Namespaces
With the XML namespaces mechanism you can compose documents from many different namespaces and make use of specialized XML vocabularies within a compound document. For example, you may wish to insert a math formula for the square root of x into your document. Note the default http://www.w3.org/1998/Math/MathML namespace.
<math xmlns="http://www.w3.org/1998/Math/MathML">
   <msqrt>
      <mi>x</mi>
   </msqrt>
</math>

The challenge, however, becomes how to display this compound document in your WYSIWYG editor or in your end published form, be it XHTML or PDF. Amaya and XMetal have editor support for MathML. Strategies for browser support range from generating SVG to generating an image on the fly.

XML Schema
XML Schema is a fairly large XML specification designed to replace the somewhat limited Document Type Definition (DTD) technology that is part of the XML specification.
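
To make the namespace and mixed content ideas above a little more concrete, here is a small Python sketch that reads a compound document with the standard library. The sample paragraph is an assumption made for the example: an XHTML p element that embeds the MathML formula shown earlier. The prefixes h and m exist only inside this script; what identifies each vocabulary is the namespace URI.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Script-local prefixes used only for lookups below.
NS = {"h": XHTML, "m": MATHML}

# Hypothetical compound document: XHTML text with an embedded MathML formula.
DOC = (
    f'<p xmlns="{XHTML}">The root is '
    f'<math xmlns="{MATHML}"><msqrt><mi>x</mi></msqrt></math>'
    " for positive x.</p>"
)

p = ET.fromstring(DOC)

# Namespace-qualified lookup: only elements from the MathML vocabulary match.
math = p.find("m:math", NS)
identifier = math.find("m:msqrt/m:mi", NS)

# Mixed content: text before the formula, the formula, then more text.
before = p.text
after = math.tail
print(repr(before), identifier.text, repr(after))
```

The text before the embedded element and the text after it are both preserved, in order, which is exactly what the mixed content model requires of a document format.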

XML Database

Traditionally, content management systems have a relational database for data storage.

Figure 1. Classic CMS Architecture with RDBMS for Data Storage.

With an XML content management system, the most sensible data store is an XML DBMS (shown in Figure 2).

Figure 2. CMS Architecture using XML DBMS for Data Storage.

Relational systems are very poor at storing the individual information items of document-centric data. They typically store an entire document as one blob, which makes the sub-items of the document unavailable to the DBMS. This blob treatment of XML makes it impossible to perform full text searches optimally and flexibly without additional software. As well, with a blob layout it is not possible to run queries that generate content derived from the documents in the database, such as a table of contents, lists of titles, lists of abstracts or other manipulations of the content. An XML DBMS, on the other hand, stores and retrieves XML documents as fully accessible trees of information items. An XML DBMS will most likely comply with the XML query standard, XQuery, which allows you to query any item within a document or combine items from multiple documents. As well, an XML DBMS supports security, scheduled backups, transactions, recovery, binary storage and other expected DBMS features. A more extensive discussion of the native XML DBMS is in Chapter 8 of XQuery From the Experts.
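
The kind of derived content mentioned above, a table of contents built from several stored documents, can be sketched as follows. In an XML DBMS this would be expressed as an XQuery query run by the database; the Python below is only a stand-in that shows what becomes possible once the titles are accessible as items in a tree rather than hidden inside a blob. The document names and the doc, section and title vocabulary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored documents; an XML DBMS would hold these as trees.
DOCUMENTS = {
    "guide.xml": (
        "<doc><title>User Guide</title>"
        "<section><title>Install</title></section>"
        "<section><title>Configure</title></section></doc>"
    ),
    "faq.xml": (
        "<doc><title>FAQ</title>"
        "<section><title>Licensing</title></section></doc>"
    ),
}


def table_of_contents(documents):
    """Build a <toc> element from the titles found in every stored document."""
    toc = ET.Element("toc")
    for name, source in sorted(documents.items()):
        doc = ET.fromstring(source)
        entry = ET.SubElement(toc, "entry", src=name, title=doc.findtext("title"))
        # Each section title becomes one item under its document's entry.
        for section_title in doc.findall("section/title"):
            ET.SubElement(entry, "item").text = section_title.text
    return toc


toc = table_of_contents(DOCUMENTS)
print(ET.tostring(toc, encoding="unicode"))
```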

XSLT

XSLT plays a large role in transforming the stored XML into a published form, especially when the published form is different from the stored form. By "stored form" we mean the form or vocabulary you have chosen for your XML content. XSLT is especially useful for transforming stored XML into XHTML or HTML. For instance, if your stored XML vocabulary is DITA or Open Document, then you can use XSLT to produce an XHTML form for display in any browser. In fact, the DITA toolkit includes transformations from DITA to HTML and from DITA to PDF. PDF is a very common print format, since Adobe offers free printer rendering support for PDF as well as a free browser plugin for viewing it. To get to a page-based layout like PDF, publishers often use the XML format called formatting objects (part 2 of the XSL specification). This allows XSLT, and thus XML technologies, to be used for the lion's share of the transformation, leaving the final step from formatting objects to PDF as a simpler one.
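
The rule-based style of transformation can be imitated in a few lines of Python. To be clear, this is not XSLT and uses no XSLT processor; it is a sketch in which a lookup table plays the role of template rules and a recursive function plays the role of applying templates to child nodes. The stored vocabulary (topic, title, para, term) and its mapping to XHTML element names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from stored-vocabulary tags to XHTML tags,
# playing the role of XSLT template rules.
RULES = {
    "topic": "div",
    "title": "h1",
    "para": "p",
    "term": "em",
}


def transform(node):
    """Apply the matching rule to a node, then recurse into its children."""
    out = ET.Element(RULES.get(node.tag, "span"))  # fallback rule: span
    out.text = node.text
    for child in node:
        new_child = transform(child)
        new_child.tail = child.tail  # keep text that follows inline markup
        out.append(new_child)
    return out


STORED = (
    "<topic><title>Backups</title>"
    "<para>Run a <term>full</term> backup weekly.</para></topic>"
)

xhtml = ET.tostring(transform(ET.fromstring(STORED)), encoding="unicode")
print(xhtml)
```

A real stylesheet expresses the same idea declaratively, with one template per kind of element, which is why XSLT suits vocabulary-to-vocabulary conversions of this sort.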

Conclusion

Organizations that already have document-centric XML will want a CMS that is XML savvy: an XML CMS. For organizations considering a switch to XML content, an XML CMS will help with the authoring and publishing of their documents. Once you have an XML CMS with XML content, a range of XML technologies offers numerous options for publishing, styling, searching for and otherwise manipulating your content. With an XML CMS, XML content and XML technologies, you can realize the full value of your document content.

References: