XML Technologies

By Mary Brady, Carmelo Montanez-Rivera, Richard Rivello, and Lisa Carnahan
Software Diagnostics and Conformance Testing Division
Information Technology Laboratory
National Institute of Standards and Technology

Introduction

An Internet language called the Extensible Markup Language (XML) is rapidly becoming one of the most popular languages in the world. XML is being incorporated into many Internet Web pages and applications; it is particularly useful for those involving structured information exchanges, such as electronic commerce (EC). It describes information in a way that allows computers to exchange that information and act on it automatically, which can speed up the automation of certain processes. Surrounding XML is a family of technologies that either augment the language or build upon it. It is sometimes difficult to sift through the various XML technologies and determine which ones best fit the needs of an organization.

This ITL Bulletin introduces the key technologies needed to incorporate XML in your architecture, discusses programmatic methods for manipulating and displaying XML data, and identifies ITL resources that can be used in evaluating viable XML-related solutions.

Extensible Markup Language

The Extensible Markup Language provides a standards-based approach to universal methods for defining and exchanging data. XML has its roots in SGML and was originally designed for use in large-scale electronic publishing. Its use as a display-independent file format has grown far beyond the publishing arena – so much so that it is commonly referred to as the ASCII of the 21st century. In addition, XML has blossomed as a data interchange format for the World Wide Web. To illustrate its usefulness, imagine that your company sells products online and maintains the following customer information.

<customer percent-discount="5">
<name>Richard Kimball</name>
<address>
<street>100 Hawthorne Dr</street>
<city>Any City</city>
<state>SC</state>
<zip>123456</zip>
</address>
</customer>

Although XML is meant to be machine processed, the information within an XML file is still easy to understand, and quick changes can be made with a simple editor when necessary; for example, the percent-discount might be changed from 5 to 10 in order to beat your competitor’s price. Like HTML, XML markup consists of matching start and end tags that delimit elements. Name/value pairs (such as percent-discount="5"), or attributes, can be associated with a given element. XML processors are much stricter than their HTML counterparts and will reject any document that is not well formed. In particular, XML elements must have matching start and end tags and be properly nested, and all attribute values must be enclosed in quotes. A document type definition (DTD), which can further define constraints for the elements and associated attributes of an XML file, can optionally be associated with an XML document. If a DTD exists and the corresponding XML document adheres to all of the DTD constraints, then the document is said to be valid. For the above example, one might define a DTD that indicates that the percent-discount must be either “0,” “5,” or “10.” All other values would be invalid.
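The difference between a well-formed document and a valid one can be sketched in a few lines of Python. This is a minimal illustration, not a conformance tool: the standard-library parser used here checks well-formedness only and does not validate against a DTD, so the set ALLOWED_DISCOUNTS and the helper names are assumptions that stand in for the DTD rule described above.

```python
import xml.etree.ElementTree as ET

CUSTOMER = """<customer percent-discount="5">
  <name>Richard Kimball</name>
  <address>
    <street>100 Hawthorne Dr</street>
    <city>Any City</city>
    <state>SC</state>
    <zip>123456</zip>
  </address>
</customer>"""

# Hypothetical stand-in for a DTD declaration such as:
#   <!ATTLIST customer percent-discount (0|5|10) #REQUIRED>
ALLOWED_DISCOUNTS = {"0", "5", "10"}


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def discount_is_allowed(text):
    """Hand-rolled check that imitates the DTD constraint on percent-discount."""
    root = ET.fromstring(text)
    return root.get("percent-discount") in ALLOWED_DISCOUNTS


def set_discount(text, value):
    """Return a copy of the document with percent-discount changed."""
    root = ET.fromstring(text)
    root.set("percent-discount", str(value))
    return ET.tostring(root, encoding="unicode")
```

With these helpers, an unquoted attribute value or improperly nested tags are rejected by is_well_formed, while a discount of 7 is well formed but fails discount_is_allowed.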

The World Wide Web Consortium (W3C) XML 1.0 Recommendation is the specification that defines the rules for elements and attributes. There are a number of optional technologies that define mechanisms for naming, identifying, and locating portions of an XML document, including XML Namespaces, XLink, XPointer, and XPath. These will ultimately be combined to allow XML fragments to be uniquely named, labeled, and made available for use throughout the Web.

Although the XML Recommendations are relatively new specifications, there are already millions of XML pages appearing on the Web. Virtually all application domains (i.e., vertical markets) are looking to use XML to define and exchange structured information. In addition, XML processors are beginning to appear in popular Web browsers and associated applications. As such, interoperability among these implementations has become paramount.

To address this need, NIST's Information Technology Laboratory (ITL) partnered with the Organization for the Advancement of Structured Information Standards (OASIS) to develop the XML Test Suite. The test suite includes over 2000 tests, with contributions from key member companies. In developing the test suite, the committee also developed an XML test description file that serves two purposes: first, it is used in conjunction with a stylesheet to generate a test coverage report; second, it is used to aid in automated test execution and evaluation. This use of XML to describe information pertaining to the tests has led to efficient and widespread use of the test suite. The XML Test Suite is available from the ITL Web site, http://www.nist.gov/xml/.

XML Schemas

Automated processing of XML has increased the need for a more rigorous approach to constraining XML data than what is provided by a DTD. XML Schemas are addressing this need by further defining methods for document structure, data typing, and conformance to cursory XML documents. Document structure will include constraints for namespaces, elements, and attributes, as well as content contained within entities and notations. In addition, schemas permit inheritance, embedded documentation, and application-specific constraints and descriptions. Data typing includes primitive data typing, such as byte, date, integer, sequence, SQL and Java™ primitive data types, and user-defined data types.

The W3C XML Schema effort is currently in final review and as such, implementations are beginning to appear. ITL will again work with W3C and OASIS to develop a test suite that can be used to test schema implementations.

Displaying XML: Using Stylesheets

Unlike HTML, XML tags do not define display characteristics that can be interpreted by a browser. Instead, display issues are left for associated stylesheets that can be defined for a specific XML document. Two methods currently exist and have been summarized by W3C as follows:

Cascading Style Sheets (CSS)
Extensible Stylesheet Language (XSL)

CSS has been associated with HTML for some time and can also be used to display XML data in much the same fashion. The W3C recommends that CSS be used for displaying XML data when possible. For further information on CSS, see the references at the end of this article. The W3C XSL (www.w3c.org/TR/XSL) is the advanced language for expressing stylesheets and consists of two parts:

XSL Transformations (XSLT) – W3C Recommendation
Extensible Stylesheet Language (XSL) – W3C Working Draft

By applying a style sheet to an XML file, it is possible to define your data once and then transform and display that data for use in a variety of mediums. The most common of these display mechanisms is a Web browser. The use of style sheets within a Web browser allows a web site designer to easily control formatting, such as font size, color, and text alignment throughout a set of web pages. This control over the layout of a given page can be used to enhance accessibility; however, it is important to ensure that your web pages meet the accessibility guidelines of your organization and that they are viewable by those browsers that either do not support style sheets or have disabled support for style sheets. In addition, be careful not to interfere with the user-defined style sheet capability in some of the newer browsers. Style sheets are not limited to Web browsers – additional style sheets could be defined to display this same XML data on a wireless phone, a printed book, a catalog, or a brochure. This ability to define data once and use it everywhere provides flexibility within an organization, where information is typically used in a variety of formats.

The presentation of XML data using XSL is a two-step process. The first step is accomplished by applying a set of template rules defined in XSLT to the XML source tree. Portions of a particular XML document may optionally be selected using the XML Path Language (XPath). The result of this transformation is known as the result tree. Quite often, the result tree is significantly different from the source tree. For example, one view of an XML document may produce a table of contents by selecting major headings, and another view of that same data may produce a detailed report.
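The idea of producing two different result trees from one source tree can be sketched in Python. This is only an imitation of the transformation step, written with the standard library's ElementTree, which is not an XSLT processor and supports only a small subset of XPath. The source document and its element names (report, section, heading, para) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the element names are illustrative only.
SOURCE = """<report>
  <section><heading>Scope</heading><para>Details on scope.</para></section>
  <section><heading>Results</heading><para>Details on results.</para></section>
</report>"""


def table_of_contents(source_root):
    """Build a result tree (an HTML list) from the major headings only."""
    ul = ET.Element("ul")
    # The path below selects every heading that is a child of a section.
    for heading in source_root.findall("./section/heading"):
        li = ET.SubElement(ul, "li")
        li.text = heading.text
    return ul


def detailed_report(source_root):
    """Build a different result tree that keeps headings and paragraphs."""
    body = ET.Element("div")
    for section in source_root.findall("./section"):
        h2 = ET.SubElement(body, "h2")
        h2.text = section.findtext("heading")
        p = ET.SubElement(body, "p")
        p.text = section.findtext("para")
    return body


source = ET.fromstring(SOURCE)
toc = ET.tostring(table_of_contents(source), encoding="unicode")
report = ET.tostring(detailed_report(source), encoding="unicode")
```

Both functions read the same source tree; only the selection and the shape of the output differ, which is the separation that XSLT template rules provide declaratively.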

In the second part of the presentation process, the result tree and associated formatting semantics are interpreted for display by a formatting device, such as a Web engine, a wireless engine, or a print engine. The formatting semantics are defined through a set of classes and properties. The classes denote high-level abstractions such as a page, paragraph, or table. Finer-level control of these objects is provided by a set of formatting properties, such as indents and spacing between words and letters. All aspects of the presentation are controlled through both the classes of formatting objects and the formatting properties.

XSL files are XML files and therefore must follow the well-formedness rules for XML. XSL stylesheets start with the “<xsl:stylesheet>” tag and end with the “</xsl:stylesheet>” tag.
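Because a stylesheet is itself XML, any XML parser can read it. The sketch below parses a minimal, hypothetical XSLT 1.0 stylesheet with Python's standard library and inspects its root element and template rules. In XSLT 1.0 the root element is written xsl:stylesheet, with the xsl prefix bound to the namespace http://www.w3.org/1999/XSL/Transform; the single template rule shown is an assumption made for illustration, and the code only inspects the stylesheet without executing it.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical stylesheet with a single template rule for the customer example.
STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/customer">
    <p><xsl:value-of select="name"/></p>
  </xsl:template>
</xsl:stylesheet>"""

# Parsing succeeds only if the stylesheet is well-formed XML.
root = ET.fromstring(STYLESHEET)

# ElementTree reports namespaced names as "{namespace}localname".
templates = root.findall(f"{{{XSLT_NS}}}template")
match_patterns = [t.get("match") for t in templates]
```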

As the group of XSL specifications approaches maturity, implementations are beginning to appear. ITL is again working with W3C and OASIS to develop a comprehensive test suite for XSL processors. The test suite will focus on two key areas. The first, XSLT, deals with the rules for transforming data into other vocabularies. The second, XPath, provides the common syntax and semantics. ITL is also developing a set of XSLT/XPath tests that builds on the work done by the Lotus Organization, which originally developed an early version of an XSLT/XPath test suite. These tests address a number of issues, ranging from IEEE 754 Boolean operations to specific function operations and templates defined in the XSLT/XPath specifications, and are mostly “depth of coverage” tests.

The test suite will be a collection of tests submitted by different member organizations and will allow users to customize the test suite to filter out optional category tests. This version of the test suite will not include the formatting objects part of XSL; however, research is under way to determine how ITL can make a contribution to this part of the specification. There are approximately 1600 tests thus far and work continues. The final test suite is expected to be complete by March 2001 and be publicly available.

Manipulating XML via the DOM

The ability to dynamically access and update the content, structure, and style of documents first became available with the advent of dynamic HTML. Its use immediately spread among Web programmers, who were then able to animate a set of Web pages and, in a sense, bring life to the Web. As the technology progressed, divergent implementations began to appear. In response to this issue, a standard set of interfaces was defined by W3C, which strove to create a specification that was not only platform- and language-neutral but also provided maximum interoperability for Web authors. In October 1998, the Document Object Model (DOM) Level 1 Recommendation was released by the W3C and detailed the interfaces used to access and update XML and HTML documents. The original version of dynamic HTML focused entirely on HTML. It is informally referred to as DOM Level 0 and has been folded into Level 1. The DOM is an API (Applications Programming Interface) that allows XML and HTML data to be manipulated. The data can be dynamically added, removed, or modified within the document.

W3C has since released DOM Level 2, which is at Candidate Recommendation stage and implements Level 1 functionality along with new features. New features include the ability to manipulate style information contained in a document and support for XML namespaces. W3C is currently working on the Public Working Draft for DOM Level 3. Level 3 continues to build on Level 2; additional features include a standard method for loading and saving documents, along with support for document validation.

DOM implementations allow programmers to create applications that will work on all browsers and servers. This functionality is not limited to Web clients and servers, but is also being used in applications that make use of XML as a display-independent format. Although only the ECMAScript and Java™ bindings are defined in the W3C specifications, implementations exist for Visual Basic®, Visual C++®, C, Python, and Perl. Although programmers may need to use different programming languages as they move from application to application, they will be able to take advantage of a consistent programming model.
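As a brief illustration of the DOM programming model, the sketch below uses xml.dom.minidom, a DOM implementation in Python's standard library, to modify, add, and remove data in a small document. The method names (setAttribute, createElement, appendChild, removeChild) come from the DOM interfaces; the fax and email elements and their values are hypothetical additions to the customer example.

```python
from xml.dom import minidom

# A compact version of the customer record; the fax element is hypothetical.
doc = minidom.parseString(
    '<customer percent-discount="5">'
    "<name>Richard Kimball</name><fax>555-0100</fax>"
    "</customer>"
)
customer = doc.documentElement

# Modify: change an attribute value on an existing element.
customer.setAttribute("percent-discount", "10")

# Add: create a new element with a text node and attach it to the tree.
email = doc.createElement("email")
email.appendChild(doc.createTextNode("rk@example.com"))
customer.appendChild(email)

# Remove: locate an element by tag name and detach it from its parent.
fax = customer.getElementsByTagName("fax")[0]
customer.removeChild(fax)

# Names of the element children that remain after the three operations.
child_names = [node.tagName for node in customer.childNodes]
```

The same sequence of calls would look much the same in the ECMAScript or Java bindings, which is the consistency the DOM is meant to provide.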

In order to test DOM implementations for conformance to the W3C Recommendation, ITL has developed a comprehensive set of tests that check for compliance with the DOM Level 1 Recommendation. ITL staff has concentrated on the two bindings defined within the specification, namely ECMAScript and Java™. The ECMAScript tests cover the fundamental, extended, and HTML interfaces of the DOM Level 1 Recommendation and number over 800. The ECMAScript suite allows the user to select a particular category from a drop-down menu and a specific interface that contains a series of tests. The tests within the selected interface are then executed, and the results are displayed in a color-coded tabular format that contains the test name, a brief description of the test, and the expected and actual results. If the expected and actual results match, the test is highlighted in blue; otherwise, it is highlighted in red.
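The reporting scheme described above can be sketched as follows. This is not the ITL test harness (which is written in ECMAScript and Java); it is a hypothetical Python illustration of comparing expected and actual results and emitting one color-coded table row per test. The class name, field names, and sample test names are assumptions.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Outcome:
    """One test's report fields, as described in the text."""
    name: str
    description: str
    expected: str
    actual: str

    @property
    def passed(self):
        # A test passes when the expected and actual results match.
        return self.expected == self.actual


def to_html_row(outcome):
    """Render one table row: blue for a match, red for a mismatch."""
    color = "blue" if outcome.passed else "red"
    cells = (outcome.name, outcome.description, outcome.expected, outcome.actual)
    tds = "".join(f"<td>{escape(cell)}</td>" for cell in cells)
    return f'<tr style="color: {color}">{tds}</tr>'


# Hypothetical results for two tests of a node-name lookup.
rows = [
    to_html_row(Outcome("nodeName01", "element node name", "customer", "customer")),
    to_html_row(Outcome("nodeName02", "attribute node name", "percent-discount", "discount")),
]
```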

The Java™ tests cover the fundamental and extended interfaces of the DOM Level 1 Recommendation and number over 200. They are organized into a set of classes – one class for each interface. Inside each class, you’ll find a set of methods that exercise a particular interface. The results are returned in an HTML file and, like the ECMAScript tests, are highlighted in blue if they are correct and in red if they fail. The Java™ tests also offer the flexibility of being able to run the tests offline. You can download the test suite software and the parser you wish to test and run the tests for a particular interface from the command line.

Source code and documentation for both test suites are available and can be downloaded. The test suite is available at http://xw2k.sdct.itl.nist.gov/xml/dom-test-suite.html with links to both bindings.

Registries/Repositories

Together, all of the above specifications represent the core capabilities necessary to define, manipulate, and display data from an XML-based language. Businesses of all kinds are pursuing standard XML definitions for use within their electronic marketplaces.

In the simplest sense, the benefits of XML will be achieved only if a significant number of organizations are using the same XML definitions. Therefore, these XML definitions must be available for partners to discover and retrieve. A registry/repository is a mechanism used to discover and retrieve documents, templates, and software (i.e., objects and resources) over the Internet. A registry is the mechanism used to discover the object. The registry provides information about the object, including the location of the object. A repository is where the object resides. A user retrieves an object from a repository.
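The division of labor between a registry and a repository can be sketched with two in-memory structures. This is a hypothetical illustration only and does not follow the OASIS or ebXML specifications: the entry fields, the repo:// locations, and the function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RegistryEntry:
    """Information about an object, including where it resides."""
    name: str
    description: str
    location: str


# Repository: holds the objects themselves, keyed by location (hypothetical).
repository = {
    "repo://definitions/customer.dtd":
        "<!ATTLIST customer percent-discount (0|5|10) #REQUIRED>",
}

# Registry: holds only information about the objects.
registry = [
    RegistryEntry(
        name="customer-dtd",
        description="DTD for customer records",
        location="repo://definitions/customer.dtd",
    ),
]


def discover(keyword):
    """Search the registry for entries whose description mentions the keyword."""
    return [e for e in registry if keyword.lower() in e.description.lower()]


def retrieve(entry):
    """Fetch the object itself from the repository at the registered location."""
    return repository[entry.location]
```

A partner would first call discover("customer") to learn that a definition exists and where it lives, then call retrieve on the returned entry to obtain the definition itself.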

Although XML is a recent newcomer in the electronic commerce landscape, supply chains in many industries, as well as industry consortiums and standards organizations, are using XML to define their own vocabularies for business relationships and transactions. The vocabularies, business templates, and business processes used by these groups to transact business must be accessible by all partners at any time. Furthermore, newcomers to the supply chain or business partnerships must be able to discover these documents and retrieve them. A registry and repository can be used to provide this service. A series of registries and repositories can link many organizations and industries, acting as a Web of registries for discovery. Standards are needed to ensure interoperability of these registries; additionally, a registry vocabulary must be created for consistency of discovery information among them.

As members of the OASIS Registry/Repository Working Group (Reg/Rep WG), ITL researchers have served as primary authors of the draft OASIS Registry/Repository Specification. ITL serves as one of two implementers of the emerging specification. The ITL implementation will be considered the reference implementation. ITL provides feedback to the OASIS Reg/Rep WG based on lessons learned from the implementation work.

ITL is also working within the ebXML Project (http://www.ebxml.org), a joint project between OASIS and the United Nations body for Trade Facilitation and Electronic Business (UN/CEFACT). This is the prominent business-oriented, international standards organization for the discovery, retrieval, and use of business processes and related documents. ITL provides input into the ebXML Registry/Repository Project Team based on the lessons learned from the OASIS work.

The role of ITL is to influence the quality, correctness, and testability of the specifications of both the OASIS and ebXML Registry/Repository Working Groups through our reference implementation of a registry and repository that is conformant to both specifications. Additionally, ITL facilitates crossover discussions between OASIS and ebXML, thus helping to ensure compatibility of the specifications.

ITL, through its leadership in developing a reference implementation of a registry/repository to both the OASIS and ebXML specifications, will help ensure that both specifications are unambiguous, complete, and testable. This work will also contribute to the compatibility of these two specifications. The completion of these specifications will allow small- and medium-sized enterprises (SMEs) to make appropriate choices with regard to EC tools and applications, and will allow them access to the emerging supply chain and industry partnership EC models.

For More Information

Sorting through the myriad of XML-related technologies is a daunting task. We have presented an overview of some of the core technologies necessary to define and use XML within an organization.

ITL, through its leadership in developing conformance test suites and reference implementations, will help ensure that the XML family of technologies is unambiguous, complete, and testable. Furthermore, a set of metrics for determining conformance to these specifications is available for use in testing particular implementations.

For further reading, consult the W3C Web site, http://www.w3.org/ and ITL project pages, available from http://www.nist.gov/itl/div897/.

® Visual Basic and Visual C++ are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

™ Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries.

Disclaimer: Any mention of commercial products or reference to commercial organizations is for information only; it does not imply recommendation or endorsement by the National Institute of Standards and Technology nor does it imply that the products mentioned are necessarily the best available for the purpose.