MSDN Magazine

A Quick Guide to XML Schema


Download the code for this article: XML0204.exe (35KB)

Of all the XML technologies, XML Schema is most valuable to software developers because it finally makes it possible to add type information to XML documents. This column is the first in a two-part series that will cover the basics of XML Schema.
      First, let's review what preceded XML Schema. The XML 1.0 specification came with a built-in syntax for describing XML vocabularies, called Document Type Definitions (DTD). DTDs have actually been around for quite some time considering XML 1.0 inherited the syntax from its predecessor, the Standard Generalized Markup Language (SGML).
      DTDs allow you to describe the structure of XML documents. For example, say that you want to use the following XML vocabulary to describe employee information:

<employee id="555-12-3434">

  <name>Monica</name>

  <hiredate>1997-12-02</hiredate>

  <salary>42000.00</salary>

</employee>

The following DTD describes the structure of this document:

<!-- employee.dtd -->

<!ELEMENT employee (name, hiredate, salary)>

<!ATTLIST employee

          id CDATA #REQUIRED>

<!ELEMENT name (#PCDATA)>

<!ELEMENT hiredate (#PCDATA)>

<!ELEMENT salary (#PCDATA)>

This DTD can then be associated with the original document through a DOCTYPE declaration, as shown here:

<!DOCTYPE employee SYSTEM "employee.dtd">

<employee id="555-12-3434">

  <name>Monica</name>

  <hiredate>1997-12-02</hiredate>

  <salary>42000.00</salary>

</employee>

      Validation is the main benefit of using DTDs. When a validating XML 1.0 parser reads this XML 1.0 file, it can also read the associated DTD and verify that the document conforms to the definition. Using DTDs for validation can reduce the amount of error handling that you must build into the application.
      Although DTDs were well suited for many SGML-based electronic publishing applications, their limitations quickly became apparent when applied to modern software development domains such as those surrounding today's Web initiatives. The main limitations of DTDs are that DTD syntax is not XML-compliant, and DTDs don't support namespaces, typical programming language data types, or defining custom types.
      Since the DTD syntax itself is not XML, you can't use standard XML tools to process the definitions programmatically. Most XML 1.0 processors support DTD validation, but they don't support programmatic access to the information found in the DTD due to the complexity of the syntax.
      Because DTDs were created before XML namespaces even existed, it's no surprise that they don't work well together. In fact, using DTDs to describe namespace-aware documents is like trying to pound a square peg into a round hole. For more on the hideous details of making this work, check out the May 2001 installment of The XML Files column in which I provide a sample namespace-aware DTD. As a result, most developers choose to use either DTDs or namespaces, but not both.
      DTDs were also specifically designed for document-centric systems in which programmatic data types typically don't exist. As a result, only a handful of type identifiers exist for describing attributes (see Figure 1). These type identifiers aren't like anything you're used to working with in your programming language. They're really just special cases of text (CDATA). Again, these types cannot be applied to text-only elements, only to attributes.
      And finally, the DTD type system is not extensible. This means you're stuck with the types described in Figure 1. Creating custom types that make sense in your problem domain is out of the question with DTDs. These limitations are enough to make any XML developer run from DTDs when presented with the exciting new future offered by XML Schema.

XML Schema Basics

      XML Schema is itself an XML vocabulary for describing XML instance documents. I use the term "instance" because a schema describes a class of documents, of which there can be many different instances (see Figure 2). This is analogous to the relationship between classes and objects in today's object-oriented systems. A schema is to an XML document what a class is to an object. Therefore, while using XML Schema, you'll typically be working with more than one document: the schema and one or more XML instance documents.

Figure 2 Namespace Identifier Linkage

      The elements used in a schema definition come from the http://www.w3.org/2001/XMLSchema namespace, which I'll bind to xsd throughout the rest of this column. The following is the basic schema template:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"

  targetNamespace="http://example.org/employee/">

  <!-- definitions go here -->

</xsd:schema>

Schema definitions must have a root xsd:schema element. There are a variety of elements that may be nested within xsd:schema including, but not limited to, xsd:element, xsd:attribute, and xsd:complexType, all of which I'll discuss.
      The fact that a schema definition is an XML document solves the first DTD limitation. You can process schema definitions with standard XML 1.0 tools and services such as DOM, SAX, XPath, and XSLT. The intrinsic simplicity in doing so has fueled an onslaught of schema tools.
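As a rough illustration of that point, the sketch below reads a schema with nothing more than a general-purpose XML parser (Python's standard xml.etree.ElementTree). The schema text and the helper name summarize_schema are assumptions made for this example, not part of the article's download.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema text, modeled on the template shown above.
SCHEMA_TEXT = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/employee/">
  <xsd:element name="employee"/>
  <xsd:attribute name="id"/>
</xsd:schema>
"""

def summarize_schema(schema_text):
    """Return (targetNamespace, global element names, global attribute names)."""
    root = ET.fromstring(schema_text)
    # findall() with a plain tag name only looks at direct children of the
    # root, which is exactly where global declarations live.
    elements = [e.get("name") for e in root.findall(f"{{{XSD_NS}}}element")]
    attributes = [a.get("name") for a in root.findall(f"{{{XSD_NS}}}attribute")]
    return root.get("targetNamespace"), elements, attributes

target_ns, element_names, attribute_names = summarize_schema(SCHEMA_TEXT)
print(target_ns, element_names, attribute_names)
```

Nothing here is schema-specific: the same parser that reads instance documents reads the schema, which is what DTD syntax never allowed.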

XML Schema and Namespaces

      The definitions placed within the xsd:schema element are automatically associated with the namespace specified in the targetNamespace attribute. In the case of the previous example, the schema definitions would be associated with the http://example.org/employee/ namespace.
      The namespace identifier is the key that links XML documents to the corresponding Schema definition (see Figure 2). For example, the following XML instance document contains the employee element from the http://example.org/employee/ namespace:

<tns:employee xmlns:tns="http://example.org/employee/"/>

The employee element's namespace is the same as the targetNamespace in the schema definition.
      In order to take advantage of the schema while processing the employee element, the processor needs to locate the correct schema definition. How schema processors locate the schema definition for a particular namespace is not defined by the specification. Most processors, however, will allow you to load an in-memory cache of schemas that they will use while processing documents. For example, the following JScript®-based code illustrates a simple way to do this with MSXML 4.0:

var sc = new ActiveXObject("MSXML2.XMLSchemaCache.4.0");

sc.add("http://example.org/employee/", "employee.xsd");

var dom = new ActiveXObject("MSXML2.DOMDocument.4.0");

dom.async = false;  // load synchronously so that load() reflects the parse result

dom.schemas = sc;

if (dom.load("employee.xml"))

  WScript.echo("success: document conforms to Schema");  

else

  WScript.echo("error: invalid instance");

It works similarly in Microsoft® .NET and in most other XML Schema-aware processors.
      You can download a command-line validation utility from the link at the top of this article to experiment with the principles discussed in this column. The validation utility allows you to specify the instance document you would like to validate along with as many schema definitions as you need. The command-line usage is as follows:

c:>validate instance.xml -s schema1.xsd -s schema2.xsd ...

      XML Schema also defines the schemaLocation attribute, which gives a hint in the instance document as to the whereabouts of the required schema definitions. The schemaLocation attribute is in the http://www.w3.org/2001/XMLSchema-instance namespace, which was set aside specifically for attributes that are only used in instance documents. I'll bind this namespace to the xsi prefix from now on. The xsi:schemaLocation attribute takes a space-delimited list of namespace identifier and URL pairs, as shown here:

<tns:employee xmlns:tns="http://example.org/employee/"

  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

  xsi:schemaLocation="http://example.org/employee/

                      http://develop.com/aarons/employee.xsd"

/>

      In this case, if the processor doesn't already have access to the appropriate schema definition for the http://example.org/employee/ namespace, it can download it from http://develop.com/aarons/employee.xsd.
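To make the pair structure of xsi:schemaLocation concrete, here is a minimal sketch that extracts the hints from the instance shown above using Python's standard library. The function name schema_location_hints is hypothetical, and the sketch only reads the hint; it does not download or apply any schema.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# The instance document from the example above.
INSTANCE_TEXT = """\
<tns:employee xmlns:tns="http://example.org/employee/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://example.org/employee/
                      http://develop.com/aarons/employee.xsd"
/>
"""

def schema_location_hints(instance_text):
    """Return {namespace identifier: schema URL} from the root's xsi:schemaLocation."""
    root = ET.fromstring(instance_text)
    # The value is a whitespace-delimited list: namespace, URL, namespace, URL, ...
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    if len(tokens) % 2:
        raise ValueError("schemaLocation must contain namespace/URL pairs")
    return dict(zip(tokens[0::2], tokens[1::2]))

hints = schema_location_hints(INSTANCE_TEXT)
print(hints)
```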

Elements and Attributes

      Elements and attributes can be defined as part of the targetNamespace by using the xsd:element and xsd:attribute elements, respectively. For example, let's say that you want to describe the following namespace-aware instance document:

<tns:employee xmlns:tns="http://example.org/employee/"

  tns:id="555-12-3434">

  <tns:name>Monica</tns:name>

  <tns:hiredate>1997-12-02</tns:hiredate>

  <tns:salary>42000.00</tns:salary>

</tns:employee>

The simplest way to accomplish this is through the following schema definition:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"

  targetNamespace="http://example.org/employee/">

  <xsd:element name="employee"/>

  <xsd:element name="name"/>

  <xsd:element name="hiredate"/>

  <xsd:element name="salary"/>

  <xsd:attribute name="id"/>

</xsd:schema>

Notice that simply placing the xsd:element/xsd:attribute declarations within the xsd:schema element automatically associates them with the http://example.org/employee/ namespace. These declarations are considered global in the schema since they are children of the root xsd:schema element.
      Since this schema specifies that these elements/attributes are part of the http://example.org/employee/ namespace, they must be associated with that namespace in the instance document (as in the original instance that I listed earlier). Making subtle namespace changes to the instance will cause it to become invalid. For example, consider the following document, which contains unqualified name, hiredate, and salary elements along with an unqualified id attribute:

<tns:employee xmlns:tns="http://example.org/employee/"

  id="555-12-3434">

  <name>Monica</name>

  <hiredate>1997-12-02</hiredate>

  <salary>42000.00</salary>

</tns:employee>

      Since the previous schema definition states that these elements/attributes are from the http://example.org/employee/ namespace and this time they aren't associated with a namespace, this instance is invalid according to the schema.
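The mismatch can be made visible with a small name check. The sketch below is not a schema validator; it only compares the namespace-qualified names found in the instance against the global declarations in the schema. The helper name undeclared_names is an assumption for this example.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# The simple schema from this section.
SCHEMA_TEXT = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/employee/">
  <xsd:element name="employee"/>
  <xsd:element name="name"/>
  <xsd:element name="hiredate"/>
  <xsd:element name="salary"/>
  <xsd:attribute name="id"/>
</xsd:schema>
"""

# The instance with unqualified children and an unqualified attribute.
UNQUALIFIED_INSTANCE = """\
<tns:employee xmlns:tns="http://example.org/employee/"
  id="555-12-3434">
  <name>Monica</name>
  <hiredate>1997-12-02</hiredate>
  <salary>42000.00</salary>
</tns:employee>
"""

def undeclared_names(schema_text, instance_text):
    """List (kind, name) for instance names with no matching global declaration."""
    schema = ET.fromstring(schema_text)
    ns = schema.get("targetNamespace")
    # Global declarations belong to the targetNamespace.
    declared_elements = {f"{{{ns}}}{e.get('name')}" for e in schema.findall(XSD + "element")}
    declared_attributes = {f"{{{ns}}}{a.get('name')}" for a in schema.findall(XSD + "attribute")}
    problems = []
    for el in ET.fromstring(instance_text).iter():
        if el.tag not in declared_elements:
            problems.append(("element", el.tag))
        for attr in el.attrib:
            if attr not in declared_attributes:
                problems.append(("attribute", attr))
    return problems

problems = undeclared_names(SCHEMA_TEXT, UNQUALIFIED_INSTANCE)
print(problems)
```

Only the root element carries the expected namespace; the three children and the id attribute have no namespace at all, so none of them matches a declaration.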
      An even subtler change would be to modify the original document so that it uses a default namespace declaration instead of a namespace prefix:

<employee xmlns="http://example.org/employee/"

  id="555-12-3434">

  <name>Monica</name>

  <hiredate>1997-12-02</hiredate>

  <salary>42000.00</salary>

</employee>

Although in this case all of the elements are associated with the default namespace (http://example.org/employee/), the id attribute is still unqualified because default namespaces don't apply to attributes. As a result, this document instance is also considered invalid according to the schema.
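A namespace-aware parser shows the difference directly. In this sketch, Python's xml.etree.ElementTree reports each name in {namespace}local form; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

# The default-namespace version of the instance document shown above.
DEFAULT_NS_DOC = """\
<employee xmlns="http://example.org/employee/"
  id="555-12-3434">
  <name>Monica</name>
  <hiredate>1997-12-02</hiredate>
  <salary>42000.00</salary>
</employee>
"""

root = ET.fromstring(DEFAULT_NS_DOC)

# Every element picks up the default namespace...
element_names = [el.tag for el in root.iter()]
# ...but the attribute name stays unqualified.
attribute_names = list(root.attrib)

print(element_names)
print(attribute_names)
```

The element names all come back prefixed with {http://example.org/employee/}, while the attribute comes back as plain id.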
      As you can see, XML namespaces are at the very heart of XML Schema. When using XML Schema, you must fully understand how namespaces work because if the instance document doesn't agree with what the schema specifies, it will be invalid.
      You may have also noticed that this simple example does not constrain the content of any of the elements nor does it define the structural relationship between the elements in the namespace. It's equivalent to the following DTD (omitting the attribute declaration for now):

<!ELEMENT employee ANY>

<!ELEMENT name ANY>

<!ELEMENT hiredate ANY>

<!ELEMENT salary ANY>

      Thus the following XML instance document would also be valid according to the schema, even though the document doesn't make any sense:

<tns:name xmlns:tns="http://example.org/employee/">

  <tns:employee>

    <tns:hiredate>42.000</tns:hiredate>

    <tns:salary tns:id="555-12-3434">Monica</tns:salary>

  </tns:employee>

</tns:name>

XML Schema makes it possible to describe an element's structure through complex type definitions.

Defining Complex Types

      With DTDs, an element's content model is defined within an ELEMENT declaration, as shown here:

<!ELEMENT employee (name, hiredate, salary)>

This ELEMENT declaration states that an employee element contains a name element, followed by a hiredate element, followed by a salary element.
      XML Schema makes it possible to define an element's content model in a similar fashion by nesting an xsd:complexType element within the xsd:element declaration, as shown here:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"

  targetNamespace="http://example.org/employee/">

  <xsd:element name="employee">

    <xsd:complexType>

      <!-- employee's content model goes here -->

    </xsd:complexType>

  </xsd:element>

</xsd:schema>

      The XML Schema model is more like a programming language where you bind variables to formal type definitions. The xsd:complexType allows you to define the type of an element, which conveys its structure. Nesting xsd:complexType within the element declaration effectively binds it to that element (like a variable). Thinking in terms of type definitions is a major paradigm shift from the way DTDs work.
      What you put inside of the xsd:complexType element is similar to what you put inside parentheses in a DTD ELEMENT declaration. The previous employee ELEMENT declaration specifies an ordered sequence of name, hiredate, and salary elements. Using a pipe (|) separator instead of a comma changes the meaning to a choice of one element:

<!ELEMENT employee (name | hiredate | salary)>

      In XML Schema, you specify the characteristics of the content model through a compositor element, which is nested as a child of the xsd:complexType element. XML Schema defines three compositor elements: xsd:sequence, xsd:choice, and xsd:all (shown in Figure 3).
      The xsd:sequence and xsd:choice elements are equivalent to the DTD examples just shown. However, xsd:all is a new concept—it specifies that the content model consists of all items in any order. This wasn't designed into the DTD syntax, although you could define such semantics by explicitly specifying all of the possible permutations as follows:

<!ELEMENT employee ( (name, hiredate, salary) |

                     (name, salary, hiredate) |

                     (hiredate, name, salary) |

                     (hiredate, salary, name) |

                     (salary, name, hiredate) |

                     (salary, hiredate, name) ) >

As you can see, combinatorial mathematics begins to work against you quickly. The XML Schema approach is much cleaner since "all" is a first-class compositor like sequence and choice.
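To see how fast the permutation approach grows, the following sketch generates the DTD content model for any list of child elements. The function name all_order_content_model is made up for this illustration.

```python
from itertools import permutations

def all_order_content_model(element, children):
    """Build a DTD ELEMENT declaration that allows the children in any order."""
    # One parenthesized sequence per ordering, joined as a choice.
    groups = ["(" + ", ".join(order) + ")" for order in permutations(children)]
    return f"<!ELEMENT {element} ( " + " | ".join(groups) + " ) >"

model_3 = all_order_content_model("employee", ["name", "hiredate", "salary"])
print(model_3)

# The number of alternatives is n factorial: 6 for three children, 120 for five.
count_5 = len(list(permutations(range(5))))
print(count_5)
```

Three children need six alternatives; five children already need 120, which is why a dedicated compositor is the more practical design.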
      Compositor elements may contain references to global element declarations, local element declarations, other compositors, and a few other constructs such as wildcards and group references. The schema example shown in Figure 4 shows how to define an xsd:complexType that references global elements defined elsewhere in the schema.
      Notice that the ref attribute takes a prefixed element name. Remember that once you declare a global element in the schema, it's automatically associated with the targetNamespace. When you reference global elements by name, the names are treated as qualified names. Had I used ref="name" instead of ref="tns:name", the schema processor would have looked for the name element associated with no namespace (or with the default namespace, had one been declared), and it wouldn't have found one, since the only name element declared in the schema is the name element from the http://example.org/employee/ namespace.
      If I had made http://example.org/employee/ the default namespace for the document, then I could have referenced the global element names without using a namespace prefix (for example, ref="name"), as shown in Figure 5.
      The sample schemas in Figure 4 and Figure 5 are logically equivalent—they've simply been serialized somewhat differently. Both sample schemas constrain the content of the employee element. Now, the employee element must contain a name, hiredate, and salary element, all of which must be associated with the http://example.org/employee/ namespace.
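Since the figures are not reproduced here, the sketch below uses a schema written in the style the text describes for Figure 4 (an assumption, not the original listing) and resolves each ref value the way the preceding paragraphs explain: a prefix maps to its declared namespace, and an unprefixed name maps to the default namespace if there is one. The function name check_refs is hypothetical, and the sketch assumes all namespace declarations sit on the root element.

```python
import io
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Assumed reconstruction of a schema that references global elements.
PREFIXED_SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:tns="http://example.org/employee/"
  targetNamespace="http://example.org/employee/">
  <xsd:element name="employee">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="tns:name"/>
        <xsd:element ref="tns:hiredate"/>
        <xsd:element ref="tns:salary"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="name"/>
  <xsd:element name="hiredate"/>
  <xsd:element name="salary"/>
</xsd:schema>
"""

def check_refs(schema_text):
    """Return {ref value: True if it names a global element declaration}."""
    prefixes = {}  # prefix -> namespace URI ("" is the default namespace)
    refs = []
    source = io.BytesIO(schema_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag == XSD + "element" and payload.get("ref"):
            refs.append(payload.get("ref"))

    root = ET.fromstring(schema_text)
    target_ns = root.get("targetNamespace")
    # Global declarations are direct children of xsd:schema.
    declared = {(target_ns, e.get("name")) for e in root.findall(XSD + "element")}

    results = {}
    for ref in refs:
        prefix, _, local = ref.rpartition(":")
        # An unprefixed ref resolves to the default namespace, or to none.
        results[ref] = (prefixes.get(prefix), local) in declared
    return results

# Same schema with unprefixed refs and no default namespace.
UNPREFIXED_SCHEMA = PREFIXED_SCHEMA.replace('ref="tns:', 'ref="')
# Same schema with the target namespace as the default namespace.
DEFAULT_NS_SCHEMA = UNPREFIXED_SCHEMA.replace("xmlns:tns=", "xmlns=")

prefixed = check_refs(PREFIXED_SCHEMA)
unprefixed = check_refs(UNPREFIXED_SCHEMA)
default_ns = check_refs(DEFAULT_NS_SCHEMA)
print(prefixed, unprefixed, default_ns)
```

The prefixed and default-namespace variants both resolve every reference, which matches the claim that the two serializations are logically equivalent; the unprefixed variant without a default namespace resolves none.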

Local Element Declarations

      Since employee is the only element that I plan to use as a top-level element in the instance documents, there is really no reason to define name, hiredate, and salary as global elements. Instead, I can just define the name, hiredate, and salary elements locally within the employee element's content model.
      For example, the schema shown in Figure 6 contains an employee element declaration that contains a sequence of local element declarations. In this example, the name, hiredate, and salary elements are actually declared as part of the employee element and cannot be used elsewhere in an instance. The employee element declaration is the only one that appears globally as a child of the root xsd:schema element. This brings up an interesting question: should local elements be associated with the target namespace?
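Figure 6 is not reproduced here, so the schema in the sketch below is an assumed reconstruction based on the description above. The Python code separates global declarations (children of xsd:schema) from local ones (nested inside a content model) using only the standard library.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Assumed reconstruction: one global element with three local declarations.
LOCAL_SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/employee/">
  <xsd:element name="employee">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="name"/>
        <xsd:element name="hiredate"/>
        <xsd:element name="salary"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""

root = ET.fromstring(LOCAL_SCHEMA)

# Global declarations are direct children of the root xsd:schema element.
global_elements = root.findall(XSD + "element")
global_names = [e.get("name") for e in global_elements]

# Every other xsd:element in the tree is a local declaration.
local_names = [e.get("name") for e in root.iter(XSD + "element")
               if e not in global_elements]

print(global_names)
print(local_names)
```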

Local Scoping and Namespaces

      To help understand the answer to this question, let's consider a similar example in a programming language that supports namespaces, like C#. Review the following C# class definition that has been defined within the "example" namespace:

namespac...
