On the surface, XML looks like HTML. Both are derived from the Standard Generalized Markup Language (SGML). Tools that generate HTML can often be reused to generate XML.
XML is different from HTML in two key areas: syntax and semantics.
Both HTML and XML use <, >, and & to create element and attribute structures. While HTML browsers accept or silently ignore mangled markup, XML parsers and the applications built on them are less forgiving. Errors in XML syntax halt document processing, and users or applications receive error messages, not a best-guess interpretation of the document structure.
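The difference in error handling can be sketched with the Python standard library. This is only an illustration under stated assumptions: `html.parser.HTMLParser` is a lenient tokenizer standing in for a browser (it is not a browser engine), and the `MANGLED` sample, `TagLogger`, `html_events`, and `xml_error` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: neither <p> nor <b> is ever closed.
MANGLED = "<p>Unclosed paragraph <b>and bold"


class TagLogger(HTMLParser):
    """Record the events the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def html_events(markup):
    """Feed markup to the HTML tokenizer; it does not raise on unclosed tags."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


def xml_error(markup):
    """Return the (line, column) where the XML parser stopped, or None if it parsed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return exc.position
    return None


print(html_events(MANGLED))  # the tokenizer reports what it saw, no error raised
print(xml_error(MANGLED))    # the XML parser reports where it gave up
```

The HTML side yields a list of events for whatever it could recognize, while the XML side yields only an error location, which mirrors the "error message, not a best guess" behavior described above.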
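XML documents must be well-formed. That is, they must follow rules for identifying document parts and creating nested element structures. These rules include the requirement that elements nest cleanly, without overlapping. For example, the following HTML is not well-formed XML: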
<b>This is bold text. <i>This is bold italic text.</b> This is italic text.</i>
In some HTML browsers, this text appears as follows.
This is bold text. This is bold italic text. This is italic text.
In an XML parser, however, all processing halts as soon as </b> is encountered, because the parser is looking for </i> and will not accept </b>. To achieve the same formatting in XML, use the following syntax.
<b>This is bold text.</b> <i><b>This is bold italic text.</b> This is italic text.</i>
This extra work for XML document creators results in a leap forward for interoperability. Because XML processors need far less "guessing" code, they fit more easily into smaller-scale processing environments, such as embedded systems. Structural ambiguities are eliminated from XML documents: all XML parsers see the same nested element structures.
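The nesting rule can be checked with any conforming XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the helper name `is_well_formed` and the wrapping `<p>` element are assumptions added for this example (the wrapper gives each sample a single root element, which XML also requires).

```python
import xml.etree.ElementTree as ET

# The two samples from the text, each wrapped in a hypothetical <p> root element.
OVERLAPPING = (
    "<p><b>This is bold text. <i>This is bold italic text.</b>"
    " This is italic text.</i></p>"
)
NESTED = (
    "<p><b>This is bold text.</b> <i><b>This is bold italic text.</b>"
    " This is italic text.</i></p>"
)


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup, False if it halts."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(OVERLAPPING))  # False: </b> arrives while <i> is still open
print(is_well_formed(NESTED))       # True: every element closes inside its parent
```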
Markup characters that appear as literal text, such as <, >, and &, are represented with the entity references &lt;, &gt;, and &amp;. For more information, see Character and Entity References.
Although XML is unforgiving about syntax, it offers developers more options for defining meaning in XML documents. HTML is basically one vocabulary with a few variations; <b> always means the same thing to an HTML processor. With XML, you can create your own markup vocabulary or choose from markup vocabularies appropriate to your industry or project type. Schemas and document type definitions (DTDs) let you describe these vocabularies, but you can also create documents using vocabularies without formal definitions. Namespaces help you identify the vocabulary you are using.
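To make the idea of a custom vocabulary concrete, here is a small sketch in Python. Everything in the sample document is hypothetical: the namespace URI `http://example.com/ns/catalog` is a made-up identifier, and `<b>` is deliberately defined to mean "book" rather than bold text. The helper `split_tag` is also an assumption introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: here <b> means "book", not bold text.
# The namespace URI is a made-up identifier, not a real schema location.
CATALOG = """\
<catalog xmlns="http://example.com/ns/catalog">
  <b id="1">First Title</b>
  <b id="2">Second Title</b>
</catalog>
"""

NS = {"cat": "http://example.com/ns/catalog"}


def split_tag(tag):
    """Split ElementTree's "{uri}local" tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


root = ET.fromstring(CATALOG)

# Elements are looked up by namespace-qualified name, so the vocabulary is explicit.
books = [(b.get("id"), b.text) for b in root.findall("cat:b", NS)]

print(split_tag(root.tag))  # ('http://example.com/ns/catalog', 'catalog')
print(books)                # [('1', 'First Title'), ('2', 'Second Title')]
```

Note that a lookup for the unqualified name `b` finds nothing in this document, because the element belongs to the catalog namespace. That is how a namespace keeps this `<b>` from being confused with the `<b>` of another vocabulary.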
This approach requires architectures different from those used by browsers. Developers cannot count on XML applications to understand what their markup means or how it is to be presented; those understandings were built into HTML browsers. Browsers can still present XML, but they require a style sheet to format it to your specifications. These style sheets are built using Cascading Style Sheets (CSS) or XSL Transformations (XSLT). Some browsers, including Internet Explorer 5.0 and later, include a default style sheet, but it is designed more for diagnostics than for presenting information to end users.
XML applications can also bring their own logic to XML vocabularies, rather than relying on style sheets. This logic may take the form of simple scripts or binding to particular presentation modes, or it may involve writing an entire application from scratch. These applications can take advantage of their built-in knowledge of the labeled structures contained in XML documents to process the information in those documents, present them to users, connect them with other data sources, or redirect them to other appropriate consumers.
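As a rough illustration of an application bringing its own logic to a vocabulary, the sketch below reads a small inventory document and produces a plain-text summary. The `<inventory>` vocabulary, its element and attribute names, and the `summarize` function are all hypothetical, invented for this example; the point is only that the program, not a style sheet, decides what the labeled structures mean.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary; the application knows what each element means.
INVENTORY = """\
<inventory>
  <item sku="A-100"><name>Widget</name><quantity>4</quantity><price>2.50</price></item>
  <item sku="B-200"><name>Gadget</name><quantity>1</quantity><price>10.00</price></item>
</inventory>
"""


def summarize(xml_text):
    """Turn the labeled structures into report lines plus a grand total."""
    root = ET.fromstring(xml_text)
    lines = []
    total = 0.0
    for item in root.findall("item"):
        name = item.findtext("name")
        quantity = int(item.findtext("quantity"))
        price = float(item.findtext("price"))
        subtotal = quantity * price
        total += subtotal
        lines.append(f"{item.get('sku')}: {name} x{quantity} = {subtotal:.2f}")
    lines.append(f"Total: {total:.2f}")
    return lines


for line in summarize(INVENTORY):
    print(line)
# A-100: Widget x4 = 10.00
# B-200: Gadget x1 = 10.00
# Total: 20.00
```

The same parsed structure could just as easily be passed to another data source or consumer instead of being printed.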