Troubleshooters.Com® and Web Workmanship Present:

XML Primer

Introduction

XML stands for eXtensible Markup Language, and it's a clean and powerful technology. Without going into detail, it can be used to describe and/or extend almost any data structure, hence "eXtensible". The XML specification was first conceived in 1996, and became an official specification in 1998. Thus, it came out years after Tim Berners-Lee invented HTML.

This is a shame because I'm pretty sure if XML had pre-existed HTML, Tim Berners-Lee would have made HTML a dialect of XML, in which case authoring and parsing HTML would have been much easier, and browsers would be somewhat simpler. But that ship has sailed.

The HTML5 specification allows, but doesn't require, HTML to also be well formed XML. The term "well formed" will be defined later on this page. Making your HTML also be well formed XML gives you the following advantages:

The term "well formed XML" simply means conforming to XML syntax, which is incredibly simple. The term "valid XML" refers to well formed XML whose content matches its DTD or Schema. Because standard HTML version 5 (HTML5) has no XML DTD or Schema, XML validity plays no part in HTML authoring and won't be mentioned again.

I wouldn't be caught dead writing HTML that wasn't also well formed XML. You shouldn't either.

From here on it's assumed you'll be making your HTML also be well formed XML. The entirety of the Web Workmanship project is geared toward HTML that is well formed XML. Pages of the Web Workmanship project sometimes use the word "XMLizing" to refer to making your HTML also well formed XML.

XML Implements a Hierarchy

Put simply, XML implements a hierarchy. There's one top level item (node, element, however you want to think about it), which has children, and each of those children can have children, as deep as you want to go. Each item can have as many attributes as desired, with each attribute being something about the item. See the following example, which implements a rudimentary inventory:

<inventory>
 <books>
  <b1002 title="Shark" price="24.95" onhand="4"/>
  <b1001 title="Shark" price="24.95" onhand="4"/>
 </books>
 <trinkets>
  <t3001 name="Clock" price="11.95" onhand="43"/>
  <t3002 name="Statue" price="59.95" onhand="7"/>
 </trinkets>
 <about>Mark and Cleopatra enterprises is your
  premier source for both books and trinkets.
  Our prices are the best and we feature same
  day fulfillment.
 </about>
</inventory>

In the preceding, which is a very contrived example, the top level item encompasses the entire inventory. Two of its children, books and trinkets, represent the two categories of inventory. Inside each of those are the inventory items that belong to that category. Each inventory item has attributes to describe the name or title, price, and current number of the inventory item on hand. The third child, about, has text as its content.

The preceding example could have been organized different ways. Just as one example, the category of each inventory item could have been specified by a category attribute. And a more conceptually accurate way of portraying this data would have been to have the single top level item be company or something like that, and have the company item contain children inventory and about, but as I said, this example is very contrived. The point is, it's a hierarchy.
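To make the hierarchy concrete, here's a minimal sketch that parses a shortened, well formed copy of the inventory with Python's xml.etree.ElementTree and prints one indented line per element. The function name outline and the shortened data are assumptions made for illustration, not part of the original example.

```python
import xml.etree.ElementTree as ET

# A shortened, well formed copy of the contrived inventory (illustration only).
INVENTORY_XML = """
<inventory>
 <books>
  <b1001 title="Shark" price="24.95" onhand="4"/>
 </books>
 <trinkets>
  <t3001 name="Clock" price="11.95" onhand="43"/>
  <t3002 name="Statue" price="59.95" onhand="7"/>
 </trinkets>
 <about>Mark and Cleopatra enterprises is your
  premier source for both books and trinkets.
 </about>
</inventory>
""".strip()

def outline(element, depth=0):
    """Return one indented line per element, listing its attributes."""
    attrs = " ".join(f'{key}="{value}"' for key, value in element.attrib.items())
    lines = ["  " * depth + element.tag + (" " + attrs if attrs else "")]
    for child in element:  # iterating over an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(INVENTORY_XML)
print("\n".join(outline(root)))
```

Each level of indentation in the output is one level of the hierarchy: inventory at the top, the categories and about beneath it, and the individual items beneath their categories.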

Hierarchies are a very powerful way to express data. In the past there have been many databases constructed hierarchically instead of relationally. Last time I looked at MongoDB it was a hierarchical database. JSON and YAML are both hierarchical representations of data: They're more human readable but less configurable than XML. The Scalable Vector Graphics standard is XML, as are the native formats of LibreOffice (actually consists of several XML hierarchies) and Inkscape.

It's possible to apply specific rule sets, implemented as either DTDs or Schemas, to XML files, so as to limit permissible things that go in the XML file and therefore make it more powerful. However, DTDs and Schemas are beyond the scope of this document.

Similarities Between XML and HTML

Like HTML, XML has container-capable elements that have end tags. Actually, well-formed XML allows any element to be a container, although its Schema or DTD often prevents some elements from being containers. Like HTML attributes, XML attributes are information about an element. Each consists of a name and a value, written as the name, an equal sign, and the quoted value, as in the following example:

color="brown"

Note:

XML elements not containing other elements must either have an end tag or have a forward slash before their ending angle bracket. However, the HTML5 spec insists that any element capable of containing anything end with a closing tag, whether that element actually has children or not. This is explained in more detail later in this document.

What Is Well Formed XML?

This section is long and difficult. Take your time with it. You don't need to memorize it, just understand it. You can always bookmark this section for reference. Once you understand this section, you'll be a much more powerful HTML author.

Well formed XML is simply XML that conforms to XML syntax rules. The syntax is pretty simple:

  1. XML tags are case sensitive.
  2. Every tag must have a closing tag or a closing />.
  3. XML elements may be nested --- properly.
  4. Attribute values must be quoted.
  5. An XML document must have exactly one root element. This root element must directly or indirectly contain all other elements in the XML document.
  6. Anything between strings <!-- and a matching --> is a comment, regardless of line breaks, but...

The remainder of this section elaborates on these rules...

XML tags are case sensitive.

Wrong: <p>A paragraph</P>

Right: <P>A paragraph</P>

Also right: <p>A paragraph</p>

For the purposes of HTML my advice is to make all tags entirely lower case. This makes things much less confusing. Remember that later, when you get into CSS and Javascript, CSS and Javascript element names must case-match those in the actual HTML.
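An XML parser makes the case rule easy to see. The following is a minimal sketch using Python's xml.etree.ElementTree; the helper name is_well_formed is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The begin tag and end tag must match exactly, including case.
for snippet in ("<p>A paragraph</P>", "<P>A paragraph</P>", "<p>A paragraph</p>"):
    print(snippet, "->", is_well_formed(snippet))
```

Only the first snippet is rejected, because to XML the names p and P are two different elements.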

Every tag must have a closing tag or a closing />.

Wrong: <p>This is a paragraph.<hr>

Right: <p>This is a paragraph.</p><hr/>

Also right: <p>This is a paragraph.</p><hr></hr>

Say what???

In the HTML Primer I said that <hr/> cannot contain anything, and therefore must end in /> rather than having a closing tag. And yet the final correct example above used a closing tag! What's going on?

What's going on is the difference between XML syntax and the HTML5 specification. XML syntax says nothing at all about the meaning of <hr/> or whether it can contain other material. In terms of the XML syntax rules, "hr" could stand for "highway rating", which would presumably be just a number, or it could stand for "hunting retailers", which would presumably contain several elements each of which details a single hunting retailer. Or it could stand for "cars": The human meaning doesn't matter to XML syntax.

It's the HTML5 specification that declares <hr/> to be a non-container. The HTML5 specification is brought into the HTML file by the top line, which looks like <!doctype html>.
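As far as an XML parser is concerned, the two "right" spellings are the same thing. Here's a minimal sketch with Python's xml.etree.ElementTree. The body wrapper is an assumption added so that each snippet has a single root element.

```python
import xml.etree.ElementTree as ET

# "body" is a wrapper added for this sketch so each snippet has one root.
slash_form = ET.fromstring("<body><p>This is a paragraph.</p><hr/></body>")
end_tag_form = ET.fromstring("<body><p>This is a paragraph.</p><hr></hr></body>")

# Serializing both trees gives the same bytes: the parser sees no difference.
print(ET.tostring(slash_form) == ET.tostring(end_tag_form))

# Leaving <p> and <hr> unclosed is what the parser objects to.
try:
    ET.fromstring("<body><p>This is a paragraph.<hr></body>")
    unclosed_accepted = True
except ET.ParseError:
    unclosed_accepted = False
print(unclosed_accepted)
```

The parser knows nothing about what hr means. Whether hr is allowed to have a closing tag is a question for HTML5, not for XML syntax.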

XML elements may be nested --- properly.

Nested means one element contains one or more other elements. Elements may be nested, but don't have to be. If they are, they must be nested properly, meaning that their tags must not be interleaved:

Wrong: <p>This is a <em>paragraph.</p></em>: Interleaved tags.

Right: <p>This is a <em>paragraph.</em></p>: Proper nesting.

Right: <p>This is a <em>paragraph</em>.</p>: Proper nesting, and the sentence-ending period is not part of the emphasis.

The "wrong" example in the preceding examples interleaves its tags (begin-paragraph begin-em end-paragraph end-em). This violates XML grammar. Looking at it from a common sense viewpoint, does the paragraph contain the em, or does the em contain the paragraph? Whoops!

In the two "right" examples, there's no interleaving, and it's clear that the paragraph contains the em. By the way, em means "emphasis" in HTML, so everything contained within the em element is emphasized on a web page. What such emphasis actually looks like is determined by the browser and any CSS in or imported into the web page.

First in, last out!

One way I like to remember this is "First in, last out". If div's begin tag comes before p's begin tag, then div's end tag comes after p's end tag. If you're a developer, think of it this way: Nesting is always a stack, never a FIFO.

Referring to the preceding examples again, the final example removes the period (dot) from inside the emphasis and places it outside the emphasis. Typically, you want to emphasize words, not punctuation, although this is just an authoring standard, not a requirement for HTML5 or XML.
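For developers, the "first in, last out" rule can be sketched as a literal stack. The following toy checker is an illustration only, not a real XML parser: it ignores comments and assumes attribute values contain no angle brackets. The names TAG and properly_nested are made up for this sketch.

```python
import re

# Matches a begin tag, an end tag, or an empty-element tag such as <hr/>.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def properly_nested(text):
    """Return True if every end tag closes the most recently opened element."""
    stack = []
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                  # <hr/> opens and closes in one step
        if not closing:
            stack.append(name)        # begin tag: push
        elif not stack or stack.pop() != name:
            return False              # end tag doesn't match the last begin tag
    return not stack                  # anything left over was never closed

print(properly_nested("<p>This is a <em>paragraph.</p></em>"))
print(properly_nested("<p>This is a <em>paragraph.</em></p>"))
```

In the interleaved example the checker pops em when it meets the paragraph's end tag, finds the names don't match, and reports the problem.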

Attribute values must be quoted.

An attribute is a fact about an element. It is expressed as a key-value pair, where the key is the name of the attribute, and its value is the value. An attribute is not something contained in the element. This fact is further explained in the Element Attributes vs Element Child Elements section later on this page.

Now that the distinction is clear, let's talk about attributes:

Wrong: <p class=booktitle>

Right: <p class="booktitle">

Also right: <p class='booktitle'>

In the preceding examples, notice the following:

As mentioned previously, HTML5 doesn't require quotes around a single-word attribute value. But XML does require the quotes universally, and if you forget them, the XML parser driven HTML checker introduced in the Validating and Debugging HTML, CSS and Javascript page will fail because of the absence of these attribute value quotes.
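Here's a minimal sketch of how an XML parser treats the three attribute examples, using Python's xml.etree.ElementTree. The helper name read_attributes and the element text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def read_attributes(source):
    """Return the element's attributes as a dict, or None if parsing fails."""
    try:
        return ET.fromstring(source).attrib
    except ET.ParseError:
        return None

snippets = (
    "<p class=booktitle>text</p>",    # unquoted: rejected by XML
    '<p class="booktitle">text</p>',  # double quotes
    "<p class='booktitle'>text</p>",  # single quotes
)
for source in snippets:
    print(source, "->", read_attributes(source))
```

The two quoted forms produce the same key-value pair, so which quote character you use is a matter of taste. The unquoted form never gets as far as producing attributes at all.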

An XML document must have exactly one root element.

Wrong:

<myroot>
   <p>Steve was here,</p>
   <p>and now is gone,</p>
</myroot>
<otherroot>
   <p>but left his name,</p>
   <p>to carry on.</p>
</otherroot>

Right:

<oneandonlyroot>
   <p>Steve was here,</p>
   <p>and now is gone,</p>
   <p>but left his name,</p>
   <p>to carry on.</p>
</oneandonlyroot>

Basically, the root element must contain, directly or indirectly, all the other elements. In HTML that root element is <html></html>. Be aware that the <!doctype html> line at the top is not part of the XML hierarchy, and is not an element, so it can exist outside the root element (<html></html>).
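The single-root rule can be sketched with Python's xml.etree.ElementTree as follows. The helper name is_well_formed is an assumption. Note also that this sketch spells the doctype keyword in upper case, because an XML parser is case sensitive about the DOCTYPE keyword.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

two_roots = (
    "<myroot><p>Steve was here,</p></myroot>"
    "<otherroot><p>to carry on.</p></otherroot>"
)
one_root = "<!DOCTYPE html><html><body><p>Steve was here,</p></body></html>"

print(is_well_formed(two_roots))      # second root is rejected
print(is_well_formed(one_root))       # doctype line sits outside the root
print(ET.fromstring(one_root).tag)    # the root element itself
```

The doctype line doesn't count as a second root, because it isn't an element.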

Anything between strings <!-- and a matching --> is a comment, regardless of line breaks, but...

XML is a remarkably well designed, powerful, useful and robust technology. Except for comments. In XML it's best to use comments for commenting, not for "commenting out" sections of code, because comments have so many exceptions. I recommend any time you "comment out" XMLized HTML, you rerun a check for well-formedness.

This section gives some information as to comments in an XML environment...

Well formed: <p>XML is great</p>: Doesn't contain a comment.

Well formed: <!--<p>XML is great</p>-->: Comment complies with XML grammar.

NOT well formed: <!--<p><!--XML is great--></p>-->: XML doesn't allow nesting of comments.

NOT well formed: <!--<p>XML is great--></p>: Paragraph and comment are interleaved with comment starting first. Elements must be properly nested within comments.

NOT well formed: <p><!--XML is great</p>-->: Paragraph and comment are interleaved with paragraph starting first.

NOT well formed: <!--<p>XML is -- great</p>-->: Unbelievably, two consecutive hyphens inside a comment break XML syntax.

Well formed: <!--<p>XML is - - great</p>-->: Putting a space between the two hyphens inside the comment makes it well formed again.

NOT well formed: <!--<p>XML is great</p>--->: Ending a comment with three hyphens goes against XML grammar.

NOT well formed: <!--<p>XML is great</p>---->: Ending a comment with more than two hyphens goes against XML grammar.

Well formed: <!---<p>XML is great</p>-->: Paradoxically, a third hyphen at the start of the comment is allowed. A fourth is not, because it would put two consecutive hyphens inside the comment.

NOT well formed: <p><!--XML<!-- is great--></p>: Unmatched comments: Too many comment starts.

Well formed, but wrong: <p><!--XML --> is great--></p>: Unmatched comments: Too many comment ends. The comment stops at the first comment end, so the leftover " is great-->" is ordinary text that shows up on the page.

Stop the madness! Use comments only to comment, not to comment out. Check for XML well-formedness after inserting comments.
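If you'd rather see the comment rules than memorize them, here's a minimal sketch that runs several of the preceding cases through Python's xml.etree.ElementTree. Each snippet is wrapped in a page element, which is an assumption added because a comment by itself has no root element. The name comment_ok is also made up for this sketch.

```python
import xml.etree.ElementTree as ET

def comment_ok(snippet):
    """Wrap snippet in a root element and report whether it parses."""
    try:
        ET.fromstring("<page>" + snippet + "</page>")
        return True
    except ET.ParseError:
        return False

cases = (
    "<p>XML is great</p>",                   # no comment at all
    "<!--<p>XML is great</p>-->",            # plain comment
    "<!--<p><!--XML is great--></p>-->",     # nested comments
    "<!--<p>XML is -- great</p>-->",         # double hyphen inside
    "<!--<p>XML is - - great</p>-->",        # hyphens separated by a space
    "<!--<p>XML is great</p>--->",           # three hyphens at the end
    "<!---<p>XML is great</p>-->",           # three hyphens at the start
)
for snippet in cases:
    print("well formed" if comment_ok(snippet) else "NOT well formed", snippet)
```

Running something like this after "commenting out" a block is a quick way to catch a stray double hyphen before a browser or checker does.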

How To Test for XML Well-Formedness

You haven't yet read enough of Web Workmanship to read about validating, but if you want a sneak peek, see the material about xmlchecker.py.

You test your XMLized HTML file using an XML parser. I've created a Python 3 based XML checker using Python's xml.etree.ElementTree.

Testing for XML well-formedness is beyond the scope of this page, but rest assured, it's available in the Web Workmanship project. If you want, you can get a sneak peek at a nice, simple checker for XML well-formedness, but then come back here to finish this XML Primer.
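In the meantime, the general idea can be sketched in a few lines. This is not the xmlchecker.py from the Web Workmanship project, just a minimal illustration of the same approach with xml.etree.ElementTree. The function name first_problem and the sample page are assumptions.

```python
import xml.etree.ElementTree as ET

def first_problem(text):
    """Return None if text is well formed, else ((line, column), message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        # ParseError carries the location where the parser gave up.
        return error.position, str(error)
    return None

good_page = "<html><body><p>This is a paragraph.</p></body></html>"
bad_page = """<html>
<body>
<p>This is a <em>paragraph.</p></em>
</body>
</html>"""

print(first_problem(good_page))
print(first_problem(bad_page))
```

The parser stops at the first error it finds, so fix that one and run the check again. To check a file instead of a string, ET.parse(filename) raises the same ParseError.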

Where to Go From Here

You have a good grounding in HTML now, and understand XML enough to make your HTML also be well formed XML. You know that the way you apply appearances to HTML is via CSS. But you're not quite ready for CSS, because first you need to understand Content, Styles and Appearances.


[ Training | Troubleshooters.Com | Email Steve Litt ]