Troubleshooters.Com® and Web Workmanship Present:
XML Primer
Copyright © 2022 by Steve Litt
See the Troubleshooters.Com Bookstore.
XML stands for eXtensible Markup Language, and it's a clean and powerful technology. Without going into detail, it can be used to describe and/or extend almost any data structure, hence "eXtensible". The XML specification was first conceived in 1996, and became an official specification in 1998. Thus, it came out years after Tim Berners-Lee invented HTML.
This is a shame because I'm pretty sure if XML had pre-existed HTML, Tim Berners-Lee would have made HTML a dialect of XML, in which case authoring and parsing HTML would have been much easier, and browsers would be somewhat simpler. But that ship has sailed.
The HTML5 specification allows, but doesn't require, HTML to also be well formed XML. The term "well formed" will be defined later on this page. Making your HTML also be well formed XML gives you the following advantages:
The term "well formed XML" simply means conforming to XML syntax, which is incredibly simple. The term "valid XML" refers to well formed XML whose content also matches its DTD or Schema. Because HTML has no XML DTD or Schema, XML validity plays no part in HTML authoring and won't be mentioned again.
From here on it's assumed you'll be making your HTML also be well formed XML. The entirety of the Web Workmanship project is geared toward HTML that is well formed XML. Pages of the Web Workmanship project sometimes use the word "XMLizing" to refer to making your HTML also well formed XML.
Put simply, XML implements a hierarchy. There's one top level item (node, element, however you want to think about it), which has children, and each of those children can have children, as deep as you want to go. Each item can have as many attributes as desired, with each attribute being something about the item. See the following example, which implements a rudimentary inventory:
<inventory>
  <books>
    <b1002 title="Shark" price="24.95" onhand="4"/>
    <b1001 title="Shark" price="24.95" onhand="4"/>
  </books>
  <trinkets>
    <t3001 name="Clock" price="11.95" onhand="43"/>
    <t3002 name="Statue" price="59.95" onhand="7"/>
  </trinkets>
  <about>Mark and Cleopatra enterprises is your premier source for both
  books and trinkets. Our prices are the best and we feature same day
  fulfillment.</about>
</inventory>
In the preceding, which is a very contrived example, the top level item encompasses the entire inventory. Two of its children, books and trinkets, represent the two categories of inventory. Below each of those two are the inventory items that belong to that category. Each inventory item has attributes to describe the name or title, price, and current number of the inventory item on hand. The third child, about, has text for its child.
The preceding example could have been organized different ways. Just as one example, the category of each inventory item could have been specified by a category attribute. And a more conceptually accurate way of portraying this data would have been to have the single top level item be company or something like that, and have the company item contain children inventory and about, but as I said, this example is very contrived. The point is, it's a hierarchy.
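To make the hierarchy concrete, here is a minimal sketch that parses a trimmed copy of the inventory and prints one indented line per element. It uses Python's xml.etree.ElementTree; the helper name describe and the shortened about text are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the inventory example, with exactly one root element.
INVENTORY_XML = """
<inventory>
  <books>
    <b1001 title="Shark" price="24.95" onhand="4"/>
  </books>
  <trinkets>
    <t3001 name="Clock" price="11.95" onhand="43"/>
    <t3002 name="Statue" price="59.95" onhand="7"/>
  </trinkets>
  <about>Books and trinkets.</about>
</inventory>
"""

def describe(element, depth=0):
    """Return one indented line per element: its tag plus its attributes."""
    lines = ["  " * depth + element.tag + " " + str(dict(element.attrib))]
    for child in element:  # iterating an element yields its child elements
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(INVENTORY_XML)
outline = describe(root)
print("\n".join(outline))
```

The indentation of the printed outline mirrors the nesting: the root at the left margin, the categories one level in, and the individual items one level deeper.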
Hierarchies are a very powerful way to express data. In the past there have been many databases constructed hierarchically instead of relationally. Last time I looked at MongoDB it was a hierarchical database. JSON and YAML are both hierarchical representations of data: They're more human readable but less configurable than XML. The Scalable Vector Graphics standard is XML, as are the native formats of LibreOffice (which actually consists of several XML hierarchies) and Inkscape.
It's possible to apply specific rule sets, implemented as either DTDs or Schemas, to XML files, so as to limit permissible things that go in the XML file and therefore make it more powerful. However, DTDs and Schemas are beyond the scope of this document.
Like HTML, XML has both container-capable elements and non-container-capable elements, with the former having end tags, and the latter having a forward slash before the ending angle bracket. As in HTML, XML attributes are information about an element. Each attribute consists of a name and a value joined by an equal sign, with the value quoted, as in the following example:
color="brown"
Note:
XML actually lets you end any element not containing text or other elements with a slash before its closing angle bracket. However, the HTML5 spec insists that any element capable of containing anything end with a closing tag, whether that element actually has children or not.
This section is long and difficult. Take your time with it. You don't need to memorize it, just understand it. You can always bookmark this section for reference. Once you understand this section, you'll be a much more powerful HTML author.
Well formed XML is simply XML that conforms to XML syntax rules. The syntax is pretty simple:
The remainder of this section elaborates on these rules...
Wrong: <p>A paragraph</P>
Right: <P>A paragraph</P>
Also right: <p>A paragraph</p>
For the purposes of HTML my advice is to make all tags entirely lower case. This makes things much less confusing. Remember that later, when you get into CSS and JavaScript, CSS and JavaScript element names must case-match those in the actual HTML.
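The case-matching rule is easy to try out with an XML parser. The following is a minimal sketch using Python's xml.etree.ElementTree; is_well_formed is a hypothetical helper name chosen for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

mixed_case = is_well_formed("<p>A paragraph</P>")   # begin and end tags differ in case
upper_case = is_well_formed("<P>A paragraph</P>")   # both upper case
lower_case = is_well_formed("<p>A paragraph</p>")   # both lower case

print(mixed_case, upper_case, lower_case)
```

Only the mixed-case pair is rejected, because to an XML parser p and P are two different element names.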
Wrong: <p>This is a paragraph.<hr>
Right: <p>This is a paragraph.</p><hr/>
Also right: <p>This is a paragraph.</p><hr></hr>
Say what???
Earlier on this page I said that <hr/> cannot contain anything, and therefore must end in /> rather than having a closing tag. And yet the final correct example above used a closing tag! What's going on?
What's going on is the difference between XML syntax and the HTML5 specification. XML syntax says nothing at all about the meaning of <hr/> or whether it can contain other material. In terms of the XML syntax rules, "hr" could stand for "highway rating", which would presumably be just a number, or it could stand for "hunting retailers", which would presumably contain several elements each of which details a single hunting retailer. Or it could stand for "cars": The human meaning doesn't matter to XML syntax.
The specification that declares <hr/> to be a non-container is the HTML5 specification. The HTML5 specification is brought into the HTML file by its top line, which looks like <!doctype html>.
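As far as XML syntax alone is concerned, the slash form and the begin-tag/end-tag form of an empty element are equivalent. A minimal sketch with Python's xml.etree.ElementTree follows. The body wrapper element is an assumption added so that each snippet has a single root, and parse_or_none is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def parse_or_none(text):
    """Return the parsed root element, or None if text is not well formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

# Neither <p> nor <hr> is closed: rejected by the parser.
unclosed = parse_or_none("<body><p>This is a paragraph.<hr></body>")
# hr closed with a slash.
slash = parse_or_none("<body><p>This is a paragraph.</p><hr/></body>")
# hr closed with an end tag.
pair = parse_or_none("<body><p>This is a paragraph.</p><hr></hr></body>")

print(unclosed is None)
# Both closed forms produce the same tree, so they serialize identically.
print(ET.tostring(slash) == ET.tostring(pair))
```

The parser has no opinion about which closed form is preferable. That preference comes from the HTML5 specification, not from XML.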
Nested means one element contains one or more other elements. Elements may, but don't have to, be nested. If they are, they must be nested properly, meaning that their tags must not be interleaved:
Wrong: <p>This is a <em>paragraph.</p></em>: Interleaved tags.
Right: <p>This is a <em>paragraph.</em></p>: Proper nesting.
Right: <p>This is a <em>paragraph</em>.</p>: Proper nesting, and sentence-ending period is not part of the span.
The "wrong" example in the preceding examples interleaves tags (begin-paragraph begin-em end-paragraph end-em). This violates XML grammar. Looking at it from a common sense viewpoint, does the paragraph contain the em, or does the em contain the paragraph? Whoops!
In the two "right" examples, there's no interleaving, and it's clear that the paragraph contains the em. By the way, em means "emphasis" in HTML, so everything contained within the em element is emphasized on a web page. What such emphasis actually looks like is determined by the browser and any CSS in or imported into the web page.
First in, last out!
One way I like to remember this is "First in, last out". If div's begin tag comes before p's begin tag, then div's end tag comes after p's end tag. If you're a developer, think of it this way: Nesting is always a stack, never a FIFO.
Referring to the preceding examples again, the final example removes the period (dot) from inside the emphasis and places it outside the emphasis. Typically, you want to emphasize words, not punctuation, although this is just an authoring standard, not a requirement for HTML5 or XML.
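For developers, the "first in, last out" rule can be sketched as a literal stack. The following toy check is only an illustration and not a real XML parser: it looks at tag names alone, and the regular expression and the name nests_properly are assumptions made for this sketch.

```python
import re

# Matches a begin tag, an end tag, or a self-closed tag, capturing:
# an optional leading slash, the tag name, and an optional trailing slash.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")

def nests_properly(markup):
    """Toy check of begin/end tag order using a stack."""
    stack = []
    for closing, name, self_closing in TAG.findall(markup):
        if self_closing:
            continue                 # <hr/> opens and closes at once
        if not closing:
            stack.append(name)       # begin tag: push its name
        elif not stack or stack.pop() != name:
            return False             # end tag must match the newest open tag
    return not stack                 # every begin tag must have been closed

interleaved = nests_properly("<p>This is a <em>paragraph.</p></em>")
nested = nests_properly("<p>This is a <em>paragraph.</em></p>")
print(interleaved, nested)
```

In the interleaved example the end-paragraph tag arrives while em is on top of the stack, so the check fails at exactly the point where the tags cross.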
An attribute is a fact about an element. It is expressed as a key-value pair, where the key is the name of the attribute, and its value is the value. An attribute is not something contained in the element. This fact is further explained in the Element Attributes vs Element Child Elements section later on this page.
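A parser keeps attributes and child elements in separate places, which shows the distinction directly. This is a minimal sketch using Python's xml.etree.ElementTree; the book and chapter element names are hypothetical, made up for this illustration.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book title="Shark" price="24.95"><chapter>One</chapter></book>'
)

# Attributes are key-value facts about the element itself.
attributes = dict(book.attrib)
# Children are the elements contained inside it.
children = [child.tag for child in book]

print(attributes)
print(children)
```

Note that title shows up among the attributes but not among the children: searching the book element for a child named title finds nothing.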
Now that the distinction is clear, let's talk about attributes:
Wrong: <p class=booktitle>
Right: <p class="booktitle">
Also right: <p class='booktitle'>
In the preceding examples, notice the following:
Wrong:
<myroot> <p>Steve was here,</p> <p>and now is gone,</p> </myroot> <otherroot> <p>but left his name,</p> <p>to carry on.</p> </otherroot>
Right:
<oneandonlyroot> <p>Steve was here,</p> <p>and now is gone,</p> <p>but left his name,</p> <p>to carry on.</p> </oneandonlyroot>
Basically, the root element must contain all the other elements. In HTML that root element is <html></html>. Be aware that the <!doctype html> line at the top is not part of the XML hierarchy, and is not an element, so it can exist outside the root element (<html></html>).
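The single-root rule can be observed with a parser as well. The following minimal sketch uses Python's xml.etree.ElementTree; first_error is a hypothetical helper name, and the exact wording of the parser's error message is not something this sketch depends on.

```python
import xml.etree.ElementTree as ET

def first_error(text):
    """Return None if text is well formed, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

two_roots = (
    "<myroot><p>Steve was here,</p><p>and now is gone,</p></myroot>"
    "<otherroot><p>but left his name,</p><p>to carry on.</p></otherroot>"
)
one_root = (
    "<oneandonlyroot><p>Steve was here,</p><p>and now is gone,</p>"
    "<p>but left his name,</p><p>to carry on.</p></oneandonlyroot>"
)

print(first_error(two_roots))  # some complaint about content after the root
print(first_error(one_root))   # None
```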
XML is a remarkably well designed, powerful, useful and robust technology. Except for comments. In XML it's best to use comments for commenting, not for "commenting out" sections of code, because comments have so many exceptions. I recommend any time you "comment out" XMLized HTML, you rerun a check for well-formedness.
This section gives some information as to comments in an XML environment...
Well formed: <p>XML is great</p>: Doesn't contain a comment.
Well formed: <!--<p>XML is great</p>-->: Comment complies with XML grammar.
NOT well formed: <!--<p><!--XML is great--></p>-->: XML doesn't allow nesting of comments.
NOT well formed: <!--<p>XML is great--></p>: Paragraph and comment are interleaved with comment starting first. Elements must be properly nested within comments.
NOT well formed: <p><!--XML is great</p>-->: Paragraph and comment are interleaved with paragraph starting first.
NOT well formed: <!--<p>XML is -- great</p>-->: Unbelievably, two hyphens inside a comment breaks XML syntax.
Well formed: <!--<p>XML is - - great</p>-->: Putting a space between the two hyphens inside the comment makes it well formed again.
NOT well formed: <!--<p>XML is great</p>--->: Ending a comment with three hyphens goes against XML grammar.
NOT well formed: <!--<p>XML is great</p>---->: Ending a comment with more than two hyphens goes against XML grammar.
Well formed: <!---<p>XML is great</p>-->: Paradoxically, you can begin the comment with three hyphens. Four or more would put two adjacent hyphens inside the comment, which is not allowed.
NOT well formed: <p><!--XML<!-- is great--></p>: Unmatched comments: Too many comment starts.
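Well formed, but wrong: <p><!--XML --> is great--></p>: Unmatched comments: Too many comment ends. The comment stops after the first comment end, so the leftover " is great-->" is treated as ordinary paragraph text. XML syntax allows that, but it's almost certainly not what you meant.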
Stop the madness! Use comments only to comment, not to comment out. Check for XML well-formedness after inserting comments.
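Several of the comment cases above can be tried with a parser. This is a minimal sketch using Python's xml.etree.ElementTree. Because a document needs a root element, each fragment is wrapped in a root element, which is an assumption of this sketch, as is the helper name fragment_is_well_formed.

```python
import xml.etree.ElementTree as ET

def fragment_is_well_formed(fragment):
    """Wrap the fragment in <root>...</root> and report whether it parses."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False

examples = [
    "<!--<p>XML is great</p>-->",         # ordinary comment
    "<!--<p><!--XML is great--></p>-->",  # nested comment
    "<!--<p>XML is -- great</p>-->",      # two hyphens inside the comment
    "<!--<p>XML is - - great</p>-->",     # hyphens separated by a space
    "<!--<p>XML is great</p>--->",        # comment ended with three hyphens
]
results = [fragment_is_well_formed(text) for text in examples]

for text, ok in zip(examples, results):
    print(("well formed:     " if ok else "NOT well formed: ") + text)
```

Only the ordinary comment and the one with spaced hyphens get through, which matches the list above.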
You haven't yet read enough of Web Workmanship to read about validating, but if you want a sneak peek, see the material about xmlchecker.py.
You test your XMLized HTML file using an XML parser. I've created a Python 3 based XML checker using Python's xml.etree.ElementTree.
Testing for XML well-formedness is beyond the scope of this page, but rest assured, it's available in the Web Workmanship project. If you want, you can get a sneak peek at a nice, simple checker for XML well-formedness, but then come back here to finish this HTML Primer.
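For a rough idea of what such a check involves, here is a minimal sketch built on Python's xml.etree.ElementTree. It is not the xmlchecker.py from the Web Workmanship project, whose contents aren't shown on this page. The function names check_file and report are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def check_file(path):
    """Return (True, "") if the file is well formed XML, else (False, message)."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return False, str(err)
    return True, ""

def report(paths):
    """Return one human readable result line per file path."""
    lines = []
    for path in paths:
        ok, message = check_file(path)
        verdict = "well formed" if ok else "NOT well formed: " + message
        lines.append(path + ": " + verdict)
    return lines
```

A typical use would be to call report with a list of file names, for instance the arguments from the command line, and print each returned line.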