Troubleshooters.Com and eBook Tech Present

Authoring and Manipulating XML
with Python lxml

Career skills nobody else teaches


Introduction

By Steve Litt

Python's lxml is a spectacular way to programmatically manipulate XML. It installs via package on modern major Linux distros, it has a relatively easy installer on Windows, and modern OS X versions have lxml pre-installed. The web contains many spectacular documents about lxml, including the following:

I'm not going to try to recreate the preceding expertly crafted documents. Instead, this document will attempt to explain things that fall through the cracks on most web searches, such as how to insert a doctype into a document. Also, this document assumes you can work with Python, and that you have a basic knowledge of how XML works.

I started writing this document based on things I learned while creating my Xhtml to ePub book creator, which accepts a multichapter book in a single Xhtml file, and converts it to an ePub file. So naturally, this document has an Xhtml flavor, although most of what you read here applies to all XML.

Also, the lxml Python API works with just plain HTML and Xhtml. However, the emphasis of this document is well formed and valid XML. When you start departing from well formed and valid XML, you start depending on rendering applications to guess what you meant, and render what they think you meant. What could possibly go wrong? For this reason, the very next section is a trivial XML syntax checker to make sure the XML is well formed. Use it early and often.

Note

Here are the definitions of "well formed" and "valid":

XML Well-Formed Tester

By Steve Litt

Xhtml can be expressed as well formed XML, but it doesn't have to be. But you can take this to the bank: It should. Because if it's not, the software doing the rendering must guess what you "meant". That's never good.

This article gives you a trivial Python program that tests an XML file for being well formed. Call the program xmlchecker.py. Here it is:

#!/usr/bin/python
import sys
import re
import xml.etree.ElementTree as ET

fname = sys.argv[1]
print('\nTesting for well formedness {} ...\n'.format(fname))
try:
    tree = ET.parse(fname)
except ET.ParseError as err:    # "as" form works in Python 2.6+ and Python 3
    (line, col) = err.position
    code = str(err.code)
    errmsg = 'ERROR: {}'.format(str(err))
    print(errmsg)
    # An "undefined entity" error usually means an HTML-only entity such as &nbsp;
    if re.search('&nbsp;', errmsg):
        print('Replace all &nbsp; with &#160; to solve problem.')
else:
    print('Congrats, {} is well formed!'.format(fname))
print('')

There's an easy way and a hard way to use the preceding program. The easy way is to use it early and often, so that when it finds a mistake, you know to look in what you wrote or changed in the past few minutes. It takes about one second to run this program, and it's time well spent, because it can save you hours of troubleshooting later.

The hard way is to let hours or days go by before running it. This is hard for two reasons:

  1. It finds only one mistake at a time, so you need to keep re-running it.
  2. The error message will probably be "unmatched tag", and the line number might be very far from the real cause of the problem. For instance, in a well formed XML file, on line 30 I put a <p> instead of a <p/>. As you know, <p>, with no text or closing tag, is 1995 HTML, not well formed XML. Well, the checker told me there was an unmatched tag on line 315, the </body> tag. It had to wait that long to make sure I wouldn't put a </p> tag to close the opening tag on line 30. Please remember, this program checks well-formedness, not validity, so it has no way of knowing the Xhtml DTD says it's illegal for one paragraph to contain another one.
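The second point can be reproduced with a few lines. This is only a sketch: the sample document and the helper name first_error() are made up for illustration, and it uses the same xml.etree.ElementTree parser as xmlchecker.py. The unclosed <p> sits on line 3, yet the parser complains about the line holding </body>.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <p> on line 3 is never closed.
SAMPLE = """<html>
<body>
<p>
<div>more content</div>
</body>
</html>"""


def first_error(xml_text):
    """Return (line, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _col = err.position
        return line, str(err)
    return None


result = first_error(SAMPLE)
print(result)   # line number points at </body>, not at the stray <p>
```

Because parsing stops at the first error, fixing that one and re-running is the only way to find the next.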

lxml Elements and ElementTrees

By Steve Litt

This is a pretty advanced topic, so if you want to skip it, you can. But I want it right up near the top of this document because this section uncovers many lxml landmines that are hard to uncover with a web search. Many, perhaps most, websites discussing lxml don't explicitly and repeatedly point out the distinction between Element and ElementTree, and that can cause lots of grief if you're not intimately familiar with it. Consider the following program:

#!/usr/bin/python
from __future__ import print_function
import re
import copy
from lxml import etree
import sys

def make_skel_tree(root_org):
    root_tree_org = root_org.getroottree()        # Get whole tree
    new_root_tree = copy.deepcopy(root_tree_org)  # Deep copy whole tree
                                                  # INCLUDING doctype
    new_root = new_root_tree.getroot()            # Now work on the root element
    
    ### MANIPULATE ROOT ELEMENT IN VARIOUS WAYS
    xmlns = re.sub('}.*', '', new_root.tag)
    xmlns = re.sub('^{', '', xmlns) 
    bare_root_tag = re.sub('.*}', '', new_root.tag)
    head = new_root.find('{}{}{}head'.format('{', xmlns, '}'))
    body = new_root.find('{}{}{}body'.format('{', xmlns, '}'))
    for obj in list(body):      # Iterate over a snapshot while removing
        body.remove(obj)
    return(new_root_tree)   # Pass back WHOLE TREE, so doctype gets passed


doc = etree.parse(sys.argv[1])
root = doc.getroot()
new_tree = make_skel_tree(root)
new_tree.write(sys.stdout, pretty_print = True, xml_declaration=True,  encoding=new_tree.docinfo.encoding)

Notice the first line of the main program uses etree.parse(sys.argv[1]) to parse the file named in the program's first command line argument into an ElementTree object. Not an Element, but an ElementTree object. An ElementTree is an entire tree. An Element is one single element.

On the second line of the main routine, the doc.getroot() function returns the root Element to root. Always remember that you can convert back and forth between whole trees and the root element like this:
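As a minimal sketch of that round trip, the snippet below parses a small in-memory document (a made-up stand-in for a file on disk), goes from tree to root with getroot(), and back from root to tree with getroottree():

```python
import io
from lxml import etree

# Hypothetical in-memory document standing in for a file on disk.
XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>"""

tree = etree.parse(io.BytesIO(XHTML))   # ElementTree: the whole document
root = tree.getroot()                   # Element: just the root element
same_tree = root.getroottree()          # back to an ElementTree for the same document

print(type(tree).__name__, type(root).__name__)
print(same_tree.getroot() is root)      # both trees share the same root element
```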

Now let's look at function make_skel_tree(), whose job is to make a duplicate tree whose <head> section is identical to that of the original, but whose <body> section is empty. This is basically how you make chapter Xhtml files for an ePub book. It's not done in the most time, CPU or memory efficient manner: It would be much more efficient to build up a new one rather than duplicating the old one and then deleting its body, but this implementation is quick and dirty and simple and it illustrates quite a few points.

The first line of function make_skel_tree() uses root_tree_org = root_org.getroottree() to return the whole tree, as was previously discussed. The second line makes a completely new copy of the tree, using a deep copy. As you know, for mutable Python objects such as lists and dictionaries, mere assignment doesn't make a new copy. Assignment merely adds a second name for the same data, so a change made through either name shows up through the other. I've read in places that the lxml package makes deep copy unnecessary, but until I learn that for sure, I'll use copy.deepcopy() to make sure.
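The difference between a second name and a real copy is easy to see on a small lxml element. The element content here is invented for the demonstration:

```python
import copy
from lxml import etree

original = etree.fromstring("<body><p>one</p><p>two</p></body>")

alias = original                    # a second name for the SAME element
clone = copy.deepcopy(original)     # an independent copy of the whole subtree

clone.remove(clone[0])              # affects only the copy
alias.set("class", "changed")       # affects the original as well

print(len(original), len(clone))    # 2 1
print(original.get("class"))        # changed
```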

There's a much more important, much trickier, and much more obscure point made by the second line. Did you notice I copied the tree, and not just the root element? There's a vital reason for this. Contrary to how things are expressed by most lxml documentation, the root element is not the only thing at the top level of the tree. The <!DOCTYPE> declaration is also at the top, a sibling of the root element, preceding the root element. Always remember that, or you'll drive yourself crazy trying to put a <!DOCTYPE> into the root element, a task that's impossible because they're siblings.
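One way to watch the doctype stay behind is to serialize the root element and the whole tree separately. The sketch below uses a made-up Xhtml document; only the serialization of the tree carries the doctype, because the doctype lives beside the root element rather than inside it.

```python
import io
from lxml import etree

# Hypothetical document with an Xhtml Strict doctype.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>"""

tree = etree.parse(io.BytesIO(DOC))
root = tree.getroot()

from_root = etree.tostring(root)   # root element only
from_tree = etree.tostring(tree)   # whole document, top-level siblings included

print(b"<!DOCTYPE" in from_root)
print(b"<!DOCTYPE" in from_tree)
```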

The first two lines under the comment in function make_skel_tree() deduce the namespace from the tag of the root element. The way this works is that the root element's tag isn't just html, it's {http://www.w3.org/1999/xhtml}html. The {http://www.w3.org/1999/xhtml} is the namespace of the root element:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

In fact, the tag of every element in the tree has {http://www.w3.org/1999/xhtml} prepended to its name. This will doubtless confound you from time to time when trying to find an element of a certain type, or iterating through all children or all descendants and looking for specific names like head, body or h1. So when you're looking for something and you know it's there, always remember to search for the combination of the namespace and the name. Anyway, this prepended namespace is a good way to find the namespace for the whole tree: just use regex.
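Here is a small sketch of the bare-name trap and two ways around it. It swaps the regex for lxml's etree.QName helper to split the tag, which is an alternative to the approach in make_skel_tree(), not a description of it. The sample document and the prefix "x" are arbitrary.

```python
from lxml import etree

root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head/><body><h1>Hi</h1></body></html>'
)

qname = etree.QName(root.tag)       # splits '{namespace}name'
ns = qname.namespace                # 'http://www.w3.org/1999/xhtml'
bare = qname.localname              # 'html'

missing = root.find("body")                 # None: the bare name does not match
body = root.find("{%s}body" % ns)           # found: namespace plus name
also_body = root.find("x:body", namespaces={"x": ns})   # same search via a prefix map

print(missing, body.tag)
```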

The rest of the function is pretty self-explanatory: Deducing the bare tag name with regex, using find() on an element to find specific children, iterating through the body to remove all its children.
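Emptying an element deserves a short illustration of its own. The sketch below loops over list(body), a snapshot of the children, which is the cautious habit when removing during iteration. It also shows a side effect worth knowing: in lxml the text that follows a child (its tail) leaves along with the child, while the parent's own leading text stays. The sample content is invented.

```python
from lxml import etree

body = etree.fromstring("<body>intro<p>one</p>between<p>two</p>after</body>")

for child in list(body):     # iterate over a snapshot, not the live element
    body.remove(child)

print(len(body))             # no children left
print(body.text)             # the leading text is still there
```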

The final gotcha in this is the last line in the main routine:

new_tree.write(
    sys.stdout, 
    pretty_print = True, 
    xml_declaration=True,  
    encoding=new_tree.docinfo.encoding
    )

The only way I've found to put the XML declaration (the <?xml ... ?> line) at the top of the XML is when you write it out, by way of ElementTree.write(), passing xml_declaration=True and naming the encoding with the encoding= argument. In this case, the encoding comes from the tree's own ElementTree.docinfo.encoding property.
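The same call can be tried without touching the terminal by writing into an in-memory buffer. This is a sketch under assumptions: the source document is made up, and io.BytesIO stands in for a file opened in binary mode. Under Python 3, as far as I can tell, lxml hands write() bytes while sys.stdout expects text, so sys.stdout.buffer or a binary file is the safer target there; check that on your own setup.

```python
import io
from lxml import etree

# Hypothetical source document that declares its own encoding.
SOURCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>"""

tree = etree.parse(io.BytesIO(SOURCE))

out = io.BytesIO()                       # stands in for a binary output file
tree.write(out,
           pretty_print=True,
           xml_declaration=True,
           encoding=tree.docinfo.encoding)
serialized = out.getvalue()

print(serialized.decode(tree.docinfo.encoding))
```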

Which brings up the final lesson of this section: ElementTree.docinfo. This is not a dictionary but an lxml DocInfo object whose properties hold all sorts of information about the tree. Printing the object itself shows only what kind of object it is:

print(tree_root.docinfo)

To see the information, print the individual properties from within Python:

print('URL           ' +              str(tree_root.docinfo.URL))
print('doctype       ' +              str(tree_root.docinfo.doctype))
print('encoding      ' +              str(tree_root.docinfo.encoding))
print('externalDTD   ' +              str(tree_root.docinfo.externalDTD))
print('internalDTD   ' +              str(tree_root.docinfo.internalDTD))
print('public_id     ' +              str(tree_root.docinfo.public_id))
print('root_name     ' +              str(tree_root.docinfo.root_name))
print('standalone    ' +              str(tree_root.docinfo.standalone))
print('system_url    ' +              str(tree_root.docinfo.system_url))
print('xml_version   ' +              str(tree_root.docinfo.xml_version))

Building a Skeleton From Scratch

By Steve Litt

import copy
import re
from lxml import etree

def build_skel_tree(root_org):
    root_tree_org = root_org.getroottree()        # Get whole tree
    nsstring = re.sub('}.*', '}', root_org.tag)   # '{namespace}' prefix of root tag
    tagbase = root_org.tag[len(nsstring):]        # Bare root tag name, such as html
    xmlns = re.sub('[{}]', '', nsstring)          # Namespace without the braces
    org_head = root_org.find(nsstring + 'head')

    # Rebuild root attributes, mapping the XML namespace back to the xml: prefix
    attstring = ''
    for key in root_org.attrib:
        val = root_org.attrib[key]
        key = key.replace('{http://www.w3.org/XML/1998/namespace}', 'xml:')
        attstring = attstring + ' {}="{}"'.format(key, val)
    doctype = root_tree_org.docinfo.doctype
    xstr = '{}\n<{} xmlns="{}"{}/>'.format(doctype, tagbase, xmlns, attstring)
    root = etree.XML(xstr)                        # New root, doctype comes along
    # append() MOVES an element out of its old tree, so append a copy of the head
    root.append(copy.deepcopy(org_head))
    root.append(etree.Element(nsstring + 'body'))
    root_tree = root.getroottree()
    return(root_tree)

Inserting a Doctype in an Existing Document

By Steve Litt

Things get challenging if all you want to do is insert a doctype in an existing document, without deep copying and reducing, or building from scratch. One of the challenges is that info on the web is scarce and contradictory. So this article gives you a method I tech edited and know for a fact works.
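The author's method is not reproduced in this section. One possible approach, which may well differ from it, is to serialize the tree with the doctype= argument of etree.tostring() and parse the result again. It does re-parse the document, so it is a workaround rather than a true in-place insertion. The helper name add_doctype() and the doctype string are illustrative assumptions.

```python
from lxml import etree

# Hypothetical doctype string; substitute the one your document needs.
DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')


def add_doctype(tree, doctype=DOCTYPE):
    """Return a new ElementTree with the same content plus the given doctype."""
    raw = etree.tostring(tree, encoding="UTF-8",
                         xml_declaration=True, doctype=doctype)
    return etree.fromstring(raw).getroottree()


plain = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>'
).getroottree()
with_doctype = add_doctype(plain)

print(repr(plain.docinfo.doctype))
print(with_doctype.docinfo.doctype)
```

The original tree is left untouched; the caller gets back a new tree whose docinfo carries the doctype.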

Pretty Printing When Pretty Printing Doesn't Work

By Steve Litt

If you travel much around lxml, you'll find that sometimes various pretty_print functionalities don't work as advertised, but instead print long lines with tag after tag after tag. This section gives you the solution. First, let's examine the cause, which you can see at http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output.

Basically, pretty_print functions refuse to reformat if it involves changing spaces between tags, because those spaces might actually be part of the document's text content. However, if the parser involved has already removed blank text, the pretty_print function will then have the courage to add newlines and spaces between tags.
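The effect is easy to reproduce side by side. In this sketch the input string is made up; the only difference between the two runs is whether the parser was told to drop whitespace-only text between tags.

```python
from lxml import etree

# Hypothetical input with line breaks between some of the tags.
MESSY = b"<root>\n<a><b/>\n</a>\n</root>"

# Default parser: whitespace between tags is kept as text, so
# pretty_print leaves the layout alone.
plain = etree.tostring(etree.fromstring(MESSY), pretty_print=True)

# Parser that discards whitespace-only text between tags, which
# frees pretty_print to add its own newlines and indentation.
parser = etree.XMLParser(remove_blank_text=True)
clean = etree.tostring(etree.fromstring(MESSY, parser), pretty_print=True)

print(plain.decode())
print(clean.decode())
```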

So what you need to do is re-parse the tree using a parser that removes blank text, and pretty print the resulting string. Here's a generic function that takes an XML tree (ElementTree) and pretty prints it to a string:

def tree2prettystring(t, xml_declaration=True):
    st = etree.tostring(t)
    parser = etree.XMLParser(remove_blank_text=True)
    r = etree.XML(st, parser)
    t = r.getroottree()
    st = etree.tostring(t, pretty_print = True, xml_declaration=xml_declaration)
    return(st)

Is this inefficient with large XML trees? Of course. But unless your documents are *huge*, in these days when RAM is measured in gigabytes and processors are measured in gigahertz, is this really worth worrying about on your final printouts?