xml.etree.ElementTree vs. lxml.etree: different internal node representation?
I have been converting some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily there are a lot of similarities between the two. However, I stumbled upon some strange behaviour that I cannot find written down in any documentation. It concerns the internal representation of descendant nodes.
In ET, iter() is used to iterate over all descendants of an Element, optionally filtered by tag name. Because I could not find any details suggesting otherwise in the documentation, I expected the same behaviour from lxmlET. From my tests, however, I conclude that lxmlET uses a different internal representation of the tree.
In the example below, I iterate over the nodes in a tree and print each node's children, but in addition I also create all different combinations of those children and print those. This means that if an element has children ('A', 'B', 'C'), I create the combinations [('A',), ('A', 'B'), ('A', 'C'), ('B',), ('B', 'C'), ('C',)].
# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy

def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make the change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree
    return None

def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')
            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))
    return None
s = '''<root>
    <node_main>
        <node id="1">
            <node id="2" />
            <node id="3">
                <node id="4">
                    <node id="5" />
                </node>
                <node id="6" />
            </node>
        </node>
    </node_main>
</root>
'''

parse_xml(s)
The expected output here is the ids of the children of each node joined with a hyphen, and also all possible combinations of those children (cf. supra), in top-down document order:
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children
However, when you use the lxml module instead of xml (comment out the import for ET and uncomment the one for lxmlET) and run the code, you'll see that the output is:
2-3
2
3
node 2 has no children
So the deeper descendant nodes are never visited. This can be circumvented by either:

- using deepcopy when appending the children in get_combination_trees(), or
- replacing iter() with for subnode in node.xpath('.//node') in parse_xml().
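For anyone who wants to reproduce the difference in isolation, here is a minimal sketch of my own (not from either library's docs) that appends an existing child to a new element under both libraries:

```python
import xml.etree.ElementTree as ET
import lxml.etree as lET

for mod in (ET, lET):
    root = mod.fromstring('<root><child/></root>')
    other = mod.Element('other')
    other.append(root[0])  # re-parent the existing <child/> element
    # ET keeps <child/> under both parents; lxml moves it out of root
    print(mod.__name__, len(root), len(other))
```

With ET this prints counts of 1 and 1 (the child stays in both places); with lxml it prints 0 and 1 (the child was moved).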
So I know that there are ways around this, but I am mainly wondering: what is happening here? It took me ages to debug this, and I can't find any documentation on it. What is the actual underlying difference between the two modules? And what is the most efficient workaround when working with very large trees?
@CristiFati Unfortunately I'm afraid that my Python skills are not good enough to debug this further, nor do I have the time. Perhaps someone with knowledge of the topic can find an answer more quickly. Also, in my specific case I need the different combinations of the tree to derive some patterns from their XML attributes, but that is not relevant to this question. – Bram Vanroy, Jun 8 at 6:43
Here is what I think happens. In a compliant DOM API, any one node can only exist in exactly one spot. If you append it to another node, it is automatically removed from its previous parent. That's what lxml (correctly) does, and ET doesn't seem to do. If you need the same node in multiple spots in the same tree, the right thing is to use deepcopy to create node copies (compare the end of "Elements are lists" in the lxml docs). If you just need combinations and want optimal performance, work with lists of node references. – Tomalak, Jun 10 at 9:16
@Tomalak But why then does it behave differently when using xpath instead of iter (where a copy is not needed)? Does that work with node references, then? – Bram Vanroy, Jun 10 at 15:16
3 Answers
While Louis's answer is correct, and I completely agree that modifying a data structure as you traverse it is generally a Bad Idea (tm), you also asked why the code works with xml.etree.ElementTree and not lxml.etree, and there is a very reasonable explanation for that.

.append in xml.etree.ElementTree
This library is implemented directly in Python and can vary depending on which Python runtime you're using. Assuming CPython, the implementation you're looking for is written in vanilla Python:
def append(self, subelement):
    """Add *subelement* to the end of this element.

    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)
The last line is the only part we're concerned with. As it turns out, self._children is initialized towards the top of that file as:

self._children = []

So adding a child to a tree is just appending an element to a list. Intuitively, that's exactly what you're looking for (in this case), and the implementation behaves in a completely unsurprising way.
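As a quick illustration of that list-based behaviour (my own sketch, not from the stdlib docs): since ET only appends a reference to a plain Python list, the very same Element object can end up under two parents at once:

```python
import xml.etree.ElementTree as ET

parent_a = ET.Element('a')
parent_b = ET.Element('b')
child = ET.Element('c')

parent_a.append(child)
parent_b.append(child)  # no unlinking: both parents now hold the same object

print(parent_a[0] is parent_b[0])  # True
```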
.append in lxml.etree

lxml is implemented as a mix of Python, non-trivial Cython, and C code, so spelunking through it was significantly harder than with the pure-Python implementation. First off, .append is implemented as:
def append(self, _Element element not None):
    u"""append(self, element)

    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)
_appendChild is implemented over in apihelper.pxi:
cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodeToDocument(parent._doc, c_source_doc, c_node)
    return 0
There's definitely a bit more going on here. In particular, lxml explicitly removes the node from its current tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you probably could do with the xml.etree version).
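To see the move (and the cycle protection) from the Python side, a small sketch of mine, assuming lxml is installed:

```python
import lxml.etree as ET

root = ET.fromstring('<root><a><b/></a></root>')
b = root[0][0]

new_parent = ET.Element('new')
new_parent.append(b)  # xmlUnlinkNode detaches <b/> from <a/> first

print(b.getparent().tag)  # new
print(ET.tostring(root))  # b'<root><a/></root>'

# and the cycle check kicks in if we try to build a loop:
try:
    b.append(new_parent)
except ValueError as e:
    print(e)  # cannot append parent to itself
```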
Now that we know that xml.etree keeps a reference to the appended node while lxml.etree moves it, why do the workarounds work? Based on the tree.xmlUnlinkNode function (which is actually defined in C inside libxml2), unlinking just messes with a bunch of pointers. So anything that copies the node metadata will do the trick. Because all of the metadata we care about consists of direct fields on the xmlNode struct, anything that shallow copies the nodes appended to new_combo_tree (copy.deepcopy() as well as copy.copy()) will work.
If you're really concerned about performance and large trees, I'd probably start with shallow copying via copy.copy(), although you should absolutely profile a few different options and see which one works best for you.
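One caveat worth noting (my own observation; verify against your lxml version): lxml wires copy.copy() up to the same subtree-copying machinery as copy.deepcopy(), so even the "shallow" copy brings the children along and leaves the original tree untouched:

```python
import copy
import lxml.etree as ET

root = ET.fromstring('<root><a><b/></a></root>')
a = root[0]

a2 = copy.copy(a)  # lxml's __copy__ duplicates the whole subtree

print(a2 is a)            # False
print(len(a2))            # 1 -- the <b/> child came along
print(a2.getparent())     # None -- the copy starts out detached
print(a.getparent().tag)  # root -- the original stays in place
```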
I'm assuming copy.copy() is less memory-heavy than copy.deepcopy(), so just to be sure: are you certain that this maintains the information that I need? What exactly does it do? Does copy() just keep a pointer to the original node, whereas deepcopy() makes an actual copy? – Bram Vanroy, Jun 10 at 19:56
lxml's proxies are not special to the .xpath() method. lxml creates proxy elements for all elements that it exposes, not just elements returned from an XPath evaluation. Just put a breakpoint on _elementFactory and run ET.fromstring("<doc/>") without any other code. You'll see the factory being called once, even though XPath is not involved at all. Moreover, the proxies do not do shallow copies; that's not their job. If they did, then calling .iter() in the OP's code would work just as well as calling .xpath(). – Louis, Jun 11 at 11:13
@BramVanroy Using deepcopy() or copy() boils down to the same work being done behind the scenes, because lxml defines a __deepcopy__ method that calls __copy__. See that bit of the source. Using copy is slightly faster because you skip a call. – Louis, Jun 11 at 11:22
In general, the safe thing to do when you are manipulating an XML tree and want to copy information to multiple places in the tree (as opposed to moving information from one place to another) is to perform a deep copy of those elements rather than just add them to their new location. The vast majority of XML parsing libraries that produce trees require you to perform a deep copy if you want to copy structures around; they just won't give you the results you want if you do not. lxml is one such library.

The fact that xml.etree.ElementTree works in a way such that .append effectively allows you to have the same element in two places in the tree is definitely unusual in my experience.
You mentioned that for subnode in node.xpath('.//node') also solves your problem. Note that if you use for subnode in list(node.iter('node')), you'll get the same result. What is going on here is that using list(node.iter('node')) or node.xpath('.//node'), or using deepcopy to copy the nodes instead of moving them, protects you against another problem with your code: you are walking a structure while modifying it.
node.iter('node') creates an iterator that goes over the XML structure as you iterate it. If you wrap it in list(), the structure is walked immediately and the results are put in a list, so you've effectively taken a snapshot of the structure before you walk it. That prevents your walking operation from being affected by changes to the tree. If you use node.xpath('.//node'), you also obtain a snapshot of the tree before you walk it, because that method returns a list of nodes. And if you take a deepcopy of each node and append the copy instead of the original, then you are not modifying the tree you are walking while you are walking it.
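A minimal sketch of the snapshot idea (my own, with lxml assumed installed): materializing the iterator up front means later moves cannot derail the traversal:

```python
import lxml.etree as ET

root = ET.fromstring('<root><a/><b/></root>')
moved = ET.Element('moved')

# list() walks the tree immediately, so the loop below iterates over a
# snapshot rather than the live (and shrinking) tree.
for el in list(root):
    moved.append(el)  # each append moves the element out of root

print(len(root), len(moved))  # 0 2
```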
Whether you can get away with list(node.iter('node')) or node.xpath('.//node') instead of deepcopy depends on what you plan to do with your combinations. The code in your question prints each combination to the screen as soon as you create it. They look fine when printed, but if you do not use a deepcopy to create them, then as soon as you create a new combination, the old one gets messed up: any node that appeared in the old combination and needs to appear in the new one is moved instead of copied.
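This effect can be seen with two "combinations" built back to back (a sketch of mine, assuming lxml): without deepcopy, building the second one silently empties the first.

```python
from copy import deepcopy
import lxml.etree as ET

root = ET.fromstring('<root><a/><b/></root>')
a = root[0]

first = ET.Element('combo')
first.append(a)   # moves <a/> into first

second = ET.Element('combo')
second.append(a)  # moves <a/> again: first is now empty
print(len(first), len(second))  # 0 1

# with a deep copy, every combination keeps its own node
third = ET.Element('combo')
third.append(deepcopy(a))
print(len(second), len(third))  # 1 1
```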
And what is the most efficient work-around when working with very large trees?
It depends on the specifics of your application and the data you need to parse. You gave one example which is a small document but you ask about "large trees". What applies to small documents does not necessarily transfer to large documents. You can optimize for case X but if case X happens only extremely rarely in real data, then your optimization may not pan out. In some cases, it may actually be harmful.
In one application of mine, I had to replace references to some structures with the structures themselves. A simplified illustration would be a document that contains definitions like <define id="...">...</define> and references like <ref idref="..."/>. Every instance of ref would have to be replaced with the define it points to. In general, this may mean copying a single define multiple times, but sometimes a define may be referred to by only one ref, so one optimization was to detect this and skip the deep copy in those cases. I got this optimization "for free" because the application already required recording each instance of ref and define for other purposes. If I had had to add bookkeeping just for this optimization, it is not clear it would have been worth it.
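That replacement pattern can be sketched like this (my own illustration; the element and attribute names are invented, and lxml is assumed for getparent()/replace()):

```python
import copy
import lxml.etree as ET

doc = ET.fromstring(
    '<doc>'
    '<define id="d1"><x/></define>'
    '<ref idref="d1"/><ref idref="d1"/>'
    '</doc>')

defines = {d.get('id'): d for d in doc.iter('define')}

# snapshot the refs before rewriting the tree we are walking
for ref in list(doc.iter('ref')):
    target = defines[ref.get('idref')]
    # deep copy so every <ref/> gets its own expansion
    ref.getparent().replace(ref, copy.deepcopy(target))

print([el.tag for el in doc])  # ['define', 'define', 'define']
```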
At the beginning I didn't think there was such a difference (nor did I check), but both @supersam654's and @Louis's answers pinpoint it very clearly.
But code that depends on the internal representation (rather than the interface) of the stuff that it uses doesn't seem right to me from a design PoV. Also, as I was asking in my comment, combo_children seemed totally useless, when things could easily be done by yielding the combinations directly (as below). Apparently, though, the combo_children approach was also what exposed the behavioral difference between the modules.
code_orig_lxml.py:
import lxml.etree as ET
#import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy

def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            #new_combo_tree = ET.Element(tree.tag, tree.attrib)
            #for recombined_child in combination:
                #new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            #yield new_combo_tree
            yield combination
    return None

def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')
            #for combo_tree in get_combination_trees(subnode):
            for combo_children in get_combination_trees(subnode):
                #combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))
    return None
s = '''<root>
    <node_main>
        <node id="1">
            <node id="2" />
            <node id="3">
                <node id="4">
                    <node id="5" />
                </node>
                <node id="6" />
            </node>
        </node>
    </node_main>
</root>
'''

parse_xml(s)
Output:
(py36x86_test) e:\Work\Dev\StackOverflow\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code_orig_lxml.py
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children
While I was investigating, I modified your code a bit further:
xml_data.py:
DATA = """<root>
    <node_main>
        <node id="1">
            <node id="2" />
            <node id="3">
                <node id="4">
                    <node id="5" />
                </node>
                <node id="6" />
            </node>
        </node>
    </node_main>
</root>
"""
code.py:
import sys
import xml.etree.ElementTree as xml_etree_et
import lxml.etree as lxml_etree
from itertools import combinations
from xml_data import DATA
MAIN_NODE_NAME = "node_main"
def get_children_combinations(tree):
children = list(tree)
for i in range(1, len(children)):
yield from combinations(children, i)
def get_tree(xml_str, parse_func, tag=None):
root_node = parse_func(xml_str)
if tag:
return [item for item in root_node if item.tag == tag]
return [root_node]
def process_xml(xml_node):
for node in xml_node.iter("node"):
print(f"nNode ({node.tag}, {node.attrib['id']})")
children = list(node)
if children:
print(" Children: " + " - ".join([child.attrib["id"] for child in children]))
for children_combo in get_children_combinations(node):
if children_combo:
print(" Combo: " + " - ".join([child.attrib["id"] for child in children_combo]))
def main():
parse_funcs = (xml_etree_et.fromstring, lxml_etree.fromstring)
for func in parse_funcs:
print(f"nParsing xml using: {func.__module__} {func.__name__}")
nodes = get_tree(DATA, func, tag=MAIN_NODE_NAME)
for node in nodes:
print(f"nProcessing node: {node.tag}")
process_xml(node)
if __name__ == "__main__":
print("Python {:s} on {:s}n".format(sys.version, sys.platform))
main()
Output:
(py36x86_test) e:\Work\Dev\StackOverflow\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code.py
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32
Parsing xml using: xml.etree.ElementTree XML
Processing node: node_main
Node (node, 1)
Children: 2 - 3
Combo: 2
Combo: 3
Node (node, 2)
Node (node, 3)
Children: 4 - 6
Combo: 4
Combo: 6
Node (node, 4)
Children: 5
Node (node, 5)
Node (node, 6)
Parsing xml using: lxml.etree fromstring
Processing node: node_main
Node (node, 1)
Children: 2 - 3
Combo: 2
Combo: 3
Node (node, 2)
Node (node, 3)
Children: 4 - 6
Combo: 4
Combo: 6
Node (node, 4)
Children: 5
Node (node, 5)
Node (node, 6)
Both modules have their code publicly available (but I see that lxml is partly written in C). Not related to the question: why do you create the new_combo_tree element? Note that the behavioral difference doesn't necessarily imply a difference in internal representation. – CristiFati, Jun 8 at 6:37