When I first needed to parse XML in Python, I reached for BeautifulSoup out of habit. Then I discovered that Python ships with a perfectly capable XML parser in the standard library: xml.etree.ElementTree. No pip install required.
Here's everything I've learned about it.
Why ElementTree?
ElementTree provides a simple, Pythonic API for XML. It's:
- Built-in: No external dependencies
- Memory efficient: Parses into a tree structure, not a full DOM
- Easy to use: Elements behave like lists, attributes like dicts
import xml.etree.ElementTree as ET
That import is your starting point for everything that follows.
Parsing XML
From a String
import xml.etree.ElementTree as ET
xml_string = """
<library>
    <book isbn="978-0-13-468599-1">
        <title>The Pragmatic Programmer</title>
        <author>David Thomas</author>
        <year>2019</year>
    </book>
    <book isbn="978-0-596-00712-6">
        <title>Head First Design Patterns</title>
        <author>Eric Freeman</author>
        <year>2004</year>
    </book>
</library>
"""
root = ET.fromstring(xml_string)
print(root.tag)  # library
The fromstring() function returns the root element directly.
From a File
import xml.etree.ElementTree as ET
# Parse returns an ElementTree object
tree = ET.parse('library.xml')
root = tree.getroot()
# Now work with root as usual
for book in root:
    print(book.get('isbn'))
The difference: parse() returns an ElementTree object (which wraps the root), while fromstring() returns an Element directly. If you need to write back to a file later, keep the tree around.
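Here is a small sketch of that difference. The file name and the two-book document are made up for the demo, and the file lives in a temporary directory so nothing is left behind.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = "<library><book isbn='1'/><book isbn='2'/></library>"

# fromstring() hands back the root Element directly.
root_from_string = ET.fromstring(xml_text)

# parse() reads from a file, so write a temporary one first.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "library.xml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)

    tree = ET.parse(path)            # ElementTree wrapper
    root_from_file = tree.getroot()  # same kind of Element as above

    # Keeping the tree around makes writing back a one-liner.
    root_from_file[0].set("isbn", "1-updated")
    tree.write(path, encoding="utf-8", xml_declaration=True)
    reloaded = ET.parse(path).getroot()

print(type(root_from_string).__name__)  # Element
print(type(tree).__name__)              # ElementTree
print(reloaded[0].get("isbn"))          # 1-updated
```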
Iterative Parsing for Large Files
For huge XML files, parsing everything into memory won't work. Use iterparse():
import xml.etree.ElementTree as ET
# Process elements as they're parsed
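for event, elem in ET.iterparse('huge_file.xml', events=['end']):
    if elem.tag == 'book':
        print(elem.find('title').text)
        elem.clear()  # Free memory after processing
The clear() call is crucial: without it, every parsed element stays in memory and you end up building the full tree anyway. Even then, the emptied elements stay attached to the root, so for very large files it helps to clear the root as well.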
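One common way to keep the root from accumulating processed elements is to grab it from the first 'start' event and clear it as you go. This is a sketch rather than the one true pattern; the in-memory `data` buffer is a stand-in I made up for a genuinely large file.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for a huge file (an assumption for the demo).
data = io.BytesIO(
    b"<library>"
    b"<book><title>A</title></book>"
    b"<book><title>B</title></book>"
    b"<book><title>C</title></book>"
    b"</library>"
)

titles = []
context = ET.iterparse(data, events=("start", "end"))
_, root = next(context)  # the first 'start' event carries the root element

for event, elem in context:
    if event == "end" and elem.tag == "book":
        titles.append(elem.findtext("title"))
        # Drop everything collected under the root so far.
        root.clear()

print(titles)     # ['A', 'B', 'C']
print(len(root))  # 0
```

Note that root.clear() also wipes the root's own attributes and text, so read anything you need from the root before the loop.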
Navigating the Tree
Once you have the root element, you need to find things in it.
Direct Children
Elements are iterable. Loop over them to get direct children:
root = ET.fromstring(xml_string)
for child in root:
    print(f"{child.tag}: {child.attrib}")
# book: {'isbn': '978-0-13-468599-1'}
# book: {'isbn': '978-0-596-00712-6'}
find() - First Match
Returns the first matching element, or None:
# Find first book
first_book = root.find('book')
print(first_book.find('title').text) # The Pragmatic Programmer
# Careful with None!
missing = root.find('magazine')
print(missing) # None
# This will crash:
# missing.find('title')  # AttributeError
Always check for None when using find().
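Two ways to stay safe, sketched on a tiny made-up document: an explicit None check, or findtext() with a path and a default, which avoids the intermediate element altogether.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library><book><title>The Pragmatic Programmer</title></book></library>"
)

# Option 1: explicit None check before chaining.
magazine = root.find("magazine")
magazine_title = magazine.findtext("title") if magazine is not None else None

# Option 2: findtext() with a path and a default skips the intermediate element.
book_title = root.findtext("book/title", default="(unknown)")
missing_title = root.findtext("magazine/title", default="(unknown)")

print(magazine_title)  # None
print(book_title)      # The Pragmatic Programmer
print(missing_title)   # (unknown)
```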
findall() - All Matches
Returns a list of all matching elements:
books = root.findall('book')
print(len(books)) # 2
for book in books:
    title = book.find('title').text
    year = book.find('year').text
    print(f"{title} ({year})")
iter() - All Descendants
Recursively iterate through all descendants with a specific tag:
# All title elements, anywhere in the tree
for title in root.iter('title'):
    print(title.text)
# All elements, period
for elem in root.iter():
    print(f"{elem.tag}: {elem.text}")
iter() is great when you don't care about tree structure and just want every matching element.
iterfind() - Lazy findall()
Like findall(), but returns an iterator instead of a list:
# Memory-efficient for large results
for book in root.iterfind('book'):
    process(book)  # process() is a placeholder for your own handler
Element Properties
Every element has these:
book = root.find('book')
# Tag name
book.tag # 'book'
# Text content
book.find('title').text # 'The Pragmatic Programmer'
# Tail (text after closing tag)
book.tail # Usually whitespace
# Attributes (dict-like)
book.attrib # {'isbn': '978-0-13-468599-1'}
book.get('isbn') # '978-0-13-468599-1'
book.get('missing', 'default')  # 'default'
Text vs Tail
This confused me at first:
<p>Hello <b>world</b>!</p>
p = ET.fromstring('<p>Hello <b>world</b>!</p>')
print(p.text)  # 'Hello '
b = p.find('b')
print(b.text)  # 'world'
print(b.tail)  # '!'
text is the content before the first child. tail is the content after the element's closing tag, up to the next sibling or the parent's closing tag.
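If all you want is the full text, here is a sketch of two ways to stitch it back together. The manual version only handles one level of nesting; itertext() handles any depth.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>world</b>!</p>")

# Manual reconstruction: parent's text, then each child's text and tail.
pieces = [p.text or ""]
for child in p:
    pieces.append(child.text or "")
    pieces.append(child.tail or "")
manual = "".join(pieces)

# itertext() yields text and tail strings in document order, at any depth.
flattened = "".join(p.itertext())

print(manual)     # Hello world!
print(flattened)  # Hello world!
```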
XPath Basics
ElementTree supports a subset of XPath. It's powerful enough for most tasks.
Path Syntax
# Direct child
root.find('book')
# Any descendant
root.find('.//title') # Title anywhere below root
# Specific path
root.find('./book/title')
# Current element (rarely needed)
root.find('.')
Attribute Predicates
# Find by attribute value
root.find(".//book[@isbn='978-0-13-468599-1']")
# Element with specific attribute (any value)
root.find(".//book[@isbn]")
# Element WITHOUT an attribute
root.findall(".//book[not(@isbn)]")  # Not supported! Raises SyntaxError
That last one doesn't work: ElementTree's XPath is limited and has no not() function.
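A plain Python filter covers the missing not() case. This is a sketch on a made-up two-book document, not a general XPath replacement.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book isbn='111'><title>With ISBN</title></book>"
    "<book><title>Without ISBN</title></book>"
    "</library>"
)

# Select every <book>, then keep the ones lacking the attribute.
books_without_isbn = [
    book for book in root.iter("book") if "isbn" not in book.attrib
]

print([book.findtext("title") for book in books_without_isbn])
# ['Without ISBN']
```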
Position Predicates
# First book (1-indexed!)
root.find('.//book[1]')
# Last book
root.find('.//book[last()]')
# Second to last
root.find('.//book[last()-1]')
Text Predicates
# Book with specific title text
root.find(".//book[title='The Pragmatic Programmer']")
# Book from a specific year
root.find(".//book[year='2019']")
What's NOT Supported
ElementTree's XPath is a subset. These don't work:
- Axes like ancestor:: and following-sibling::
- Functions like contains() and starts-with()
- Boolean operators in predicates
- Arithmetic expressions
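When the gap is small, a few lines of Python can stand in for a missing function. The helper below (my own name, not part of ElementTree) roughly approximates contains() on a made-up document.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book><title>Fluent Python</title></book>"
    "<book><title>Head First Design Patterns</title></book>"
    "<book><title>Python Cookbook</title></book>"
    "</library>"
)

def books_with_title_containing(parent, needle):
    """Rough stand-in for .//book[contains(title, needle)]."""
    return [
        book
        for book in parent.iter("book")
        if needle in (book.findtext("title") or "")
    ]

matches = books_with_title_containing(root, "Python")
print([book.findtext("title") for book in matches])
# ['Fluent Python', 'Python Cookbook']
```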
For full XPath, use lxml:
from lxml import etree
root = etree.fromstring(xml_string)
# Now full XPath works
root.xpath(".//book[contains(title, 'Python')]")
Creating XML Documents
Building from Scratch
import xml.etree.ElementTree as ET
# Create root
root = ET.Element('library')
# Add a book
book = ET.SubElement(root, 'book')
book.set('isbn', '978-0-13-468599-1')
# Add book children
title = ET.SubElement(book, 'title')
title.text = 'The Pragmatic Programmer'
author = ET.SubElement(book, 'author')
author.text = 'David Thomas'
# Convert to string
xml_str = ET.tostring(root, encoding='unicode')
print(xml_str)
# <library><book isbn="978-0-13-468599-1"><title>The Pragmatic Programmer</title><author>David Thomas</author></book></library>
Pretty Printing (Python 3.9+)
That output is ugly. Fix it:
ET.indent(root, space="  ")
xml_str = ET.tostring(root, encoding='unicode')
print(xml_str)
Output:
<library>
  <book isbn="978-0-13-468599-1">
    <title>The Pragmatic Programmer</title>
    <author>David Thomas</author>
  </book>
</library>
Before Python 3.9, you'd need minidom or a helper function.
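For older Pythons, a minidom round trip is one option. The prettify() helper is my own name, and this sketch assumes the tree has no indentation whitespace yet (otherwise minidom adds blank lines).

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

root = ET.Element("library")
book = ET.SubElement(root, "book", isbn="978-0-13-468599-1")
ET.SubElement(book, "title").text = "The Pragmatic Programmer"

def prettify(element, indent="  "):
    """Serialize with ElementTree, then re-indent with minidom."""
    raw = ET.tostring(element, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent=indent)

pretty = prettify(root)
print(pretty)
```

Unlike ET.indent(), this returns a new string with an XML declaration on the first line and leaves the original tree untouched.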
Writing to File
tree = ET.ElementTree(root)
# Basic write
tree.write('output.xml')
# With XML declaration and encoding
tree.write(
    'output.xml',
    encoding='utf-8',
    xml_declaration=True
)
The file will start with <?xml version='1.0' encoding='utf-8'?>.
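To check the declaration without touching the disk, you can write into a BytesIO buffer. This is a sketch: the buffer stands in for a real file opened in binary mode.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("library")
ET.SubElement(root, "book", isbn="123")

buffer = io.BytesIO()  # stands in for a real file opened in binary mode
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)

output = buffer.getvalue().decode("utf-8")
print(output.splitlines()[0])  # <?xml version='1.0' encoding='utf-8'?>

# The written bytes parse back into an equivalent tree.
round_tripped = ET.fromstring(buffer.getvalue())
print(round_tripped.find("book").get("isbn"))  # 123
```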
Generating from Data
Real-world use case—turning a list of dicts into XML:
def books_to_xml(books):
    root = ET.Element('library')
    for book_data in books:
        book = ET.SubElement(root, 'book')
        book.set('isbn', book_data['isbn'])
        for field in ['title', 'author', 'year']:
            if field in book_data:
                elem = ET.SubElement(book, field)
                elem.text = str(book_data[field])
    return root
books = [
    {'isbn': '123', 'title': 'Book One', 'author': 'Alice', 'year': 2020},
    {'isbn': '456', 'title': 'Book Two', 'author': 'Bob', 'year': 2021},
]
root = books_to_xml(books)
ET.indent(root)
print(ET.tostring(root, encoding='unicode'))
Modifying Existing XML
Changing Text and Attributes
root = ET.fromstring(xml_string)
# Update text
for title in root.iter('title'):
    title.text = title.text.upper()
# Update attributes
for book in root.findall('book'):
    book.set('updated', '2024-01-15')
# Remove attribute
for book in root.findall('book'):
    if 'updated' in book.attrib:
        del book.attrib['updated']
Adding Elements
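import copy
# Add as last child
new_book = ET.SubElement(root, 'book')
new_book.set('isbn', '789')
# Insert at specific position (build it with Element; SubElement would also append it)
first_book = ET.Element('book')
first_book.set('isbn', '000')  # placeholder ISBN
root.insert(0, first_book)  # Insert at beginning
# Copy an element
book_copy = copy.deepcopy(root.find('book'))
root.append(book_copy)
Removing Elements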
# Remove specific element
# remove() only works on direct children of the element you call it on
for book in root.findall(".//book[@isbn='978-0-596-00712-6']"):
    root.remove(book)
# Remove all books from 2004
for book in root.findall('.//book'):
    year = book.find('year')
    if year is not None and year.text == '2004':
        root.remove(book)
Gotcha: Don't modify a list while iterating over it!
# WRONG - will skip elements
for book in root:
    if some_condition(book):
        root.remove(book)
# RIGHT - iterate over a copy
for book in list(root):
    if some_condition(book):
        root.remove(book)
# OR - collect then remove
to_remove = [b for b in root if some_condition(b)]
for book in to_remove:
    root.remove(book)
Replacing Elements
old_book = root.find(".//book[@isbn='123']")
new_book = ET.Element('book')
new_book.set('isbn', '123-new')
ET.SubElement(new_book, 'title').text = 'Replacement Book'
# Find index and replace
idx = list(root).index(old_book)
root.remove(old_book)
root.insert(idx, new_book)
Handling Namespaces
This is where XML gets annoying.
The Problem
<root xmlns="http://example.com/default"
      xmlns:custom="http://example.com/custom">
    <item>Default namespace</item>
    <custom:item>Custom namespace</custom:item>
</root>
Try to find elements normally:
# xml_with_namespaces holds the document above as a string
root = ET.fromstring(xml_with_namespaces)
print(root.find('item'))  # None! Where did it go?
Namespaces change how tags work internally:
for child in root:
    print(child.tag)
# {http://example.com/default}item
# {http://example.com/custom}item
The full tag is {namespace}localname.
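A short sketch of working with that form directly. The split_tag() helper is my own invention for pulling the two parts apart; passing the fully qualified name straight to find() is standard ElementTree behavior.

```python
import xml.etree.ElementTree as ET

xml_with_namespaces = """
<root xmlns="http://example.com/default"
      xmlns:custom="http://example.com/custom">
  <item>Default namespace</item>
  <custom:item>Custom namespace</custom:item>
</root>
"""
root = ET.fromstring(xml_with_namespaces)

def split_tag(tag):
    """Split '{uri}local' into (uri, local); uri is '' when there is none."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag

print([split_tag(child.tag) for child in root])
# [('http://example.com/default', 'item'), ('http://example.com/custom', 'item')]

# The fully qualified name also works directly in find().
item = root.find("{http://example.com/default}item")
print(item.text)  # Default namespace
```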
Solution: Namespace Dicts
ns = {
    'default': 'http://example.com/default',
    'custom': 'http://example.com/custom',
}
# Now finding works
root.find('default:item', ns)
root.find('custom:item', ns)
root.findall('.//default:item', ns)
Dealing with Default Namespaces
When there's no prefix in the XML (just xmlns=), you still need one in your dict:
<feed xmlns="http://www.w3.org/2005/Atom">
    <entry><title>Hello</title></entry>
</feed>
# atom_feed holds the document above as a string
ns = {'atom': 'http://www.w3.org/2005/Atom'}
root = ET.fromstring(atom_feed)
entries = root.findall('atom:entry', ns)
Creating Namespaced XML
# Register namespace prefix
ET.register_namespace('custom', 'http://example.com/custom')
root = ET.Element('{http://example.com/custom}root')
item = ET.SubElement(root, '{http://example.com/custom}item')
item.text = 'Hello'
print(ET.tostring(root, encoding='unicode'))
# <custom:root xmlns:custom="http://example.com/custom"><custom:item>Hello</custom:item></custom:root>
Stripping Namespaces
Sometimes you just want to ignore them:
def strip_namespaces(root):
    """Remove all namespaces from element tags."""
    for elem in root.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return root
root = strip_namespaces(root)
# Now root.find('item') works
Use with caution: you lose namespace information, and namespaced attribute names are left as they are.
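If you're on Python 3.8 or newer, there's a less destructive option: the {*} wildcard matches a tag name in any namespace without rewriting the tree. A sketch on the same kind of document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root xmlns="http://example.com/default" '
    'xmlns:custom="http://example.com/custom">'
    "<item>Default namespace</item>"
    "<custom:item>Custom namespace</custom:item>"
    "</root>"
)

# {*}item matches <item> in any namespace (or in none); tags are left intact.
items = root.findall("{*}item")
print([item.text for item in items])
# ['Default namespace', 'Custom namespace']
print(items[0].tag)  # {http://example.com/default}item
```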
Common Gotchas
1. find() Returns None Silently
book = root.find('nonexistent')
title = book.find('title')  # AttributeError: 'NoneType' object has no attribute 'find'
Always check:
book = root.find('book')
if book is not None:
    title = book.find('title')
Or use findall(), which returns an empty list instead of None.
2. Boolean Testing Elements
# This looks right but isn't
elem = root.find('book')
if elem:  # False if book has no children!
    print("Found")
An element with no children is "falsy", and Python 3.12+ emits a DeprecationWarning for this kind of truth test. Use an explicit None check:
if elem is not None:
    print("Found")
3. Text is None, Not Empty String
empty = ET.fromstring('<tag></tag>')
print(empty.text) # None, not ""
print(empty.text or "")  # Safe way to get empty string
4. Encoding Gotchas
# tostring returns bytes by default
ET.tostring(root) # b'<root>...</root>'
# For string, specify encoding
ET.tostring(root, encoding='unicode')  # '<root>...</root>'
5. Modifying During Iteration
# BROKEN
for child in root:
    root.remove(child)  # Skips elements!
# FIXED
for child in list(root):
    root.remove(child)
Security: XXE Attacks
This is critical if you parse untrusted XML.
What's XXE?
XML External Entity attacks let attackers read files or make network requests from your server:
<?xml version="1.0"?>
<!DOCTYPE foo [
    <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>&xxe;</root>
When a vulnerable parser processes this, &xxe; expands to the contents of /etc/passwd.
ElementTree's Default Behavior
Good news: xml.etree.ElementTree does not resolve external entities, so the basic XXE attack doesn't work. According to the Python documentation, it raises a ParseError when it meets one:
import xml.etree.ElementTree as ET
malicious = '''<?xml version="1.0"?>
<!DOCTYPE foo [
    <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>&xxe;</root>'''
try:
    root = ET.fromstring(malicious)
    print(root.text)
except ET.ParseError as exc:
    print(f"Rejected: {exc}")  # Entity not expanded
But Other Attacks Exist
- Billion laughs attack (denial of service)
- External DTD fetching
- Parameter entity expansion
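The first of these works because internal entities are expanded by the standard library parser. A harmless sketch shows the mechanism; a real billion-laughs payload nests many more levels of the same thing. The entity names are made up.

```python
import xml.etree.ElementTree as ET

# A harmless internal entity; an attack payload nests many of these.
document = """<!DOCTYPE note [
  <!ENTITY greeting "hello">
  <!ENTITY twice "&greeting; &greeting;">
]>
<note>&twice;</note>"""

root = ET.fromstring(document)
print(root.text)  # hello hello
```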
The Safe Solution: defusedxml
For untrusted input, use defusedxml:
pip install defusedxml
import defusedxml.ElementTree as ET
# Same API, but safe
root = ET.fromstring(untrusted_xml)
tree = ET.parse(untrusted_file)
defusedxml blocks:
- External entity processing
- DTD retrieval
- Entity expansion attacks
Rule of thumb: If you're parsing user-supplied XML, use defusedxml.
Security Checklist
- ✅ Use defusedxml for untrusted input
- ✅ Validate XML against a schema if possible
- ✅ Set reasonable size limits on input
- ✅ Don't use xml.etree.ElementTree with XMLParser entities enabled
- ✅ Consider JSON instead if you control both ends
Quick Reference
import xml.etree.ElementTree as ET
# Parsing
root = ET.fromstring(xml_string) # From string
tree = ET.parse('file.xml') # From file
root = tree.getroot()
# Navigation
root.find('tag') # First match (or None)
root.findall('tag') # All matches (list)
root.iter('tag') # All descendants (iterator)
root.findall('.//tag') # Descendants via XPath
# Element properties
elem.tag # Tag name
elem.text # Text content
elem.tail # Text after close tag
elem.attrib # Attributes dict
elem.get('attr', default) # Get attribute
# XPath
root.find('./child/grandchild') # Path
root.find('.//tag') # Any descendant
root.find(".//tag[@attr='val']") # By attribute
root.find('.//tag[1]') # By position
# Creating
root = ET.Element('root') # New element
child = ET.SubElement(root, 'child') # Add child
child.text = 'content' # Set text
child.set('attr', 'value') # Set attribute
ET.indent(root) # Pretty print
# Writing
ET.tostring(root, encoding='unicode')
tree = ET.ElementTree(root)
tree.write('out.xml', encoding='utf-8', xml_declaration=True)
# Namespaces
ns = {'prefix': 'http://example.com'}
root.find('prefix:tag', ns)
When to Use What
| Task | Tool |
|---|---|
| Simple parsing | xml.etree.ElementTree |
| Large files | iterparse() |
| Pretty printing | minidom or ET.indent() |
| Full XPath | lxml |
| Untrusted XML | defusedxml |
| Speed critical | lxml |
Wrapping Up
ElementTree handles most everyday XML tasks with zero dependencies. The API is intuitive once you understand:
- Elements are list-like (iterate children)
- Attributes are dict-like (.get(), .set())
- find() returns None, findall() returns an empty list
- Namespaces need explicit handling
- Use defusedxml for untrusted input
Start with fromstring() and find()/findall(). Add complexity only when needed.