I recently needed to save some Python objects between script runs. Not a database situation—just "remember this dictionary when I run again tomorrow." That's when I discovered pickle and shelve, Python's built-in persistence tools. Here's what I learned.
The Problem: Python Objects Are Ephemeral
When your script ends, everything in memory disappears. Variables, objects, computed results—gone. If you want data to survive, you need to persist it somehow.
```python
# This dictionary exists only while the script runs
config = {"theme": "dark", "font_size": 14}
# When Python exits, config vanishes
```

I could write to a JSON file, but what about more complex Python objects like classes, datetime objects, or nested structures? That's where pickle comes in.
Pickle Basics: dump and load
pickle serializes Python objects to bytes. Think of it as "freeze-drying" your data.
```python
import pickle

# dumps = dump to string (bytes)
data = {"name": "Alice", "scores": [95, 87, 92]}
serialized = pickle.dumps(data)

print(type(serialized))  # <class 'bytes'>
print(len(serialized))   # Some number of bytes

# loads = load from string
restored = pickle.loads(serialized)
print(restored)  # {'name': 'Alice', 'scores': [95, 87, 92]}
```

For files, use dump and load (no 's'):
```python
import pickle

data = {"key": "value", "numbers": [1, 2, 3]}

# Write to file (binary mode!)
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# Read from file
with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)
```

The b in wb and rb is crucial—pickle produces binary data, not text.
⚠️ The Security Warning Nobody Should Skip
Here's the thing that surprised me: pickle can execute arbitrary code when loading. This isn't a bug—it's how pickle handles reconstructing complex objects.
```python
import pickle
import os

class Evil:
    def __reduce__(self):
        # This runs os.system when unpickled!
        return (os.system, ("echo YOU'VE BEEN HACKED",))

# Pickling is fine
evil_bytes = pickle.dumps(Evil())

# But loading runs the command!
# pickle.loads(evil_bytes)  # DON'T DO THIS with untrusted data
```

The __reduce__ method tells pickle how to reconstruct an object. Attackers can craft pickle data that runs any code when loaded.
Rule of thumb: Never unpickle data from untrusted sources. Treat .pkl files from the internet like executable files.
If you need to accept external serialized data, use JSON or MessagePack instead.
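If you must load pickles that might not be fully trusted, the pickle docs describe a middle ground: subclass pickle.Unpickler and override find_class to whitelist what can be resolved. Here's a minimal sketch of that pattern (the SAFE_BUILTINS whitelist is illustrative, and the docs are clear this hardens but does not fully sandbox unpickling):

```python
import builtins
import io
import pickle

# Names we'll allow to be looked up during unpickling (illustrative list)
SAFE_BUILTINS = {"range", "complex", "set", "frozenset"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a small whitelist of builtins; reject everything else
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data containing no globals loads fine...
print(restricted_loads(pickle.dumps({"a": [1, 2, 3]})))
# ...but a pickle that references something like os.system raises
# UnpicklingError instead of executing it.
```

Even so, the earlier rule stands: for genuinely untrusted input, prefer a data-only format.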
Pickle Protocol Versions
Pickle has evolved over time. Each protocol version adds features and improves performance.
```python
import pickle

data = {"test": True}

# Protocol 0: ASCII, human-readable (slow, legacy)
p0 = pickle.dumps(data, protocol=0)
print(p0)  # You can sort of read it

# Protocol 4: Python 3.4+ (default in Python 3.8+)
p4 = pickle.dumps(data, protocol=4)

# Protocol 5: Python 3.8+ (efficient for large buffers)
p5 = pickle.dumps(data, protocol=5)

# Use highest available for best performance
best = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Check your default
print(f"Default protocol: {pickle.DEFAULT_PROTOCOL}")
```

In practice, I just use HIGHEST_PROTOCOL unless I need compatibility with older Python versions.
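If you're ever unsure which protocol an existing pickle was written with, the stdlib's pickletools module can disassemble the byte stream. In protocol 2 and later, the stream opens with the PROTO opcode (byte 0x80) followed by the protocol number:

```python
import pickle
import pickletools

data = {"test": True}
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# For protocol 2+, the first byte is the PROTO opcode (0x80),
# and the second byte is the protocol number actually used.
print(blob[0])  # 128
print(blob[1])  # the protocol number

# Full human-readable opcode listing of the pickle stream
pickletools.dis(blob)
```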
Custom Pickling with getstate and setstate
Not everything can be pickled. File handles, database connections, sockets—these don't serialize. But you can control what gets pickled using special methods.
```python
import pickle

class DatabaseConnection:
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.socket = self._connect()  # Can't pickle sockets

    def _connect(self):
        print(f"Connecting to {self.host}:{self.port}")
        return f"socket-{self.host}"  # Fake socket

    def __getstate__(self):
        # Called when pickling - return what to save
        state = self.__dict__.copy()
        del state["socket"]  # Remove unpicklable part
        return state

    def __setstate__(self, state):
        # Called when unpickling - restore from saved state
        self.__dict__.update(state)
        self.socket = self._connect()  # Reconnect

# Test it
conn = DatabaseConnection("localhost", 5432)
data = pickle.dumps(conn)
restored = pickle.loads(data)  # Prints "Connecting to..."
print(restored.host)  # localhost
```

This pattern is essential for objects with external resources. Save the configuration, reconnect on restore.
A simpler example: excluding computed values
```python
import pickle

class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height
        self._area = width * height  # Cached computation

    def __getstate__(self):
        # Don't persist the cached value
        state = self.__dict__.copy()
        del state["_area"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._area = self.width * self.height  # Recompute

rect = Rectangle(10, 20)
restored = pickle.loads(pickle.dumps(rect))
print(restored._area)  # 200
```

Enter shelve: Dict-Like Persistence
pickle is the engine; shelve is the convenient interface. It gives you a persistent dictionary backed by pickle.
```python
import shelve

# Open a shelf (creates database files)
with shelve.open("mydata") as db:
    db["name"] = "Alice"
    db["scores"] = [95, 87, 92]
    db["config"] = {"debug": True, "version": 1}

# Later, in another script run...
with shelve.open("mydata") as db:
    print(db["name"])    # Alice
    print(db["scores"])  # [95, 87, 92]
```

No explicit serialization—just assign and read like a dict. The data persists across program runs.
The Writeback Gotcha
This tripped me up. By default, modifying objects retrieved from a shelf doesn't persist:
```python
import shelve

# Store a list
with shelve.open("mydata") as db:
    db["items"] = [1, 2, 3]

# Try to append
with shelve.open("mydata") as db:
    db["items"].append(4)  # This looks like it should work...

# Check what we have
with shelve.open("mydata") as db:
    print(db["items"])  # [1, 2, 3] - The 4 is GONE!
```

What happened? When you read db["items"], shelve unpickles a copy. You modified the copy, then shelve threw it away.
The fix is writeback=True:
```python
import shelve

with shelve.open("mydata", writeback=True) as db:
    db["items"] = [1, 2, 3]
    db["items"].append(4)  # Now this persists

with shelve.open("mydata") as db:
    print(db["items"])  # [1, 2, 3, 4]
```

With writeback, shelve caches all accessed objects and writes them back on close. Trade-off: uses more memory and close() is slower.
Shelve Limitations
A few things to know:
```python
import shelve

with shelve.open("mydata") as db:
    # Keys MUST be strings
    # db[123] = "value"  # TypeError!
    db["123"] = "value"  # OK

    # Not thread-safe by default
    # Values must be picklable
```

Alternatives: When to Use What
After learning pickle and shelve, I realized they're not always the right choice.
JSON: Safe and Portable
```python
import json

data = {"name": "Alice", "age": 30}

# Serialize to string
json_str = json.dumps(data)

# Works across languages, safe to load
restored = json.loads(json_str)
```

Use JSON when:
- Data comes from or goes to external sources
- You need cross-language compatibility
- You only have basic types (dict, list, str, int, float, bool, None)
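The basic-types restriction is softer than it looks: json.dumps accepts a default= hook that's called for anything it can't serialize natively. Here's a sketch using ISO 8601 strings for datetimes (note that decoding gives you a string back, so you have to re-parse it yourself):

```python
import json
from datetime import datetime

record = {"user": "Alice", "created": datetime(2024, 1, 15, 9, 30)}

# default= is called for objects json can't handle; here we convert
# the datetime to an ISO 8601 string
encoded = json.dumps(record, default=lambda o: o.isoformat())
print(encoded)  # {"user": "Alice", "created": "2024-01-15T09:30:00"}

# JSON has no datetime type, so decoding returns a string;
# re-parse it explicitly on the way back in
decoded = json.loads(encoded)
print(datetime.fromisoformat(decoded["created"]))
```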
MessagePack: Fast Binary
```python
# pip install msgpack
import msgpack

data = {"name": "Alice", "scores": [95, 87, 92]}

# Smaller and faster than JSON
packed = msgpack.packb(data)
restored = msgpack.unpackb(packed)
```

Use msgpack when:
- You need speed and small size
- Still want safety (no code execution)
- Basic types only, but binary format is OK
When to Use Pickle/Shelve
Good for:
- Local caching of computed results
- Quick prototyping
- Development and debugging
- Saving/loading your own program's state
Avoid for:
- Data from untrusted sources (security risk!)
- Long-term storage (schema changes break it)
- Cross-language systems (Python-only)
- Network protocols (use JSON/protobuf)
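On the "schema changes break it" point: if a class gains a new attribute after old pickles were written, loading those pickles leaves the attribute missing. A defensive __setstate__ that fills in defaults first can soften this. A sketch (the Settings class is hypothetical, and the "old pickle" is simulated with a hand-built state dict):

```python
class Settings:
    def __init__(self, theme="light", font_size=12):
        self.theme = theme
        self.font_size = font_size  # imagine this field was added in v2

    def __setstate__(self, state):
        # Establish defaults first, then overlay whatever the
        # (possibly older) pickle actually contains
        self.__init__()
        self.__dict__.update(state)

# Simulate unpickling a v1 object that only knew about "theme"
old_state = {"theme": "dark"}
s = Settings.__new__(Settings)  # bypass __init__, as pickle does
s.__setstate__(old_state)
print(s.theme, s.font_size)  # dark 12
```

This helps with added fields, but renamed or restructured attributes still need explicit migration logic, which is why long-term storage is better served by a real schema.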
My Decision Framework
Here's how I now decide what to use:
Is the data from an untrusted source?
├── Yes → Use JSON or msgpack, NEVER pickle
└── No → Continue...
Do I need complex Python objects (classes, datetime, etc.)?
├── Yes → Use pickle/shelve
└── No → Continue...
Do I need dict-like persistence?
├── Yes → Use shelve
└── No → Use pickle
Am I storing this long-term (months/years)?
├── Yes → Consider SQLite or a proper database
└── No → pickle/shelve is fine
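For the long-term branch, the stdlib's sqlite3 module gets you surprisingly far. Here's a minimal sketch of a key-value table with JSON values; the table name, schema, and helper names are all illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect("store.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

def kv_set(key, value):
    # Upsert the value as JSON text
    conn.execute(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
        (key, json.dumps(value)),
    )
    conn.commit()

def kv_get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None

kv_set("config", {"debug": True, "version": 1})
print(kv_get("config"))  # {'debug': True, 'version': 1}
```

You get transactions, concurrent readers, and a file format that other tools can open — none of which shelve offers.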
Putting It Together: A Simple Cache
Here's a practical example combining what I learned:
```python
import shelve
import time
import hashlib

class SimpleCache:
    def __init__(self, path="cache", ttl=3600):
        self.path = path
        self.ttl = ttl

    def _hash_key(self, key):
        # Ensure string key for shelve
        return hashlib.md5(str(key).encode()).hexdigest()

    def get(self, key):
        cache_key = self._hash_key(key)
        with shelve.open(self.path) as db:
            if cache_key in db:
                entry = db[cache_key]
                if time.time() - entry["time"] < self.ttl:
                    return entry["value"]
        return None

    def set(self, key, value):
        cache_key = self._hash_key(key)
        with shelve.open(self.path) as db:
            db[cache_key] = {
                "value": value,
                "time": time.time()
            }

# Usage
cache = SimpleCache()
cache.set("expensive_computation", [1, 2, 3, 4, 5])
result = cache.get("expensive_computation")
```

What I Learned
- pickle is powerful but dangerous - Never load untrusted pickle data
- shelve is just pickle with a dict interface - Convenient for key-value persistence
- writeback mode matters - Without it, in-place modifications are lost
- JSON for external data - Safe, portable, universally understood
- Protocol versions matter - Use HIGHEST_PROTOCOL for best performance
- getstate/setstate for custom objects - Control what gets serialized
For quick local persistence, shelve is hard to beat. But the moment external data or security enters the picture, switch to JSON or msgpack.