I recently dove into Python threading and concurrency, and honestly, it was confusing at first. The GIL? Locks? Race conditions? It felt like learning a new language. But after banging my head against it for a while, things started to click. Here's what I learned.
The GIL: Why Python Threading Is... Weird
Before writing any code, you need to understand the Global Interpreter Lock (GIL). This confused me for weeks.
The GIL is a mutex that protects access to Python objects. Only one thread can execute Python bytecode at a time, even on a multi-core machine.
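You can actually inspect the knob that controls this switching: `sys.getswitchinterval()` (standard library) reports how often, in seconds, the interpreter asks the running thread to give up the GIL.

```python
import sys

# How often (in seconds) CPython asks the running thread to
# release the GIL so another thread can be scheduled
print(sys.getswitchinterval())  # 0.005 by default
```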
# Even with 4 threads, only ONE runs Python code at a time
# The GIL switches between them rapidly

This sounds terrible, right? Why even have threading? Here's the key insight that finally made it click for me:
The GIL is released during I/O operations. When your thread is waiting for a network response or file read, it releases the GIL and another thread can run.
# Thread 1: makes HTTP request, releases GIL while waiting
# Thread 2: can now run! Makes its own request
# Thread 3: also waiting on I/O, GIL released
# All three can "wait" simultaneously

This is why threading works great for I/O-bound tasks but not CPU-bound ones. We'll come back to this.
Threading Module Basics
Let's start simple. The threading module is Python's high-level threading interface.
import threading
import time

def worker(name):
    print(f"{name} starting")
    time.sleep(2)  # Simulates I/O - GIL is released!
    print(f"{name} finished")

# Create threads
t1 = threading.Thread(target=worker, args=("Thread-1",))
t2 = threading.Thread(target=worker, args=("Thread-2",))

# Start them
t1.start()
t2.start()

# Wait for completion
t1.join()
t2.join()

print("All done!")

Output:
Thread-1 starting
Thread-2 starting
Thread-1 finished # Both finish around the same time!
Thread-2 finished
All done!
Both threads sleep concurrently, so this takes ~2 seconds total, not 4.
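You can verify that claim directly with a timer (shorter sleeps here so it runs fast; the ratio is what matters):

```python
import threading
import time

def worker():
    time.sleep(0.2)  # simulated I/O - the GIL is released while sleeping

start = time.time()
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"{elapsed:.2f}s")  # ~0.2s, not 0.4s - the sleeps overlap
```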
The Thread Class
You can subclass Thread for more complex scenarios:
import threading
import time

class DownloadThread(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.result = None

    def run(self):
        # This method is called when you call .start()
        print(f"Downloading {self.url}")
        time.sleep(1)  # Simulate download
        self.result = f"Data from {self.url}"

# Usage
threads = [
    DownloadThread("https://api.example.com/users"),
    DownloadThread("https://api.example.com/posts"),
    DownloadThread("https://api.example.com/comments"),
]

for t in threads:
    t.start()
for t in threads:
    t.join()
for t in threads:
    print(t.result)

I prefer the function-based approach for simple cases, but subclassing is nice when threads need to return values or maintain state.
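For comparison, here's a sketch of the function-based approach collecting results through a shared dict, where each thread writes only its own key (so this particular pattern needs no lock):

```python
import threading
import time

results = {}

def download(url):
    time.sleep(0.1)  # simulate the download
    results[url] = f"Data from {url}"  # each thread writes a distinct key

urls = ["https://api.example.com/users", "https://api.example.com/posts"]
threads = [threading.Thread(target=download, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```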
Daemon Threads
Daemon threads are background threads that die when your main program exits:
import threading
import time

def background_task():
    while True:
        print("Background working...")
        time.sleep(1)

# Daemon thread - program won't wait for it
t = threading.Thread(target=background_task)
t.daemon = True  # Make it a daemon
t.start()

time.sleep(3)
print("Main thread exiting")
# Daemon thread is killed automatically

Use daemons for cleanup tasks, monitoring, or anything that shouldn't keep your program alive.
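Since Python 3.3 you can also pass daemon=True straight to the constructor, which I find tidier:

```python
import threading
import time

def heartbeat():
    while True:
        time.sleep(0.1)  # background work that should never block exit

# daemon can be set in the constructor instead of as an attribute
t = threading.Thread(target=heartbeat, daemon=True)
t.start()
print(t.daemon)  # True
```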
Locks and Synchronization
Here's where I made my first big mistake. I had multiple threads updating a shared counter:
import threading

counter = 0

def increment():
    global counter
    for _ in range(100000):
        counter += 1  # NOT thread-safe!

threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # Expected: 500000, Actual: something less!

The problem? counter += 1 isn't atomic. It's actually three separate steps:
- Read counter
- Add 1
- Write counter
If two threads read the same value before either writes, you lose an increment. This is a race condition.
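You can see those separate steps with the dis module: the increment compiles to distinct load, add, and store bytecode instructions, and a thread switch can land between any two of them.

```python
import dis

counter = 0

def increment():
    global counter
    counter += 1  # looks like one operation, but isn't

# List the bytecode ops: a load, an add, and a store, all separate
ops = [ins.opname for ins in dis.get_instructions(increment)]
print(ops)
```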
Using Locks
The fix is a Lock:
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100000):
        with lock:  # Only one thread can hold this at a time
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # Always 500000

The with lock: syntax acquires the lock, runs your code, then releases it. Even if an exception occurs, the lock is released.
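A quick check that the lock really is released when an exception escapes the with block:

```python
import threading

lock = threading.Lock()

try:
    with lock:
        raise ValueError("boom")
except ValueError:
    pass

print(lock.locked())  # False - the with statement released it
```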
RLock (Reentrant Lock)
A regular Lock can't be acquired twice by the same thread - it'll deadlock. RLock can:
import threading

lock = threading.RLock()

def outer():
    with lock:
        print("In outer")
        inner()  # This acquires the same lock again

def inner():
    with lock:  # With a plain Lock, this would deadlock!
        print("In inner")

outer()

I use RLock when I have nested function calls that all need the lock.
Other Synchronization Primitives
import threading
import time

# Semaphore - allows N threads at once
semaphore = threading.Semaphore(3)

def limited_worker():
    with semaphore:
        # Only 3 threads run this at a time
        do_work()

# Event - simple flag for signaling
event = threading.Event()

def waiter():
    print("Waiting...")
    event.wait()  # Blocks until the event is set
    print("Done waiting!")

def setter():
    time.sleep(2)
    event.set()  # Unblocks all waiters

# Condition - more complex waiting
condition = threading.Condition()

def consumer():
    with condition:
        condition.wait()  # Wait for notification
        # (real code should use wait_for() or re-check a predicate
        # in a loop, since wakeups can be spurious)
        process_item()

def producer():
    with condition:
        add_item()
        condition.notify()  # Wake up one waiter

Thread Pools with concurrent.futures
Creating individual threads works, but managing them is tedious. ThreadPoolExecutor is much cleaner:
from concurrent.futures import ThreadPoolExecutor
import time

def download(url):
    print(f"Downloading {url}")
    time.sleep(1)
    return f"Data from {url}"

urls = [
    "https://example.com/1",
    "https://example.com/2",
    "https://example.com/3",
    "https://example.com/4",
]

# Use a pool of 3 workers
with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(download, urls)
    for result in results:
        print(result)

The pool reuses threads, handles the join logic, and provides a clean interface.
Submitting Individual Tasks
For more control, use submit():
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def slow_task(n):
    time.sleep(n)
    return f"Task {n} done"

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit tasks
    futures = {executor.submit(slow_task, i): i for i in [3, 1, 2]}

    # Process as they complete (not in submission order!)
    for future in as_completed(futures):
        task_id = futures[future]
        result = future.result()
        print(f"Task {task_id}: {result}")

Output:
Task 1: Task 1 done
Task 2: Task 2 done
Task 3: Task 3 done
as_completed() yields futures as they finish, which is great for handling results as soon as they're ready.
Error Handling
from concurrent.futures import ThreadPoolExecutor

def might_fail(n):
    if n == 2:
        raise ValueError("I don't like 2")
    return n * 2

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(might_fail, i) for i in range(5)]

    for i, future in enumerate(futures):
        try:
            result = future.result()
            print(f"Task {i}: {result}")
        except Exception as e:
            print(f"Task {i} failed: {e}")

When Threading Helps (and When It Doesn't)
This is the most important section. I wasted hours trying to speed up CPU-bound code with threads.
I/O-Bound: Threading Wins
I/O-bound means your code spends most of its time waiting for external things:
- Network requests (APIs, databases)
- File reads/writes
- User input
from concurrent.futures import ThreadPoolExecutor
import requests
import time

def fetch(url):
    response = requests.get(url)
    return len(response.content)

urls = ["https://example.com"] * 10

# Sequential: slow
start = time.time()
results = [fetch(url) for url in urls]
print(f"Sequential: {time.time() - start:.2f}s")

# Threaded: fast!
start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch, urls))
print(f"Threaded: {time.time() - start:.2f}s")

Sequential: 5.23s
Threaded: 0.68s
When one thread waits for the network, others run. Big win.
CPU-Bound: Threading Fails
CPU-bound means your code does heavy computation:
- Number crunching
- Image processing
- Data parsing
from concurrent.futures import ThreadPoolExecutor
import time

def cpu_intensive(n):
    # Simulates heavy computation
    total = 0
    for i in range(n):
        total += i * i
    return total

# Sequential
start = time.time()
results = [cpu_intensive(10_000_000) for _ in range(4)]
print(f"Sequential: {time.time() - start:.2f}s")

# Threaded - NOT faster due to GIL!
start = time.time()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(cpu_intensive, [10_000_000] * 4))
print(f"Threaded: {time.time() - start:.2f}s")

Sequential: 3.42s
Threaded: 3.51s # Same or worse!
The GIL means only one thread runs Python code at a time. For CPU-bound work, use multiprocessing or ProcessPoolExecutor instead:
from concurrent.futures import ProcessPoolExecutor

# This DOES parallelize CPU work
# (the guard is required on platforms that spawn workers, e.g. Windows/macOS)
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(cpu_intensive, [10_000_000] * 4))

Common Pitfalls
Here are mistakes I made so you don't have to:
1. Forgetting to Join
# Bad - program might exit before thread finishes
t = threading.Thread(target=long_task)
t.start()
# No join!

# Good
t = threading.Thread(target=long_task)
t.start()
t.join()  # Wait for completion

2. Sharing Mutable State Without Locks
# Bad - race condition
shared_list = []

def append_items():
    for i in range(100):
        # append alone happens to be atomic in CPython, but don't rely
        # on it - compound read-modify-write operations aren't
        shared_list.append(i)

# Good - use a lock
lock = threading.Lock()
shared_list = []

def append_items():
    for i in range(100):
        with lock:
            shared_list.append(i)

# Better - use thread-safe structures
from queue import Queue
shared_queue = Queue()

def append_items():
    for i in range(100):
        shared_queue.put(i)  # Thread-safe!

3. Deadlocks
# Deadlock! Thread 1 has lock_a, waits for lock_b
# Thread 2 has lock_b, waits for lock_a
lock_a = threading.Lock()
lock_b = threading.Lock()

def thread1():
    with lock_a:
        time.sleep(0.1)
        with lock_b:  # Waits forever
            pass

def thread2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:  # Waits forever
            pass

# Fix: Always acquire locks in the same order
def thread1_fixed():
    with lock_a:
        with lock_b:
            pass

def thread2_fixed():
    with lock_a:  # Same order!
        with lock_b:
            pass

4. Using Threading for CPU-Bound Work
See the section above. Use multiprocessing for CPU work.
5. Too Many Threads
# Bad - thousands of threads = overhead
with ThreadPoolExecutor(max_workers=1000) as executor:
    results = executor.map(fetch, urls)

# Good - reasonable pool size
# Rule of thumb: 2-4x CPU cores for I/O-bound
import os
workers = min(32, (os.cpu_count() or 1) * 4)  # cpu_count() can return None
with ThreadPoolExecutor(max_workers=workers) as executor:
    results = executor.map(fetch, urls)

6. Not Handling Exceptions in Threads
# Bad - exception in thread silently disappears
def worker():
    raise ValueError("Oops!")

t = threading.Thread(target=worker)
t.start()
t.join()
# A traceback may be printed (via threading.excepthook),
# but the main program never sees the exception!

# Good - catch exceptions in the thread
def worker():
    try:
        raise ValueError("Oops!")
    except Exception as e:
        print(f"Error in thread: {e}")

# Or use ThreadPoolExecutor and check results
def risky():
    raise ValueError("Oops!")

with ThreadPoolExecutor() as executor:
    future = executor.submit(risky)
    try:
        future.result()  # Re-raises the exception here
    except ValueError as e:
        print(f"Caught: {e}")

Quick Reference
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

# Simple thread
t = threading.Thread(target=func, args=(arg1, arg2))
t.start()
t.join()

# Thread pool
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(func, items)

# Lock for shared state
lock = threading.Lock()
with lock:
    modify_shared_state()

# Thread-safe queue
from queue import Queue
q = Queue()
q.put(item)
item = q.get()

Final Thoughts
Threading in Python isn't as scary as it seems once you understand:
- The GIL makes threading great for I/O, useless for CPU
- Use locks when multiple threads modify shared state
- ThreadPoolExecutor is cleaner than managing threads manually
- Use multiprocessing for CPU-bound parallelism
Start with ThreadPoolExecutor for most cases. It handles the hard parts and keeps your code clean. Only reach for lower-level primitives when you need more control.