I recently needed to read some binary file headers at work. Turns out Python has a powerful module for this: struct. Here's everything I learned about working with binary data.

Why Binary Data?

Not everything is JSON or text. Network protocols, file formats, embedded systems—they all speak binary. The struct module lets you convert between Python values and C-style binary representations.

Basic Packing and Unpacking

import struct
 
# Pack Python values into bytes
data = struct.pack("i", 42)  # 'i' = 4-byte signed integer
print(data)  # b'*\x00\x00\x00' (little-endian on most systems)
print(len(data))  # 4 bytes
 
# Unpack bytes back to Python values
value = struct.unpack("i", data)
print(value)  # (42,)  - always returns a tuple!
 
# Single value? Index the tuple with [0] (unpack_from returns a tuple too)
value = struct.unpack("i", data)[0]
print(value)  # 42

Format Strings

Format strings tell struct how to interpret the bytes. Each character represents a type:

import struct
 
# Integer types
struct.pack("b", -128)      # signed char (1 byte)
struct.pack("B", 255)       # unsigned char (1 byte)
struct.pack("h", -32768)    # short (2 bytes)
struct.pack("H", 65535)     # unsigned short (2 bytes)
struct.pack("i", 42)        # int (4 bytes)
struct.pack("I", 42)        # unsigned int (4 bytes)
struct.pack("q", 42)        # long long (8 bytes)
struct.pack("Q", 42)        # unsigned long long (8 bytes)
 
# Floating point
struct.pack("f", 3.14)      # float (4 bytes)
struct.pack("d", 3.14159)   # double (8 bytes)
 
# Bytes and strings
struct.pack("4s", b"test")  # 4-byte string
struct.pack("c", b"A")      # single char
 
# Boolean and padding
struct.pack("?", True)      # bool (1 byte)
struct.pack("x")            # pad byte (no value needed)
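To see how many bytes each format character takes, you can ask struct directly. This is a small sketch, not an exhaustive table. It uses the "<" prefix so the sizes are the standard ones rather than whatever the host C compiler uses, and it also shows that out-of-range values are rejected instead of silently wrapping.

```python
import struct

# "<" selects standard sizes, so these numbers do not depend on the platform
sizes = {code: struct.calcsize("<" + code) for code in "bBhHiIqQfd?c"}
for code, size in sizes.items():
    print(code, size)

# Values outside the type's range raise struct.error
try:
    struct.pack("B", 256)  # unsigned char only holds 0..255
except struct.error as exc:
    print("rejected:", exc)
```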

Multiple Values

Pack multiple values at once:

import struct
 
# Pack multiple values
data = struct.pack("ihf", 42, 1000, 3.14)
print(len(data))  # 12 bytes (4 + 2 + 2 padding + 4)
 
# Unpack multiple values
values = struct.unpack("ihf", data)
print(values)  # (42, 1000, 3.140000104904175)
 
# Repeat counts
data = struct.pack("3i", 1, 2, 3)  # three ints
values = struct.unpack("3i", data)
print(values)  # (1, 2, 3)
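Positional tuples get hard to read once a format has more than two or three fields. One option is to wrap the result in a namedtuple. The Record name, its field names, and the "<ihf" layout below are made up for illustration.

```python
import struct
from collections import namedtuple

# Hypothetical record layout: an int id, a short count, and a float score
Record = namedtuple("Record", "id count score")
RECORD_FMT = "<ihf"

data = struct.pack(RECORD_FMT, 42, 1000, 3.14)
record = Record._make(struct.unpack(RECORD_FMT, data))
print(record.id, record.count)  # 42 1000
```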

Endianness Matters

This was the first thing that tripped me up. Different systems store bytes in different orders:

import struct
 
value = 0x01020304
 
# Native byte order (system-dependent)
struct.pack("I", value)
 
# Little-endian (least significant byte first)
struct.pack("<I", value)  # b'\x04\x03\x02\x01'
 
# Big-endian (most significant byte first)  
struct.pack(">I", value)  # b'\x01\x02\x03\x04'
 
# Network byte order (same as big-endian)
struct.pack("!I", value)  # b'\x01\x02\x03\x04'

Format prefixes:

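  • @ - native order, native size and alignment (default)
  • = - native order, standard size, no padding
  • < - little-endian, standard size, no padding
  • > - big-endian, standard size, no padding
  • ! - network order (big-endian), standard size, no padding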

Tip: Network protocols almost always use big-endian (network byte order). File formats vary—check the spec!
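If you are unsure what "native" means on your machine, sys.byteorder tells you. The sketch below also shows what happens when bytes are read with the wrong order: no error, just a different number, which is why this bug is easy to miss.

```python
import struct
import sys

value = 0x01020304

print(sys.byteorder)  # 'little' or 'big', depending on the host

# "=" uses native order, so it matches whichever explicit prefix fits the host
native = struct.pack("=I", value)
explicit = struct.pack("<I" if sys.byteorder == "little" else ">I", value)
print(native == explicit)  # True

# Reading big-endian bytes as little-endian gives a valid but wrong number
wrong = struct.unpack("<I", struct.pack(">I", value))[0]
print(hex(wrong))  # 0x4030201
```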

Calculating Sizes

import struct
 
# How many bytes will this format produce?
print(struct.calcsize("i"))     # 4
print(struct.calcsize("ihf"))   # 12 (with padding)
print(struct.calcsize("<ihf"))  # 10 (no padding with explicit endian)
 
# Use calcsize to validate data length
data = b'\x00\x01\x02\x03'
fmt = "HH"
if len(data) == struct.calcsize(fmt):
    values = struct.unpack(fmt, data)
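The length check can be wrapped in a small helper so a short read produces a readable error. unpack_exact is a name I made up for this sketch; it is not part of the struct module.

```python
import struct

def unpack_exact(fmt, data):
    """Unpack data, raising ValueError if its length does not match fmt."""
    expected = struct.calcsize(fmt)
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes for {fmt!r}, got {len(data)}")
    return struct.unpack(fmt, data)

result = unpack_exact("<HH", b"\x00\x01\x02\x03")
print(result)  # (256, 770)
```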

Reading Binary File Headers

This is where struct really shines. Let's read some real file formats:

PNG Header

import struct
 
def read_png_header(path):
    """Read PNG file dimensions."""
    with open(path, "rb") as f:
        # PNG signature (8 bytes)
        signature = f.read(8)
        if signature != b'\x89PNG\r\n\x1a\n':
            raise ValueError("Not a PNG file")
        
        # IHDR chunk
        length, chunk_type = struct.unpack(">I4s", f.read(8))
        if chunk_type != b"IHDR":
            raise ValueError("Missing IHDR chunk")
        
        # Width and height (big-endian)
        width, height = struct.unpack(">II", f.read(8))
        
        return width, height
 
# width, height = read_png_header("image.png")
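The IHDR payload holds more than the dimensions. Per the PNG spec it is 13 bytes: width, height, bit depth, color type, compression method, filter method, and interlace method. Here is a sketch that parses the whole payload. parse_ihdr is a hypothetical helper, and the sample bytes are synthetic rather than read from a real file.

```python
import struct

# width, height, bit depth, color type, compression, filter, interlace
IHDR = struct.Struct(">IIBBBBB")  # 13 bytes

def parse_ihdr(payload):
    """Parse a 13-byte IHDR payload into a dict."""
    (width, height, bit_depth, color_type,
     compression, filter_method, interlace) = IHDR.unpack(payload)
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
        "interlaced": interlace == 1,
    }

# Synthetic payload: 640x480, 8 bits per channel, color type 6 (RGBA)
sample = IHDR.pack(640, 480, 8, 6, 0, 0, 0)
info = parse_ihdr(sample)
print(info["width"], info["height"], info["color_type"])  # 640 480 6
```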

WAV Header

import struct
 
def read_wav_header(path):
    """Read WAV file metadata."""
    with open(path, "rb") as f:
        # RIFF header (little-endian)
        riff, size, wave = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave != b"WAVE":
            raise ValueError("Not a WAV file")
        
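        # fmt chunk (assumed to directly follow the RIFF header)
        fmt, fmt_size = struct.unpack("<4sI", f.read(8))
        if fmt != b"fmt ":
            raise ValueError("Expected fmt chunk after RIFF header")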
        
        # Audio format data
        audio_fmt, channels, sample_rate, byte_rate, block_align, bits = \
            struct.unpack("<HHIIHH", f.read(16))
        
        return {
            "channels": channels,
            "sample_rate": sample_rate,
            "bits_per_sample": bits,
        }
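The reader above expects the fmt chunk right after the RIFF header. That is the common layout, but RIFF allows other chunks to come first. A more tolerant sketch walks the chunks until it finds "fmt ". find_fmt_chunk is a hypothetical helper, and the test file is generated in memory with the standard-library wave module.

```python
import io
import struct
import wave

def find_fmt_chunk(f):
    """Scan RIFF chunks until 'fmt ' is found and return its six fields."""
    riff, size, wave_id = struct.unpack("<4sI4s", f.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("Not a WAV file")
    while True:
        chunk_header = f.read(8)
        if len(chunk_header) < 8:
            raise ValueError("fmt chunk not found")
        chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
        if chunk_id == b"fmt ":
            return struct.unpack("<HHIIHH", f.read(16))
        # Skip this chunk; chunks are padded to an even number of bytes
        f.seek(chunk_size + (chunk_size & 1), io.SEEK_CUR)

# Build a tiny stereo 16-bit WAV in memory
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)  # bytes per sample, so 16-bit
    w.setframerate(44100)
    w.writeframes(b"\x00" * 8)
buf.seek(0)

audio_fmt, channels, sample_rate, byte_rate, block_align, bits = find_fmt_chunk(buf)
print(channels, sample_rate, bits)  # 2 44100 16
```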

ZIP Local File Header

import struct
 
def read_zip_entry(f):
    """Read a ZIP local file header."""
    # ZIP uses little-endian
    signature = struct.unpack("<I", f.read(4))[0]
    if signature != 0x04034b50:
        raise ValueError("Invalid ZIP signature")
    
    header = struct.unpack("<HHHHHIIIHH", f.read(26))
    (version, flags, compression, mod_time, mod_date,
     crc32, compressed_size, uncompressed_size,
     name_length, extra_length) = header
    
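    # Names are UTF-8 only when flag bit 11 is set; the spec's default is CP437
    encoding = "utf-8" if flags & 0x0800 else "cp437"
    filename = f.read(name_length).decode(encoding)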
    f.read(extra_length)  # skip extra field
    
    return {
        "filename": filename,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
    }
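A quick way to sanity-check a parser like this is to let the zipfile module write an archive in memory and then read the first local header by hand. This sketch folds the signature into the same format string, giving 30 bytes in total. The file name and contents are arbitrary test values.

```python
import io
import struct
import zipfile

# Signature (4 bytes) plus the 26-byte header that follows it
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")

# Write a one-file, uncompressed archive into memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("hello.txt", b"hello world")
buf.seek(0)

fields = LOCAL_HEADER.unpack(buf.read(LOCAL_HEADER.size))
signature = fields[0]
compression = fields[3]
uncompressed_size = fields[8]
name_length = fields[9]
filename = buf.read(name_length).decode("utf-8")  # ASCII name, so UTF-8 is safe here

print(hex(signature), filename, uncompressed_size)  # 0x4034b50 hello.txt 11
```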

Working with Binary Protocols

Network protocols are another common use case:

import struct
 
def build_dns_query(domain):
    """Build a simple DNS query packet."""
    # Header: ID, flags, counts
    header = struct.pack(">HHHHHH",
        0x1234,  # Transaction ID
        0x0100,  # Standard query
        1,       # Questions
        0, 0, 0  # Answers, Authority, Additional
    )
    
    # Question section
    question = b""
    for part in domain.split("."):
        question += struct.pack("B", len(part)) + part.encode()
    question += b"\x00"  # null terminator
    question += struct.pack(">HH", 1, 1)  # Type A, Class IN
    
    return header + question
 
def parse_tcp_header(data):
    """Parse TCP header fields."""
    fields = struct.unpack(">HHIIBBHHH", data[:20])
    return {
        "src_port": fields[0],
        "dst_port": fields[1],
        "seq_num": fields[2],
        "ack_num": fields[3],
        "flags": fields[5],
        "window": fields[6],
    }
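The flags value returned by the TCP parser is a single byte of bit flags, and the byte before it carries the header length in its top four bits. Here is a sketch of decoding both. The header is synthetic, and decode_tcp_flags is a hypothetical helper that covers only the six classic flags.

```python
import struct

TCP_FLAGS = [
    ("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
    ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20),
]

def decode_tcp_flags(flags_byte):
    """Return the names of the flag bits set in flags_byte."""
    return [name for name, bit in TCP_FLAGS if flags_byte & bit]

# Synthetic header: port 12345 -> 80, data offset 5 words, SYN+ACK
header = struct.pack(">HHIIBBHHH",
                     12345, 80, 1000, 2000, 5 << 4, 0x12, 65535, 0, 0)
fields = struct.unpack(">HHIIBBHHH", header)

data_offset = (fields[4] >> 4) * 4  # header length in bytes
flag_names = decode_tcp_flags(fields[5])
print(flag_names, data_offset)  # ['SYN', 'ACK'] 20
```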

Struct Objects for Performance

If you're unpacking the same format repeatedly, precompile it:

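import struct
 
data = struct.pack(">IHH", 100, 200, 300)  # sample 8-byte record
 
# Slower: the format string is looked up on every call
# (CPython caches recently used formats, so the gap is modest)
for _ in range(10000):
    struct.unpack(">IHH", data)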
 
# Fast: compile once, reuse
header_struct = struct.Struct(">IHH")
for _ in range(10000):
    header_struct.unpack(data)
 
# Also works for packing
header_struct.pack(100, 200, 300)
 
# Access size
print(header_struct.size)  # 8

Iterating Over Binary Data

import struct
 
def iter_records(data, fmt):
    """Iterate over fixed-size records."""
    record_size = struct.calcsize(fmt)
    for i in range(0, len(data), record_size):
        yield struct.unpack(fmt, data[i:i + record_size])
 
# Example: array of (int, float) pairs
data = struct.pack("if" * 3, 1, 1.0, 2, 2.0, 3, 3.0)
for int_val, float_val in iter_records(data, "if"):
    print(f"int={int_val}, float={float_val}")
 
# Using struct.iter_unpack (Python 3.4+)
for values in struct.iter_unpack("if", data):
    print(values)
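One thing to watch: struct.iter_unpack raises struct.error when the buffer length is not a multiple of the record size. If a trailing partial record is expected, for example while a file is still being written, one option is to trim the buffer first. iter_complete_records is a hypothetical helper.

```python
import struct

def iter_complete_records(data, fmt):
    """Yield every complete record and ignore a trailing partial one."""
    record = struct.Struct(fmt)
    usable = len(data) - (len(data) % record.size)
    # memoryview slicing avoids copying the buffer
    yield from record.iter_unpack(memoryview(data)[:usable])

# Two full 8-byte records followed by 3 stray bytes
data = struct.pack("<if", 1, 1.0) + struct.pack("<if", 2, 2.0) + b"\x00\x00\x00"
records = list(iter_complete_records(data, "<if"))
print(records)  # [(1, 1.0), (2, 2.0)]
```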

Packing Into Existing Buffers

import struct
 
# Create a buffer
buffer = bytearray(20)
 
# Pack into specific offset
struct.pack_into(">I", buffer, 0, 42)
struct.pack_into(">I", buffer, 4, 100)
 
# Unpack from specific offset
val1 = struct.unpack_from(">I", buffer, 0)[0]
val2 = struct.unpack_from(">I", buffer, 4)[0]
 
print(val1, val2)  # 42 100
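The same calls exist as methods on Struct objects, which is handy for building a message in one preallocated buffer. The header layout here, a 2-byte message type followed by a 4-byte payload length, is invented for the example.

```python
import struct

# Hypothetical header: 2-byte message type, 4-byte payload length
HEADER = struct.Struct(">HI")
payload = b"hello"

# Allocate once, then write the header and payload in place
buffer = bytearray(HEADER.size + len(payload))
HEADER.pack_into(buffer, 0, 7, len(payload))
buffer[HEADER.size:] = payload

# Read it back without slicing the header out first
msg_type, length = HEADER.unpack_from(buffer, 0)
body = bytes(buffer[HEADER.size:HEADER.size + length])
print(msg_type, length, body)  # 7 5 b'hello'
```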

ctypes for Complex Structures

When structs get complex, ctypes can be cleaner:

import ctypes
 
class Point(ctypes.Structure):
    _fields_ = [
        ("x", ctypes.c_int32),
        ("y", ctypes.c_int32),
    ]
 
class Header(ctypes.Structure):
    _pack_ = 1  # No padding
    _fields_ = [
        ("magic", ctypes.c_char * 4),
        ("version", ctypes.c_uint16),
        ("flags", ctypes.c_uint16),
        ("count", ctypes.c_uint32),
    ]
 
# Create from bytes (ctypes.Structure uses native byte order,
# so the values below assume a little-endian host)
data = b"TEST\x01\x00\x02\x00\x0a\x00\x00\x00"
header = Header.from_buffer_copy(data)
print(header.magic)    # b'TEST'
print(header.version)  # 1
print(header.count)    # 10
 
# Convert to bytes
print(bytes(header))
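ctypes.Structure always uses the host's byte order. For data with a fixed byte order, ctypes also provides BigEndianStructure and LittleEndianStructure. Here is a sketch with a made-up big-endian header. NetHeader and its fields are illustrative only.

```python
import ctypes

class NetHeader(ctypes.BigEndianStructure):
    # Fields are naturally aligned, so no packing directive is needed
    _fields_ = [
        ("version", ctypes.c_uint16),
        ("flags", ctypes.c_uint16),
        ("count", ctypes.c_uint32),
    ]

data = b"\x00\x01\x00\x02\x00\x00\x00\x0a"
header = NetHeader.from_buffer_copy(data)
print(header.version, header.flags, header.count)  # 1 2 10

# Serializing gives back the same big-endian bytes on any host
print(bytes(header) == data)  # True
```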

Nested Structures

import ctypes
 
class Vector3(ctypes.Structure):
    _fields_ = [
        ("x", ctypes.c_float),
        ("y", ctypes.c_float),
        ("z", ctypes.c_float),
    ]
 
class Particle(ctypes.Structure):
    _fields_ = [
        ("position", Vector3),
        ("velocity", Vector3),
        ("mass", ctypes.c_float),
    ]
 
particle = Particle()
particle.position.x = 1.0
particle.position.y = 2.0
particle.position.z = 3.0
particle.mass = 10.0
 
print(ctypes.sizeof(Particle))  # 28 bytes

Arrays in ctypes

import ctypes
 
class Packet(ctypes.Structure):
    _fields_ = [
        ("header", ctypes.c_uint32),
        ("data", ctypes.c_uint8 * 16),  # Fixed-size array
        ("checksum", ctypes.c_uint16),
    ]
 
packet = Packet()
packet.header = 0xDEADBEEF
packet.data[0] = 0x01
packet.data[1] = 0x02
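ctypes applies the same alignment rules as native-mode struct, so this Packet is larger than the sum of its fields. On typical platforms, where a 32-bit integer is 4-byte aligned, the sketch below reports 24 bytes rather than 22. It also shows filling the array from a bytes object.

```python
import ctypes

class Packet(ctypes.Structure):
    _fields_ = [
        ("header", ctypes.c_uint32),
        ("data", ctypes.c_uint8 * 16),
        ("checksum", ctypes.c_uint16),
    ]

# 4 + 16 + 2 = 22, rounded up to a multiple of the 4-byte alignment
print(ctypes.sizeof(Packet))   # 24 on typical platforms
print(Packet.checksum.offset)  # 20

packet = Packet()
payload = b"\x01\x02\x03\x04"
packet.data[:len(payload)] = list(payload)  # copy the bytes into the array
print(bytes(packet.data)[:4])  # b'\x01\x02\x03\x04'
```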

Common Patterns

import struct
from pathlib import Path
 
def read_binary_file(path, header_fmt):
    """Read file with known header format."""
    data = Path(path).read_bytes()
    header_size = struct.calcsize(header_fmt)
    header = struct.unpack(header_fmt, data[:header_size])
    body = data[header_size:]
    return header, body
 
def write_binary_file(path, header_fmt, header_values, body):
    """Write file with header and body."""
    header = struct.pack(header_fmt, *header_values)
    Path(path).write_bytes(header + body)
 
# Checksum helper
def checksum(data):
    """Simple XOR checksum."""
    result = 0
    for byte in data:
        result ^= byte
    return result
 
# Bit field extraction (common in protocols)
def extract_bits(value, start, length):
    """Extract bits from an integer."""
    mask = (1 << length) - 1
    return (value >> start) & mask
 
# Example: extract bits 4-7 from a byte
flags = 0b11010110
print(extract_bits(flags, 4, 4))  # 13 (0b1101)
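Writing a bit field is the mirror image of reading one: clear the target bits with a mask, then OR in the new value. insert_bits is a hypothetical counterpart to extract_bits, which is repeated here so the snippet runs on its own.

```python
def extract_bits(value, start, length):
    """Extract length bits from value, starting at bit start."""
    mask = (1 << length) - 1
    return (value >> start) & mask

def insert_bits(value, start, length, field):
    """Return value with length bits at start replaced by field."""
    mask = ((1 << length) - 1) << start
    return (value & ~mask) | ((field << start) & mask)

flags = 0b11010110
updated = insert_bits(flags, 4, 4, 0b0011)  # replace bits 4-7
print(bin(updated))                  # 0b110110
print(extract_bits(updated, 4, 4))   # 3
print(extract_bits(updated, 0, 4))   # 6 (low bits untouched)
```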

Gotchas I Hit

1. Tuples, always tuples:

# unpack always returns a tuple
value = struct.unpack("i", data)  # (42,) not 42
value = struct.unpack("i", data)[0]  # 42

2. Padding surprises:

# Native alignment adds padding
struct.calcsize("@bI")  # 8 (1 + 3 padding + 4)
struct.calcsize("<bI")  # 5 (no padding with explicit endian)

3. Strings must be bytes:

# This fails:
# struct.pack("4s", "test")  # struct.error
 
# Use bytes:
struct.pack("4s", b"test")  # Works
struct.pack("4s", "test".encode())  # Also works

4. String truncation:

# Strings are truncated or padded to fit
struct.pack("4s", b"hi")      # b'hi\x00\x00'
struct.pack("4s", b"hello")   # b'hell' (truncated!)
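Since truncation is silent, I find it safer to check the length before packing and to strip the null padding when reading back. pack_fixed and unpack_fixed are hypothetical helpers. The sketch assumes UTF-8 text that never ends in a null byte of its own.

```python
import struct

def pack_fixed(text, size):
    """Pack text into a fixed-size field, refusing to truncate."""
    raw = text.encode("utf-8")
    if len(raw) > size:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field holds {size}")
    return struct.pack(f"{size}s", raw)

def unpack_fixed(raw):
    """Strip trailing null padding and decode."""
    return raw.rstrip(b"\x00").decode("utf-8")

packed = pack_fixed("hi", 4)
print(packed)                # b'hi\x00\x00'
print(unpack_fixed(packed))  # hi
```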

When to Use What

Use struct when:

  • Simple, flat binary formats
  • One-off parsing tasks
  • Quick prototyping
  • Performance matters (it's C under the hood)

Use ctypes when:

  • Complex nested structures
  • Interfacing with C libraries
  • You want named field access
  • Structures are reused across codebase

Consider alternatives:

  • construct library for complex formats with parsing logic
  • kaitai struct for cross-language format definitions
  • protobuf / msgpack for your own serialization formats

Binary data seemed scary at first, but struct makes it approachable. Start with the format string, respect endianness, and watch out for padding. Once it clicks, you'll be reading file headers like it's nothing.
