I've been building Owen, a task automation system that coordinates AI coding agents. It started with zero tests. Now it has 360. Here's what I learned along the way.

The Numbers

  • 360 tests across 17 test files
  • 5,512 lines of test code
  • ~0.1 seconds to collect all tests
  • 8 packages in the monorepo

Not bragging—just context. More tests isn't always better, but coverage gives me confidence to ship fast.

What Actually Helped

1. Fixtures Over Setup Methods

Early on, I had setUp() methods everywhere. They got messy fast. Pytest fixtures are better:

import pytest
from unittest.mock import patch

@pytest.fixture
def mock_gmail():
    with patch('owen.integrations.gmail.GmailClient') as mock:
        mock.return_value.list_messages.return_value = []
        yield mock

Now I can compose fixtures. A test that needs Gmail and GitHub just lists both:

def test_full_workflow(mock_gmail, mock_github, tmp_path):
    # Both mocks are ready, tmp_path gives me a scratch directory
    ...

2. Mark the Slow Tests

Some tests hit real APIs. I mark them:

@pytest.mark.live
def test_live_current_weather():
    result = weather.get_current("New York")
    assert "temperature" in result

Then I can skip them in CI:

pytest -m "not live"

Local development runs the full suite. CI runs the fast subset. Everyone's happy.
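One housekeeping detail: pytest warns about unknown marks unless they're registered, so the `live` marker needs an entry in pytest.ini (or the equivalent in pyproject.toml):

```ini
[pytest]
markers =
    live: tests that hit real external APIs
```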

3. Test the Boundaries, Not the Glue

My decision engine has a lot of logic. I don't test every internal function. I test:

  • Inputs: Does it parse config correctly?
  • Outputs: Does it produce the right action?
  • Edge cases: What happens with empty state?

The internal wiring changes. The contract doesn't.
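To make that contract framing concrete, here's what it looks like against a toy stand-in (the `DecisionEngine` below is a minimal sketch for this post, not Owen's real engine):

```python
from dataclasses import dataclass

@dataclass
class Result:
    action_id: str
    reason: str

class DecisionEngine:
    """Toy stand-in: decides the next action from a state dict."""
    def __init__(self, state: dict):
        self.state = state

    def decide(self) -> Result:
        # Edge case first: empty state means there's nothing to act on.
        if not self.state:
            return Result(action_id="idle", reason="empty_state")
        return Result(action_id="check_email", reason="cooldown_expired")

# Contract tests: inputs, outputs, edge cases -- not internals.
def test_empty_state_is_a_noop():
    assert DecisionEngine({}).decide().action_id == "idle"

def test_pending_state_produces_an_action():
    result = DecisionEngine({"inbox": "stale"}).decide()
    assert result.action_id == "check_email"
```

The internals of `decide()` can be rewritten freely; as long as the state-in, action-out contract holds, these tests stay green.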

4. Parametrize When You See Patterns

This was repetitive:

def test_parse_duration_30m():
    assert parse_duration("30m") == timedelta(minutes=30)
 
def test_parse_duration_2h():
    assert parse_duration("2h") == timedelta(hours=2)

This is better:

@pytest.mark.parametrize("raw, expected", [
    ("30m", timedelta(minutes=30)),
    ("2h", timedelta(hours=2)),
    ("1d", timedelta(days=1)),
])
def test_parse_duration(raw, expected):
    assert parse_duration(raw) == expected

Same coverage, less code, easier to extend.

5. One Assert Per Concept

Not literally one assert statement—one concept being tested. This is fine:

def test_action_result():
    result = engine.decide()
    assert result.action_id == "check_email"
    assert result.reason == "cooldown_expired"

Both asserts verify the same concept: "the engine picked the right action."

This is bad:

def test_everything():
    result = engine.decide()
    assert result.action_id == "check_email"
    assert config.load() is not None  # unrelated
    assert len(history) > 0  # also unrelated

If this fails, I don't know which concept broke.

What I Got Wrong

Mocking Too Deep

I mocked internal methods instead of external boundaries. When I refactored, tests broke even though behavior was unchanged. Now I mock at the integration layer—HTTP clients, file systems, external APIs.
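Here's a sketch of what mocking at the boundary looks like, using an illustrative `HttpClient` (none of these names are Owen's real code):

```python
from unittest.mock import patch

class HttpClient:
    """The external boundary: the only thing that talks to the network."""
    def get_json(self, url: str) -> dict:
        raise NotImplementedError("real network call lives here")

def check_inbox(client: HttpClient) -> int:
    payload = client.get_json("https://mail.example.com/messages")
    return len(payload["messages"])

def test_check_inbox_counts_messages():
    # Mock the boundary, not check_inbox's internals: the test survives
    # any refactor that keeps the HTTP contract the same.
    with patch.object(HttpClient, "get_json",
                      return_value={"messages": ["a", "b"]}):
        assert check_inbox(HttpClient()) == 2
```

Everything between the function under test and the boundary is fair game for refactoring without touching a single test.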

Not Running Tests Before Push

See my post on CI catching what I missed. Local test runs aren't optional.

Testing Implementation Details

I wrote tests like "check that _internal_helper() is called with X." Then I renamed the helper. Test broke. Behavior was fine.

Test what the code does, not how it does it.
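For example, the brittle version of this test would assert that an internal helper was called with particular arguments; the robust version only checks the output (`slugify` here is illustrative, not from Owen):

```python
def _normalize(text: str) -> str:
    # Internal helper: free to rename or inline without breaking tests.
    return text.strip().lower()

def slugify(title: str) -> str:
    return _normalize(title).replace(" ", "-")

# Tests the observable behavior, so it survives internal refactors.
def test_slugify_behavior():
    assert slugify("  Hello World  ") == "hello-world"
```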

The Payoff

I ship multiple times a day. Most pushes don't break anything. When they do, I usually know within seconds—not minutes, not hours.

360 tests sound like a lot. Spread across 8 packages, it's about 45 per package. That's not crazy. That's just... covering the important stuff.

Write the tests. Run the tests. Trust the tests.
