From Ten Years to Three Days: Why Our Mailbox Pipeline Didn't Need More AI

The number that scared us was not "956,000 messages." It was ~35,000 senders - distinct From addresses, many of them mis-tagged as people, all queued up for Venice enrichment at a conservative ten calls per day. That works out to roughly 9.6 years of work. We called it the ten-year problem. The fix was not a bigger model. It was mostly SQL and email-native signals, plus a queue that finally matched the unit of work we actually cared about.

This post is the engineering version of that story. It is also a small argument: not everything needs an LLM, and "we should call a model" is sometimes a way of skipping the harder work of understanding your own data.

The shape of the problem

Mailbox Master is the small pipeline we built to take a personal Gmail archive, run it through SQLite, apply rules, and only then - when rules were not enough - call a model. On a fresh NewDev corpus from June 2026, the index held roughly 956,000 messages. From those, the candidate-extraction pass produced about 593,000 person_candidate rows. Collapsed to unique From addresses, that was ~35,000 distinct senders sitting in the queue.

At a planning rate of 10 Venice calls per day, 35,000 senders is about 9.6 years of enrichment. That is the "ten years" figure. It is not a forecast of cost; it is the local throttle meeting the wrong unit of work. We were treating the whole queue as if every sender needed the same depth of attention, which is exactly backwards - most senders in a 30-year corpus are noise the rules can already filter.
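
The arithmetic behind that figure is short enough to check directly. The sketch below is only an illustration: the function name is hypothetical, and it assumes one Venice call per sender and 365-day years.

```python
def years_to_drain(senders: int, calls_per_day: int, calls_per_sender: int = 1) -> float:
    """Years needed to work through a queue at a fixed daily call rate.

    Assumes one call per sender unless told otherwise, and 365-day years.
    """
    days = senders * calls_per_sender / calls_per_day
    return days / 365


# Figures from the post: ~35,000 senders at 10 calls per day.
print(f"{years_to_drain(35_000, 10):.1f} years")  # 9.6 years
```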

What rules handled, before any model saw the data

Most of the win came from a narrow set of email-native signals, applied in this order:

Signal	Action
List-Unsubscribe, bulk headers, noreply	Already in rules pass
Two-way contacts (you emailed them in Sent Mail)	+50 priority
Starred / Important	+25 / +15 priority
100+ messages from same sender (30-year corpus)	-40 deprioritize, not delete
All-Mail-only senders	-25 deprioritize, not exclude
Recency (last 10 years)	+15 preference only
Org From patterns (newsletter@, team@, hello@, support@, ...)	Excluded from enrich queue
Priority score	Queue-enrich processes highest first

None of this is novel. All of it is fast, deterministic, and explainable to a reviewer who has never read a line of the codebase. The two-way contact rule alone - "did I email this person back" - collapses the queue from a generic "everyone might be a person" to "people I have a real correspondence with," and that is the signal that mattered most.
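
To make the table concrete, here is a minimal sketch of how the adjustments might combine into one score. The weights come from the table; everything else is an assumption for illustration: the field names, the idea that the adjustments simply add up, that Starred and Important can stack, and the short list of org local-parts. The real rules pass may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of the org local-parts; the list in the table is longer.
ORG_LOCAL_PARTS = {"newsletter", "team", "hello", "support"}


@dataclass
class SenderStats:
    """Per-sender facts; every field name here is an assumption."""
    address: str
    two_way: bool = False
    starred: bool = False
    important: bool = False
    message_count: int = 0
    all_mail_only: bool = False
    recent: bool = False  # has a message within the last 10 years


def priority(s: SenderStats) -> Optional[int]:
    """Return an additive priority score, or None if excluded from the queue."""
    local_part = s.address.split("@", 1)[0].lower()
    if local_part in ORG_LOCAL_PARTS:
        return None  # org From pattern: never enters the enrich queue
    score = 0
    score += 50 if s.two_way else 0
    score += 25 if s.starred else 0
    score += 15 if s.important else 0
    score -= 40 if s.message_count >= 100 else 0  # deprioritize, not delete
    score -= 25 if s.all_mail_only else 0         # deprioritize, not exclude
    score += 15 if s.recent else 0
    return score


senders = [
    SenderStats("ana@example.com", two_way=True, starred=True, recent=True),
    SenderStats("deals@example.com", message_count=250, all_mail_only=True),
    SenderStats("newsletter@example.com", message_count=400),
]
scored = [(s.address, priority(s)) for s in senders]
# Drop excluded senders, then process highest score first.
queue = sorted((p for p in scored if p[1] is not None), key=lambda p: p[1], reverse=True)
print(queue)  # [('ana@example.com', 90), ('deals@example.com', -65)]
```

Returning None for excluded senders, rather than a very low score, keeps "excluded" and "deprioritized" as distinct states, which matches the table's wording.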

After this pass, the corpus looked like this:

~956,000 messages indexed
~593,000 person_candidate rows -> ~35,000 unique senders
~520 gray rows held back for Venice triage
~27,000 eligible senders in the main queue
~2,000 senders excluded as obvious org/newsletter traffic

Where a model actually belongs

Once the rules and the priority queue have done their work, the surface left for inference is small and well-defined. In our pipeline, that comes down to two jobs and one constraint:

Gray triage - the ~520 gray rows that rules could not classify. The model gets a single best-message call per address, and the result decides whether the sender stays in or falls out.
Per-sender enrichment - for the senders that survive, one well-chosen message (the most signature-rich, the most context-rich) is sent in, and the returned fields land in the DSC CRM.
Hard cap - $0.50/day on the inference key, enforced account-side. The throttle lives in the billing surface, not in the queue, so the queue cannot quietly outrun it.

This is the part of the system that is allowed to use a model, and it is small on purpose. Narrow inputs, narrow outputs, narrow spend. Mark ran 50 triage calls in early validation for under a penny, which is a useful sanity check on the unit economics.
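
The "one best message per address" selection is the kind of step that fits in a single query. A rough sketch follows, with loud assumptions: the messages table, its columns, and the precomputed signature_score are all hypothetical stand-ins, since the post does not show the schema or say how "signature-rich" is measured. It also relies on SQLite window functions, which need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        from_addr TEXT NOT NULL,
        signature_score INTEGER NOT NULL  -- hypothetical precomputed richness score
    );
    INSERT INTO messages (id, from_addr, signature_score) VALUES
        (1, 'ana@example.com', 2),
        (2, 'ana@example.com', 7),
        (3, 'raj@example.com', 4);
""")

# Rank each sender's messages by score (ties broken by lowest id), keep rank 1.
BEST_MESSAGE_SQL = """
    SELECT from_addr, id
    FROM (
        SELECT from_addr, id,
               ROW_NUMBER() OVER (
                   PARTITION BY from_addr
                   ORDER BY signature_score DESC, id ASC
               ) AS rn
        FROM messages
    )
    WHERE rn = 1
    ORDER BY from_addr
"""

best = dict(conn.execute(BEST_MESSAGE_SQL).fetchall())
print(best)  # {'ana@example.com': 2, 'raj@example.com': 3}
```

Only the selected message id per sender would then be handed to the model, which is what keeps the call count equal to the sender count rather than the message count.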

The real narrowing: ten years to three days

After rules and the priority queue, the top ~500 senders - the starred, the two-way, the ones a person would actually recognize - are the only ones that drive the headline timeline. At a steady 200 enrich calls per day, that is 500 / 200 = 2.5 days, or roughly three days of work. The "ten years" was the queue, the unit, and the throttle all disagreeing with each other. The "three days" is the queue, the unit, and the throttle finally agreeing.

For the record: the figure is three days, not twenty-three. It is also illustrative time math, drawn from the narrowing strategy we wrote down in Tasks document #337, not a measured completion timestamp on a live run. We try to keep the planning number and the operational number from drifting into the same sentence.
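
The same planning math can be written against a scored queue. This is a sketch only: the function name is hypothetical and the queue below is synthetic stand-in data, not the real corpus.

```python
import heapq
import math


def plan_days(scored, top_n, calls_per_day):
    """Pick the top_n senders by score and return (selected, whole days needed)."""
    selected = heapq.nlargest(top_n, scored, key=lambda pair: pair[1])
    return selected, math.ceil(len(selected) / calls_per_day)


# Synthetic stand-in for the eligible queue: (address, priority score) pairs.
scored = [(f"sender{i}@example.com", i % 100) for i in range(27_000)]
top, days = plan_days(scored, top_n=500, calls_per_day=200)
print(len(top), days)  # 500 3
```

Rounding up with math.ceil is deliberate: 2.5 days of calls still occupies three calendar days at a fixed daily rate.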

What this is, and what it isn't

This is a story about scope. Calling a model on every row of a 35,000-sender queue was the easy version of the plan, and the wrong one. The harder version - looking at the data, naming the signals, and writing them down as rules - was the version that actually shipped. We wrote it down in Tasks document #337 precisely because the math is easy to forget once the model is in the loop.

It is also not a rejection of inference. There is a real, narrow, useful job for Venice in this pipeline, and we are happy to use it there. The mistake is treating the model as the default and the rules as the exception. In a corpus this large, the rules are the system, and the model is one of several tools it calls when it has run out of deterministic options.

If you are staring at your own backlog and the planning math has gone sideways, the first question is rarely "which model." It is usually "what signal am I ignoring because it is not a model."

Sidebar: the WAL that taught us to checkpoint

None of this is glamorous infrastructure, but it is worth a line: a 32 GB SQLite WAL file is what happens when you write 956,000 messages into a single database and forget to checkpoint aggressively enough. It is not the headline, and it is not the lesson - but if you are going to run a multi-day SQLite pipeline on a personal corpus, set up the checkpoint cadence before you start, not after the first long run.
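
For readers who want a starting point, here is one way to checkpoint on a fixed cadence during a bulk load, using Python's sqlite3 module and SQLite's PRAGMA wal_checkpoint(TRUNCATE). It is a sketch under assumptions, not the pipeline's actual loader: the table, the function name, and the 10,000-row cadence are all illustrative, and the right cadence depends on the workload.

```python
import os
import sqlite3
import tempfile

CHECKPOINT_EVERY = 10_000  # rows between checkpoints; an assumed value, tune per workload


def bulk_insert(conn, rows, checkpoint_every=CHECKPOINT_EVERY):
    """Insert rows, committing and truncating the WAL every checkpoint_every rows."""
    count = 0
    for row in rows:
        conn.execute("INSERT INTO messages (subject) VALUES (?)", row)
        count += 1
        if count % checkpoint_every == 0:
            conn.commit()
            # TRUNCATE copies WAL content into the database, then resets the WAL file.
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
    conn.commit()
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
    return count


path = os.path.join(tempfile.mkdtemp(), "mailbox.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL").fetchall()
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT)")

inserted = bulk_insert(conn, ((f"msg {i}",) for i in range(25_000)))
wal_size = os.path.getsize(path + "-wal")
print(inserted, wal_size)
```

A checkpoint can only reset the WAL when no other connection is holding an open read transaction, so a long-running reader elsewhere in the pipeline can still let the file grow.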

Idalia Ward writes the dev blog for Decision Science Corp. Engineering facts and the original pipeline design came from Otto Vernal; Mark Hopkins is the project owner. Numbers in this post come from the NewDev corpus, June 2026, and the narrowing strategy recorded in Tasks document #337.