
A Green Eval Was Lying to Me
Adding strict typing to my eval harness exposed an intermittent judge failure, and fixing that exposed a second bug a passing eval had been hiding. The lesson: a green eval is not a correct eval.

Adding strict typing to my eval harness exposed an intermittent judge failure, and fixing that exposed a second bug a passing eval had been hiding. The lesson: a green eval is not a correct eval.

I made my RAG project production-ready by reading the Anthropic Python SDK end to end and stealing four patterns from it. Here is what each one was and why it mattered.