Posts

A managed eval tool adopted but not relied on, with a single item-018 score delta flagged as noise

The Eval Tool I Adopted but Won't Rely On

I migrated my eval harness to a managed tool, got it working, and then decided not to trust it as canonical. A single per-item delta on identical code is why.

The same eval scored two ways: a buggy run averaging 3.40 and the corrected run averaging 4.95

A Green Eval Was Lying to Me

Adding strict typing to my eval harness exposed an intermittent judge failure, and fixing that exposed a second bug a passing eval had been hiding. The lesson: a green eval is not a correct eval.

Four patterns read out of a production SDK and applied to a RAG project

What I Learned Reading a Production SDK Cover to Cover

I made my RAG project production-ready by reading the Anthropic Python SDK end to end and stealing four patterns from it. Here is what each one was and why it mattered.

My Eval Dashboard Showed 1 Trace. I Expected 120.

Adding Langfuse to a RAG pipeline looked done until the dashboard showed one trace and zero scores. The real problem was trace structure, not instrumentation. Here is the gotcha, the fix, and the numbers.

I built an eval harness for my RAG pipeline: 40 questions, 3 scorers, and the one number that told me where it breaks

I Built an Eval Harness for My RAG Pipeline. Here's What the Numbers Revealed.

An automated eval harness with 40 golden questions and three scorers turned 'looks reasonable' into a precise diagnosis of where my RAG pipeline actually breaks.