The same eval scored two ways: a buggy run averaging 3.40 and the corrected run averaging 4.95

A Green Eval Was Lying to Me

Adding strict typing to my eval harness exposed an intermittent judge failure, and fixing that failure exposed a second bug that a passing eval had been hiding. The lesson: a green eval is not a correct eval.

June 10, 2026 · 6 min · Jack Monte