Discussion about this post

Victualis

The "Don't Pass@k" paper (described in the post as a master's dissertation) was accepted at ICLR 2026: https://openreview.net/forum?id=PTXi3Ef4sT

Pawel Jozefiak

The point about benchmarks breaking when the judge can't keep up matches something I noticed in practice. I built on Mistral during the EU Hackathon last weekend, and what struck me wasn't any single failure; it was the cumulative management overhead.

Constant small corrections that add up. It's not captured in any benchmark I know of, but it changes what you're willing to build. 'Developer experience under realistic pressure' is probably its own evaluation dimension. Wrote about it here if anyone wants a non-benchmark data point: https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026
