The Human Creativity Benchmark. Measuring creative quality means preserving disagreement, not averaging it out.

Contra Labs built a benchmark for evaluating generative AI in creative work — and the most interesting design decision is what they do with disagreement. Instead of collapsing evaluator opinions into a single score, they treat divergence as a signal in its own right. Some differences in judgment are noise; others are legitimate differences in taste. The benchmark tries to separate the two.

What they found: no single model wins across the board. Different models are better at different phases — ideation vs. refinement, constraint-following vs. creative flexibility. And the more specific the brief, the more evaluators agree — constraints produce consensus.

The implication for how we think about AI in design work: "creative quality" isn't one thing. A model that's good at following instructions isn't necessarily good at making interesting choices. Those are different capabilities, and collapsing them into a single score hides the distinction that actually matters for choosing the right tool at the right moment.