Plausible-Sounding Nonsense

This week, Benedict Evans picked up a concept that Ethan Mollick and Fabrizio Dell'Acqua empirically documented in a 2023 BCG study: the jagged technological frontier.

The short version: LLMs do not have a smooth competence boundary. They solve some tasks better than human experts and fail at others that look trivial. The problem is that you do not know in advance which side of the boundary you are standing on.

Stay up to date

Get notified when I publish something new, and unsubscribe at any time.

Evans goes one step further and describes three jagged surfaces stacked on top of each other:

Model capabilities are unevenly distributed. ChatGPT calculates USPS package volumes incorrectly, Gemini extracts Census data correctly from PDFs. Different models, different strengths, and no reliable way to know this upfront.

Evaluation methods are themselves unreliable. Whoever checks the output has their own jagged competence profile. Whether an answer is correct depends on whether the reviewer chooses the right method for exactly this task.

User intuition about what works is systematically miscalibrated. People overestimate what the model can do in the wrong places. And underestimate it in others.

Three surfaces. None of them smooth. None of them stable, because every model update shifts the topography.

Same week, different context: after more than 30 experiments with synthetic users and over 2,000 hours of AI testing, Caitlin Sullivan drew a conclusion. Most teams using synthetic users for product discovery produce "plausible-sounding nonsense." Because they skip three things that academic studies include by default: the right data selection, the right question, and methodological validation.

It is the same mechanism. Sullivan describes it for one concrete use case, Evans describes it as a structural problem. But the core is identical: the output looks right. It sounds right. It could be right. And that is exactly what makes it dangerous, because unlike a Google search that obviously fails to answer, an LLM always produces an answer. Whether it is true is a different question. One that does not answer itself.

For dekodiert readers, this is the evaluation question in concentrated form. I have argued in earlier texts that evaluation is the bottleneck. Evans adds an uncomfortable layer: evaluation itself is unreliable. The checking is itself a jagged surface.

This is not an argument against using LLMs. It is an argument that the effort required for evaluation does not fall when models get better. It rises. Because better models produce more convincing wrong answers.

Sullivan spent 2,000 hours figuring out which questions work with synthetic users and which do not. Most teams invest zero.

The question I still cannot answer: if evaluation itself is unreliable, how do you build institutional competence for something where even experts cannot predict where the errors will happen?

Navigation

Plausible-Sounding Nonsense

Stay up to date

Sources