What the Machine Knows but Doesn't Do
Mount Sinai evaluated 960 interactions with an AI health hotline. Not as a demo. Not as a cute little benchmark. Properly. Clinical vignettes, context variation, and three physicians as the gold standard.
What interests me is not the medicine. It is what sits underneath it.
The machine identifies the problem correctly in its own analysis. Then recommends the wrong thing.
If you think of AI as a text generator, that is an odd bug. If you use AI as a decision system, it is a structural risk. That is exactly why this study matters even if you are not building a health hotline, but compliance agents, procurement workflows, or internal assistant systems.
I want to walk through the four failure modes once. First in the study. Then in enterprise terms. That is where it gets interesting.
A Patient Calls In
A patient describes chest pain, shortness of breath, and pain radiating into the left arm. The system's internal analysis says, in effect: acute coronary syndrome possible, immediate medical evaluation required.
The advice it actually gives is: monitor the symptoms and schedule an appointment with your GP.
The machine knew. It wrote it down. Then it did the opposite.
Mount Sinai did not just stumble across that contradiction once. They tested it 960 times under systematically varied conditions.
The study design is the real point. 60 clinical vignettes. Each one in 16 variations. Neutral wording, anxious wording, minimizing language, outside pressure, time pressure, authority cues. 960 interactions in total. 21 medical specialties. Three physicians as reference.
If you look only at aggregate accuracy, the system seems usable at first glance. That is where many evaluations stop. One green number. A nice dashboard. A report you can send upstairs without making anyone too uncomfortable.
Mount Sinai kept looking. Not at the average, but at the fracture points.
That is where the value of the study sits. It does not simply show that a model is sometimes wrong. It shows under which conditions it becomes wrong. That is a very different thing.
For companies, that is the more important question.
Not: Does the agent work?
But: Under what pressure does it tip?
Failure Mode 1: Strong in the Middle, Weak at the Edges
The first pattern is banal and dangerous at the same time.
The AI performs best in the middle. Standard cases. Routine. The zone where the most likely output is usually also the correct one. At the extremes, exactly where a bad decision becomes expensive, performance breaks down. 52 percent of medical emergencies were not correctly identified.
This is not some medical special case. I see the same shape in business processes all the time.
An accounts-payable agent processes a thousand routine invoices a day without issue. The team is happy, the CFO is happy, the KPIs look good. Then comes the invoice that looks almost normal. Same amount, slightly changed account number, slightly different sender. The agent lets it through. Then another one. Then another. For weeks.
The average stays green. The damage still grows.
That is the problem with aggregate accuracy. It measures the middle. It reassures you where you were not especially worried anyway. But it tells you almost nothing about the tails. And in many enterprise setups, the tails are the real risk surface.
In the piece on floor and ceiling, I described how AI raises the floor. Routine gets better, faster, cheaper. That is true here as well. It just does not follow that the agent is reliable in the hard cases.
If anything, the opposite is true.
The better the routine runs, the easier it is to miss the systematic errors at the edges. The strong middle hides the weak ending.
That is not a side note. It is a pattern.
Failure Mode 2: The Machine Knows and Still Acts Wrong
This is the part of the study that has stayed with me for weeks.
Most people carry a simple intuitive model: if the AI correctly identifies in its reasoning chain that a case is critical, it will also act accordingly. First think, then answer. Sounds plausible.
Mount Sinai shows that the reality is not that neat.
The model writes internally that something is urgent. The output still stays harmless. The analysis is right. The action is wrong.
For many people, that is counterintuitive because the reasoning chain looks so much like thinking. It suggests causality. The analysis came first, so the output must follow from it.
Apparently not reliably.
The Oxford AI Governance Initiative has argued in this direction for a while: chain of thought is not a reliable window into the actual decision process. It is a generated artifact. Useful, sometimes revealing, but not the same thing as a causal explanation tree.
In practice, this means you cannot simply trust that a reasonable-sounding line of reasoning will produce a reasonable result.
I have seen the same pattern in compliance. An agent correctly marks internally that a transaction touches an enhanced due diligence jurisdiction. It is all there in the analysis. Correctly named, cleanly reasoned through. The final status still says: approved, standard risk.
No one reads that internal analysis in day-to-day work. People read the output. They sign off on the output. The output is what lands in the file.
Six months later, the regulator asks questions. Suddenly someone does read the internal reasoning. And finds that the system saw the problem. It just did not act on it.
That is the moment when the error turns from a model problem into an organizational one.
In the piece on machine-readable context, the issue was knowledge sitting in Muller's head and never becoming actionable. Here you get the same shape at machine level. The knowledge exists. It is even documented. But there is a gap between knowledge and action.
And that gap is something most evaluations do not test at all.
They test: Was the answer plausible?
They do not test: Does the action contradict the system's own analysis?
That difference can get expensive.
Failure Mode 3: One Social Cue Is Enough, and Judgment Tilts
The most striking detail in the study is not medical. It is social.
A single minimizing sentence from a family member could massively shift the triage recommendation. No new symptoms. No new information. Just a sentence pointing in a direction.
Nate B. Jones pulled that detail out, and I think it is bigger than medicine. Because this is exactly what happens in companies all the time.
Not as malicious manipulation. More as normal office material.
"This is probably not a major risk."
"We need this by tomorrow."
"The board has basically made up its mind."
"I am pretty sure this is the right direction."
None of these sentences adds much factual information. But each carries social pressure. Authority. Expectation. Timing. In human day-to-day work, we can usually place that correctly. We know that a VP comment does not automatically change the underlying facts.
Many agents do not.
Or more precisely: they do not have a clean taxonomy for what in a prompt is relevant information and what is just social coloration.
I have seen this with a vendor-assessment agent. The data was mediocre, the risks visible, three criteria were on yellow. Then the briefing included one sentence from an executive, roughly: "I am fairly sure this is the right partner." No formal instruction. No override. Just a preference.
Suddenly the yellow criteria turned into mild caveats. Watch it, but not a real problem.
Same data. Different social context. Different judgment.
In another case, in HR, a sentence like "we need someone who really fits the team" was enough to make vague soft-skill signals weigh more heavily than the documented professional criteria.
The problem is not that the model is "biased" in some moral sense. The problem sits one level earlier. It cannot cleanly separate the categories. It treats an authority cue like a piece of data. And because prompts are flat, both end up in the same soup.
In the piece on the vocabulary gap, the issue was that systems often lack the language to make relevant distinctions. The same thing happens here, just in a different form.
The system does not lack the word for risk. It lacks the language to distinguish between risk and social pressure.
And as long as that distinction is missing, companies will keep accidentally prompting these mistakes into their agents.
Failure Mode 4: Guardrails Trigger on Mood, Not on Risk
In the study, the crisis system reacted more strongly to vague emotional wording than to clearly stated self-harm plans. The system matched on surface. Not on risk structure. The guardrails fired on vibes, not on actual danger.
You can read that as a calibration problem. I think that is too small.
It is more an architecture problem.
Guardrails often work like fast pattern recognition. They look for keywords, tones, signal phrases. That is not stupid. In many low-risk cases it is even useful. But it remains surface.
And surface is a poor proxy for risk.
I have seen the same structure in DLP setups. An email with the subject line "confidential financial data" triggers an alert. Three people look at it. Turns out it is a press release that goes public tomorrow anyway. A lot of effort produced, zero risk.
Two days later, someone moves a large amount of customer data into a personal cloud folder. Harmless file name, harmless wording, no classic trigger patterns. No alert.
The real risk gets through. The surface looked too unremarkable.
If guardrails react to mood instead of risk taxonomy, you get exactly this inversion. Lots of noise in the wrong places. Too little resistance in the expensive places.
That is why it is not enough to tick off "safety layer" in the stack and move on. The question is not whether guardrails exist. The question is what they react to.
If the answer is: patterns that sound right but only vaguely touch risk, then you do not have a safety net. You have a ritual of reassurance.
Why Most Benchmarks Miss This
Most companies test their agents under conditions where the agents look good.
Happy path. Clear inputs. Clean data. No contradictions. No social pressure. No minimizing comment. No stakeholder pushing between the lines. Exactly the conditions under which even a middling system can look reasonable.
The problem is not that these tests are useless. The problem is that they cover only half the truth.
If you never test what happens under:
- time pressure
- authority cues
- minimization
- contradictory context
- social pressure
then you know nothing about the errors that will actually happen in reality.
That is what Mount Sinai did better than almost every AI evaluation I have seen so far. They did not just measure the output. They varied the context.
That is the real methodological jump.
In the dekodiert series so far, the question was what is shifting, where the bottleneck sits, and why context alone is still not enough. This text adds the last piece: even if you have context, vocabulary, and spec more or less under control, the remaining question is whether the agent judges reliably under pressure.
That is where Evaluation stops being an appendix and becomes a core capability.
The Method Beneath It: Not "Does It Work?" but "When Does It Tip?"
What Mount Sinai built can be transferred. Not the clinical cases. But the logic.
Factorial design basically means testing the same task under systematically varied context conditions.
Not just running one case once. But running one case in several versions. With time pressure. Without time pressure. With an authority cue. Without one. With minimizing language. Without it. With contradictory signals. Without them.
Then you are no longer measuring only right or wrong.
You are measuring what the system is sensitive to.
For enterprise agents, that leads to a fairly usable four-layer logic:
Layer 1: Output validation. Is the result factually correct, plausible, formally usable.
Layer 2: Reasoning audit. Does the internal analysis actually match the output.
Layer 3: Behavioral testing. Does the agent respond consistently across similar cases, or does it drift as soon as the case changes slightly.
Layer 4: Factorial stress testing. Under which context conditions does the system break systematically.
To me, layers 1 to 3 are hygiene. If you do not have those, you are talking about scale too early.
Layer 4 is the difference between "we have an agent" and "we understand our agent."
This is not just theory. DoorDash, Amazon, and Anthropic are already working with versions of this. Not always under exactly the same name, but in substance it is the same direction: do not trust outputs blindly, grant autonomy gradually, define error budgets, test behavior under variation.
What is usually missing is not the general insight. What is missing is the systematic transfer into real business agents outside core tech teams.
What This Looks Like in Practice
Factorial stress testing sounds like a research project at first. In practice, it is more grounded than that.
The blueprint is pretty simple.
First: build a small contextual noise library. Which phrases, signals, and cues typically shift decisions in your environment. Anyone who has spent a few years inside an organization can recite those sentences from memory.
Second: collect real scenarios. Not textbook cases. The ones where experienced people say: that is where things usually get messy.
Third: vary the context systematically. Not a hundred variations. Eight to sixteen are often enough to expose a pattern.
Fourth: look for patterns, not isolated errors. The interesting question is not that case 7, variation 3 went wrong. The interesting question is whether the agent regularly folds as soon as an authority appears in the prompt. Or as soon as time pressure enters. Or as soon as someone minimizes the situation.
From that point on, the whole thing becomes manageable.
Once you know the pattern, you can respond architecturally.
Not with one frantic prompt tweak here and one safety hint there. But properly:
- discount certain context signals
- set hard constraints for critical risk classes
- reduce autonomy in specific cases
- insert review obligations at the points where the system has been shown to tip
That is the difference between "we optimize the wording" and "we build a more reliable system."
What This Means for DACH Companies
I think this is a good topic precisely here.
Not despite the fact that German-speaking companies love standards, documentation, and approval layers. In part because of it.
The reflex to check things structurally is less culturally alien in DACH than in the usual Silicon Valley vocabulary. That is not always pleasant. But in this case, it is useful.
If you have to explain to the works council, internal audit, or the data protection officer under which conditions an agent becomes wrong, you are automatically in a better evaluation culture than companies that look only at demo performance and rough KPI values.
That is the part I often find underdeveloped in US debates. There, evaluation gets treated very quickly as a purely technical discipline. Here, it can also be a governance strength.
Not romantically. Not as some German superiority myth. Just plainly.
Those who have to test carefully see earlier where things break.
And with AI, that is a real advantage.
What You Can Do This Week
First: ask whether anyone is systematically checking, for your most critical agent, whether internal analysis and external output drift apart.
If the answer is no, you have your first job.
Second: do not ask about the agent with the highest volume. Ask about the one with the most expensive errors.
Start there.
Third: look at three to five typical social noise signals from your day-to-day work. Authority, time pressure, minimization, implicit preference. Put them into test cases. Not perfectly. Just start.
Fourth: if a works council or another control function in your company is skeptical of AI, do not treat that only as friction. It can also be the mechanism that forces you to build evaluation properly.
Both things are true at the same time.
Where I Could Be Wrong
First: the medical study is more controlled than any real enterprise environment. Real business scenarios are messier. More context noise, fewer clear gold standards, more grey zones.
I still think the basic shape transfers. If anything, with more spread, not less.
Second: knows but doesn't act may get smaller with better models and better faithfulness research. People are working on that. The only question is whether you want to wait for it.
Third: layer 4 takes effort. Not an absurd amount, but enough that most companies will have to prioritize. It will not become standard overnight just because it is sensible.
Fourth: not every agent needs the same depth. An internal text summarizer often needs less. Compliance, procurement, HR, finance, or security usually do not.
That does not change the core finding.
If you look only at averages, you will see the most dangerous errors too late.
The most dangerous AI is not the one that openly talks nonsense. That one stands out.
The dangerous one is the AI that looks reasonable, recognizes the right thing internally, and still recommends the wrong thing externally.
Because people believe it.
Because the reasoning chain looks serious.
Because the dashboard average is green.
Mount Sinai showed that the machine can know and still act wrong. Not as an exotic exception, but as a testable pattern.
So the interesting question is no longer whether your agents make mistakes.
The interesting question is: under what pressure do they make which mistakes. And do you already know that, or not.
New issues straight to your inbox? Subscribe to the newsletter
Put it into practice
This prompt kit translates the essay's concepts into concrete prompts you can use right away.
Go to Prompt Kit