Why AI guardrails backfire at the worst possible moment
Why AI guardrails backfire at the worst possible moment
Here's a paradox that keeps me up at night: the safety features designed to protect vulnerable users can actually harm them.
A 2024 study published in Nature found something troubling. When researchers interviewed people who used generative AI chatbots for mental health support, "the closest participants came to reporting harmful experiences were those of being rejected by the guardrails during moments of vulnerability."
Read that again. The safety features—the guardrails meant to protect people—were experienced as harmful.
The blunt instrument problem
Most AI safety approaches work like this: detect potentially dangerous content, then refuse to engage.
User mentions self-harm? Shut down the conversation and display a helpline number.
On paper, this seems responsible. In practice, it can feel like abandonment at the worst possible moment.
Imagine you've finally worked up the courage to talk about something you've never told anyone. You type it out. And the AI responds with a generic message about crisis resources and ends the conversation.
You weren't in crisis. You were processing. But the guardrail couldn't tell the difference.
What the research tells us
The Nature study found that young people actually showed preference for generative AI support responses over those from peers, adult mentors, and therapists—except on topics that triggered safety guardrails.
The researchers concluded: "Providing the safest response to those in crisis may require a more nuanced, balanced and sophisticated approach, based on a more complete understanding of capabilities and risks."
In other words: crude safety features can do more harm than good.
The spectrum of distress
Not everyone who mentions difficult topics is in crisis. There's a spectrum:
- Processing past experiences — Someone reflecting on historical trauma, not current danger
- Seeking information — Someone researching for themselves or a loved one
- Expressing temporary distress — Someone having a hard day but not at risk
- Escalating crisis — Someone whose safety is genuinely at risk
Treating all of these the same is not safety. It's laziness dressed up as caution.
A more sophisticated approach
At NovaHEART, we've tried to build something more nuanced.
Context-aware detection. Phoenix doesn't just pattern-match on keywords. It considers conversation history, escalation patterns, and linguistic markers that distinguish processing from crisis.
Graduated responses. Not every mention of difficulty triggers shutdown. Some trigger gentle check-ins. Some trigger offers of grounding tools. Only genuine crisis indicators trigger full intervention.
Continued presence. When we do intervene, we don't abandon. We provide resources AND remain available for grounding support. The worst thing you can do to someone in distress is disappear.
Clear communication. We tell users what's happening and why. "I noticed you mentioned something that sounds really difficult. I want to make sure you have the right support. Here are some resources, and I'm also here if you'd like to try some grounding exercises together."
The accountability gap
Here's what frustrates me most: the companies deploying crude guardrails get to claim they're "prioritising safety" while causing real harm through their bluntness.
Meanwhile, companies trying to build more sophisticated, actually-helpful systems face scrutiny because they don't simply refuse to engage with difficult topics.
We need a more mature conversation about what safety actually means. It's not just about avoiding liability. It's about genuinely helping people—including the ones having the hardest time.
Building better
If you're building AI systems that might encounter vulnerable users:
1. Don't just pattern-match on keywords. Invest in contextual understanding.
2. Create graduated responses. Not everything requires shutdown.
3. Never abandon. If you intervene, stay present in some capacity.
4. Communicate clearly. Tell users what's happening and why.
5. Test with lived experience. People who've been through crisis can tell you what actually helps.
Safety isn't a checkbox. It's a design philosophy. And the blunt instrument approach isn't good enough.
This is why we built Phoenix the way we did. Not to check a box, but to actually help.