AI Safety

Why AI guardrails backfire at the worst possible moment

Frank Ivors

November 15, 2024

7 min read

Why AI guardrails backfire at the worst possible moment

Here's a paradox that keeps me up at night: the safety features designed to protect vulnerable users can actually harm them.

A 2024 study published in Nature found something troubling. When researchers interviewed people who used generative AI chatbots for mental health support, "the closest participants came to reporting harmful experiences were those of being rejected by the guardrails during moments of vulnerability."

Read that again. The safety features—the guardrails meant to protect people—were experienced as harmful.

The blunt instrument problem

Most AI safety approaches work like this: detect potentially dangerous content, then refuse to engage.

User mentions self-harm? Shut down the conversation and display a helpline number.

On paper, this seems responsible. In practice, it can feel like abandonment at the worst possible moment.

Imagine you've finally worked up the courage to talk about something you've never told anyone. You type it out. And the AI responds with a generic message about crisis resources and ends the conversation.

You weren't in crisis. You were processing. But the guardrail couldn't tell the difference.

What the research tells us

The Nature study found that young people actually showed preference for generative AI support responses over those from peers, adult mentors, and therapists—except on topics that triggered safety guardrails.

The researchers concluded: "Providing the safest response to those in crisis may require a more nuanced, balanced and sophisticated approach, based on a more complete understanding of capabilities and risks."

In other words: crude safety features can do more harm than good.

The spectrum of distress

Not everyone who mentions difficult topics is in crisis. There's a spectrum:

Processing past experiences — Someone reflecting on historical trauma, not current danger
Seeking information — Someone researching for themselves or a loved one
Expressing temporary distress — Someone having a hard day but not at risk
Escalating crisis — Someone whose safety is genuinely at risk

Treating all of these the same is not safety. It's laziness dressed up as caution.

A more sophisticated approach

At NovaHEART, we've tried to build something more nuanced.

Context-aware detection. Phoenix doesn't just pattern-match on keywords. It considers conversation history, escalation patterns, and linguistic markers that distinguish processing from crisis.

Graduated responses. Not every mention of difficulty triggers shutdown. Some trigger gentle check-ins. Some trigger offers of grounding tools. Only genuine crisis indicators trigger full intervention.

Continued presence. When we do intervene, we don't abandon. We provide resources AND remain available for grounding support. The worst thing you can do to someone in distress is disappear.

Clear communication. We tell users what's happening and why. "I noticed you mentioned something that sounds really difficult. I want to make sure you have the right support. Here are some resources, and I'm also here if you'd like to try some grounding exercises together."

The accountability gap

Here's what frustrates me most: the companies deploying crude guardrails get to claim they're "prioritising safety" while causing real harm through their bluntness.

Meanwhile, companies trying to build more sophisticated, actually-helpful systems face scrutiny because they don't simply refuse to engage with difficult topics.

We need a more mature conversation about what safety actually means. It's not just about avoiding liability. It's about genuinely helping people—including the ones having the hardest time.

Building better

If you're building AI systems that might encounter vulnerable users:

1. Don't just pattern-match on keywords. Invest in contextual understanding.

2. Create graduated responses. Not everything requires shutdown.

3. Never abandon. If you intervene, stay present in some capacity.

4. Communicate clearly. Tell users what's happening and why.

5. Test with lived experience. People who've been through crisis can tell you what actually helps.

Safety isn't a checkbox. It's a design philosophy. And the blunt instrument approach isn't good enough.

This is why we built Phoenix the way we did. Not to check a box, but to actually help.

Frank Ivors

Founder, NovaHEART

AI Safety

The teen suicide crisis and AI chatbots: What Congress is finally asking

9 min read

AI Safety

The audit trail you can't see is the one that will destroy you

8 min read

Why AI guardrails backfire at the worst possible moment

The blunt instrument problem

What the research tells us

The spectrum of distress

A more sophisticated approach

The accountability gap

Building better

Related Articles

The teen suicide crisis and AI chatbots: What Congress is finally asking

The audit trail you can't see is the one that will destroy you