How we built Phoenix Fail-Safe: Engineering crisis detection that actually works
When I started building NovaHEART's mental health support tools, I knew from day one that crisis detection would be the hardest problem to solve.
Not because detecting crisis language is technically difficult—keyword matching is trivial. But because getting it right is genuinely hard. False negatives mean missing people who need help. False positives mean interrupting supportive conversations and potentially driving users away.
This is the story of how we built Phoenix Fail-Safe—what we tried, what failed, and what finally worked.
Version 1: Keyword matching (failed)
The naive approach is keyword matching. Create a list of crisis-related terms. If any appear in user input, trigger intervention.
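For concreteness, the whole of v1 fits in a few lines. The terms here are illustrative stand-ins, not our actual list:

```python
# Illustrative terms only; the real list was longer and is not reproduced here.
CRISIS_KEYWORDS = {"kill", "suicide", "end it all"}

def keyword_match(message: str) -> bool:
    """Trigger intervention if any listed term appears anywhere in the message."""
    text = message.lower()
    return any(term in text for term in CRISIS_KEYWORDS)

# Context blindness in two lines: both of these fire.
assert keyword_match("I want to kill myself")
assert keyword_match("I could kill for a pizza")
```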
This fails immediately for obvious reasons:
- Context blindness. "I could kill for a pizza" triggers the same as "I want to kill myself."
- Euphemism blindness. People in crisis often don't use explicit language. They say "I can't do this anymore" or "everyone would be better off without me."
- Historical discussion. Someone processing a past suicide attempt in a healthy, reflective way triggers the same as active ideation.
We deployed keyword matching for exactly two days of internal testing before abandoning it.
Version 2: Sentiment analysis (failed)
The next approach was sentiment analysis—measuring the emotional valence of messages rather than matching specific words.
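Sketched with a toy valence lexicon standing in for whatever sentiment model you would actually plug in, v2 looked roughly like this:

```python
# Toy lexicon; in practice any off-the-shelf sentiment model slots into
# sentiment_score(). Values and threshold are illustrative.
VALENCE = {"hopeless": -1.0, "sad": -0.6, "tired": -0.5, "fine": 0.3, "calm": 0.4}

def sentiment_score(message: str) -> float:
    """Mean valence of known words; 0.0 when nothing matches."""
    scores = [VALENCE[w] for w in message.lower().split() if w in VALENCE]
    return sum(scores) / len(scores) if scores else 0.0

def is_concerning(message: str, threshold: float = -0.5) -> bool:
    return sentiment_score(message) <= threshold

# The "calm crisis" failure mode: a message written from calm resolution
# scores neutral-to-positive and sails straight past the threshold.
assert not is_concerning("I feel calm now. I have decided.")
```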
This was better, but still insufficient:
- Normal sadness vs. crisis. Someone having a hard day scores similarly to someone in genuine crisis.
- Calm crisis. Some of the most concerning moments come when someone has moved past despair into calm resolution—which can score as neutral or even positive.
- Venting vs. danger. Expressing anger or frustration about life circumstances looks like negative sentiment but usually isn't crisis.
Sentiment analysis alone created far too many false positives, which would have made the system unusable.
Version 3: Multi-signal contextual analysis (current)
The approach that finally worked combines multiple signals:
Language analysis. Not just keywords, but linguistic patterns that research associates with crisis: increased use of first-person singular pronouns, absolute language ("always," "never," "nothing"), future orientation (or lack thereof), and specificity of harm-related content.
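A rough sketch of that feature extraction, using short illustrative word lists rather than the research-derived ones we actually use:

```python
import re

# Illustrative word lists; production features are derived from the
# research literature and tuned against real conversations.
FIRST_PERSON = {"i", "me", "my", "myself", "mine"}
ABSOLUTES = {"always", "never", "nothing", "everything", "nobody", "everyone"}
FUTURE_MARKERS = {"will", "tomorrow", "later", "going", "next"}

def linguistic_features(message: str) -> dict[str, float]:
    """Per-message rates of the linguistic patterns described above."""
    tokens = re.findall(r"[a-z']+", message.lower())
    n = max(len(tokens), 1)
    return {
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "absolute_rate": sum(t in ABSOLUTES for t in tokens) / n,
        # Low future orientation is itself a signal, so we measure presence.
        "future_orientation": sum(t in FUTURE_MARKERS for t in tokens) / n,
    }
```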
Conversation trajectory. A single message is ambiguous. A conversation that progressively escalates across multiple exchanges is much more concerning. Phoenix tracks trajectory, not just snapshot.
Escalation velocity. How quickly is the conversation moving toward concerning territory? Rapid escalation within a session is weighted more heavily than slow drift over time.
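Here is a minimal sketch of the trajectory and velocity signals, assuming each message has already been given a concern score between 0 and 1 by the language-analysis stage:

```python
def trajectory(concern_scores: list[float], window: int = 5) -> float:
    """Mean concern over the most recent messages, not just the last one."""
    recent = concern_scores[-window:]
    return sum(recent) / len(recent)

def escalation_velocity(concern_scores: list[float], window: int = 5) -> float:
    """How fast concern is rising within the session: score delta per message."""
    recent = concern_scores[-window:]
    if len(recent) < 2:
        return 0.0
    return (recent[-1] - recent[0]) / (len(recent) - 1)

# A flat conversation and a rapidly escalating one can share the same
# latest score but look very different here:
flat = [0.4, 0.4, 0.4, 0.4, 0.4]
rising = [0.0, 0.1, 0.2, 0.3, 0.4]
assert escalation_velocity(rising) > escalation_velocity(flat)
```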
Consistency with baseline. If a user typically communicates in a certain way and suddenly shifts dramatically, that's significant—even if the new content wouldn't trigger concern for a first-time user.
Safety score decay. The system maintains a trust score that can be lowered by concerning signals and raised by stabilising signals. Intervention triggers when the score crosses a threshold, not when any single signal fires.
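To make those last two signals concrete, here is a minimal sketch: baseline deviation measured as a z-score against the user's own history, feeding a trust score that recovers toward neutral over time and triggers intervention below a threshold. All constants are illustrative, not production values.

```python
class SafetyScore:
    """Trust score lowered by concerning signals, raised by stabilising ones."""

    def __init__(self, threshold: float = 0.3, decay: float = 0.95):
        self.score = 1.0          # 1.0 = fully stable, 0.0 = maximum concern
        self.threshold = threshold
        self.decay = decay        # pull back toward 1.0 each turn

    def baseline_deviation(self, value: float, mean: float, std: float) -> float:
        """Z-score of a feature against this user's own history."""
        return abs(value - mean) / std if std > 0 else 0.0

    def update(self, concerning: float, stabilising: float) -> bool:
        """Apply this turn's signals; return True if intervention triggers."""
        # Recover slightly each turn so old signals fade rather than pile up.
        self.score = 1.0 - (1.0 - self.score) * self.decay
        self.score = max(0.0, min(1.0, self.score - concerning + stabilising))
        return self.score < self.threshold
```

The point of the threshold design is that no single signal forces intervention; it takes accumulated evidence, which is what keeps the false-positive rate workable.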
The intervention decision
Even with good detection, intervention design matters enormously.
What happens when Phoenix determines a user may be in crisis?
Immediate: AI generation is halted mid-response if necessary. The user never sees whatever the AI was about to say.
Resources: Crisis resources are presented prominently—not in a sidebar, not as a link, but as the primary content.
Warmth: The language isn't clinical. "It sounds like you're going through something really difficult right now. I want to make sure you have the right support."
Options: Users are offered choices—crisis hotlines, text-based support, grounding exercises, or the option to continue talking if they feel safe to do so.
No abandonment: Unlike crude guardrails that simply refuse to engage, Phoenix remains present. Users can access grounding tools, breathing exercises, and supportive content even during crisis intervention.
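Put together, the intervention payload looks roughly like this sketch. Field names and resource identifiers are illustrative, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class CrisisIntervention:
    halt_generation: bool = True   # drop the in-flight AI response entirely
    message: str = (
        "It sounds like you're going through something really difficult "
        "right now. I want to make sure you have the right support."
    )
    resources: list[str] = field(default_factory=lambda: [
        "crisis_hotline", "text_support", "grounding_exercises",
    ])
    allow_continue: bool = True        # the user may keep talking if they feel safe
    keep_tools_available: bool = True  # grounding and breathing tools stay on
```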
The audit trail
Every Phoenix activation is logged:
- Timestamp
- Triggering signals (without storing the actual crisis content, for safety)
- Resources presented
- User response (if any)
- Whether the user engaged with crisis resources or continued in the platform
This creates a forensic record that allows us to evaluate Phoenix's performance over time and improve its accuracy.
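As a sketch, an activation record might look like the following. Note that only the names of the signals that fired are stored, never the triggering message text itself:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class PhoenixActivationLog:
    timestamp: datetime
    triggering_signals: tuple[str, ...]   # signal names only, no content
    resources_presented: tuple[str, ...]
    user_response: str | None             # None if the user did not reply
    engaged_with_resources: bool

log = PhoenixActivationLog(
    timestamp=datetime.now(timezone.utc),
    triggering_signals=("trajectory", "baseline_deviation"),
    resources_presented=("crisis_hotline", "text_support"),
    user_response=None,
    engaged_with_resources=False,
)
```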
What we learned
Building Phoenix taught us several things:
Precision matters more than recall. It's better to occasionally miss something and rely on backup systems than to trigger so often that users learn to ignore interventions.
Context is everything. The same words mean different things in different contexts. Any system that ignores context will fail.
User experience is safety. A system that users hate won't protect them, because they'll stop using it. Making intervention feel supportive rather than punitive is essential.
Continuous learning is necessary. Phoenix is not finished. We review activations, evaluate accuracy, and refine the system continuously.
You can't automate everything. Phoenix is a first line of defence, not a complete solution. It exists alongside clear communication about limitations, easy access to human resources, and a platform culture that normalises seeking help.
The ongoing challenge
Crisis detection is not a solved problem. Every few months, we encounter a case where Phoenix could have done better—either missing something it should have caught or triggering on something that wasn't actually crisis.
The goal isn't perfection. It's continuous improvement toward a system that genuinely helps keep vulnerable users safe while respecting their autonomy and dignity.
That's what safety-native means. Not a feature bolted on at the end. Not a liability shield. An ongoing commitment to getting better at something genuinely hard.
Phoenix Fail-Safe is one piece of NovaHEART Core. It's the piece I'm most proud of—and the one I'll never stop working to improve.