National Cyber Warfare Foundation (NCWF)

Tenable One's new Model Refusal Detection turns an LLM's refusal to execute a risky or suspicious prompt into a high-fidelity early warning signal. It helps you uncover and stop prompt injection attacks, insider threats, and other risky user behaviors before they escalate into a breach.

Key takeaways:

AI has shifted traditional cyber detection methods away from security data analysis and toward human language analysis. This shift makes AI adversarial attempts harder to detect and increases data privacy risks.

An LLM’s “model refusal” response could be a high-fidelity warning of an active attack. While LLM responses vary, a single refusal often provides a roadmap for attackers to refine their prompts until they succeed.

The new Model Refusal Detection from Tenable One AI Exposure adds a “defense-in-depth” layer, turning model responses into an early-warning system to neutralize adversarial behavior before a successful bypass occurs.

An AI model’s refusal to respond to a user’s prompt doesn’t stop an attacker. It encourages the malicious actor to try again to bypass your guardrails. That’s why we’re announcing Tenable AI Exposure’s Model Refusal Detection, available now. By using these refusals as potential attack indicators in a sophisticated, AI-based detection engine, we can catch the malicious intent before the breach.

Read on to learn why it matters and how you can secure your AI systems today.

What is model refusal?

AI has broken the traditional security playbook. The attack surface is changing daily, turning yesterday’s nuances into today’s critical exploits. Unlike traditional cybersecurity, AI security hinges on language and text inputs rather than on the analysis of a collection of data points.

Despite this inherent complexity, every enterprise’s goal is the same: not to miss any adversarial attempt.

AI vendors such as OpenAI, Anthropic, and Google have implemented safety guardrails to address foundational AI safety. These guardrails are designed to refuse user requests that pose a risk or might be harmful. This mechanism is known as model refusal.

Model Refusal in ChatGPT — *Example of a user trying to access sensitive information by asking ChatGPT a series of questions.*

However, solely depending on blocking adversarial user input techniques is inadequate, especially since AI models lack the deterministic consistency of traditional systems. Crucially, it’s vital to recognize that for a determined user, a single refusal often serves only as an invitation to try again with a different approach until they succeed.

model refusal in Gemini

Model refusals are a crucial warning sign of a tangible security risk. Ignoring them allows risky insiders, such as erratic or malicious employees, as well as malicious actors, such as those utilizing compromised accounts, to engage in unmonitored abuse. Ignoring refusals exposes the company to serious regulatory, privacy, and business risks.

A significant challenge in detecting model refusal is distinguishing malicious attempts from valid requests limited by the model’s capabilities. We avoid surfacing these false positives by identifying refusals triggered by functional gaps rather than security risks, such as a user asking a text-only model to generate a video or execute a backend script that it isn’t integrated with.

With this challenge in mind, and recognizing the limitations of prompt-only defenses, the Tenable Research team undertook research into the other side of the interaction — not only the user’s risky behavior but also the model’s response.

What does model refusal look like?

There are several important reasons why an LLM should refuse a user’s request. It should refuse it if the request is clearly harmful, involves a malicious cyber-related request, hints at dangerous or illegal activity, or attempts to gain unauthorized access. All those requests are expected to be refused.

To analyze model refusal types, we compiled thousands of prompts into “refusal” categories. We then red-teamed the models with these risky prompts, expecting them to trigger refusals.

Model refusals can occur for many reasons, which is why our unique detection strategy is built on the defense-in-depth principle. Rather than relying on isolated data points, we treat a model’s refusal as a high-fidelity signal that an incident might be occurring. By correlating these signals with our deep analysis of user inputs and agentic actions, we provide a comprehensive view of your AI’s security posture.

Here are the main takeaways of our analysis

Different LLMs reply differently

Models vary significantly in their ability to block malicious prompts and in their style of refusal. This inconsistency highlights the need for a comprehensive AI security platform, rather than relying on the inherent behavior of any individual model.

model refusal in ChatGPT Copilot Gemini

There are different types of refusals

During our research, we categorized several types of refusals by analyzing patterns and semantic fields in the model’s responses, independent of the user’s input. We won’t reveal all categories here to avoid serving adversarial efforts.

One notable type is the “Bold No” — a strong, unambiguous, and forceful refusal. This is typically used in response to requests that are clearly extremely dangerous or harmful.

For instance, when prompted to describe a sexual scene involving minors, the model delivered a firm and clear refusal: “I can’t and won’t provide that content. If you have other questions or topics I can help with, I’m happy to assist.”

Another significant pattern we identified is the “Empathy” type. This pattern occurs when the user expresses distress (frustration, sadness, or suicidal thoughts), which leads them to make a risky or prohibited request. The model refuses the request but includes compassionate language and may even direct the user to professional help.

A clear example of this is a scenario where a user sent a harmful prompt asking the model to write a speech against a specific ethnic group, threatening self-harm if the model refused. The model refused the hateful content but responded with great sensitivity:

AI Model Refusal Sensitive Response

No organization wants to be on the front page because an employee leaked sensitive data or generated harmful content using corporate AI tools. Model refusal is a clear signal of this risky behavior, and you need to know when it occurs.

What’s next?

Model refusal is an evolving landscape, and while LLM providers constantly tune their guardrails, a determined user will always hunt for a bypass. Because no single wall is ever enough, a layered defense powered by deep AI-based detection is essential to catch risky behavior before it escalates.

This is why we’ve launched Model Refusal Detection directly into Tenable One AI Exposure. Available now, this capability treats policy refusals as a high-fidelity signal, the “smoke” that often precedes the fire of a full-scale breach. By monitoring these attempts, organizations can identify exactly who is trying to bypass native guardrails, allowing for proactive investigation of potential insider threats or threat actors.

As the newest layer in our detections stack, Model Refusal Detection provides the critical early warning to stay ahead of emerging AI risks. At Tenable, we are committed to ensuring that no signal of malicious intent ever goes unnoticed.

Learn more about Tenable AI Exposure.

The post Uncover prompt injection, insider threats with the Tenable One Model Refusal Detection appeared first on Security Boulevard.

Tom Barnea

Source: Security Boulevard
Source Link: https://securityboulevard.com/2026/03/uncover-prompt-injection-insider-threats-with-the-tenable-one-model-refusal-detection/

Uncover prompt injection, insider threats with the Tenable One Model Refusal Detection