For Policy’s Sake: Why We Must Distinguish AI Safety from AI Security in Regulatory Governance

Over the past few months, I’ve been conducting independent policy research focused on how AI safety concepts can be translated into AI regulation and policy governance.

I am a lawyer working in AI Governance for a large European corporation, and I provide feedback on regulatory initiatives from the European Commission. I do not have an ML background, which is precisely why I am here, seeking technical expertise.

In the context of this work, I’ve noticed recurring confusion around terminology, especially between what we mean by “AI Safety” vs. “AI Security”.

These terms often get blurred together in policy discourse, which can lead to misunderstandings and misplaced regulatory requirements.

MAIN ASK: For those of you with a technical background in AI Safety (engineering or research): please give me your take on how you believe these terms should be used!

This post outlines a simplified but actionable framework that I’ve found useful when mapping these concepts onto regulatory efforts (e.g., the EU AI Act).

Here is how I would classify what belongs under “AI Safety” policy efforts versus what should be treated as “AI Security” measures:

1. AI Safety: Protecting Humans from AI-generated Harms

Objective: Ensuring that AI systems behave in ways that avoid causing harm or unintended consequences to humans, society, or the environment.

What AI Safety Seeks to Protect: Human well-being, societal values, fundamental rights, and environmental integrity.

Core Concerns:

  • Alignment: Ensuring AI systems’ objectives and behaviors are in line with human intentions, values, and ethics.

  • Interpretability: Understanding how and why AI models reach their decisions, particularly through research avenues like mechanistic interpretability.

  • Preventing Catastrophic Failures: Anticipating and mitigating scenarios where AI could inadvertently cause large-scale harm.

  • Avoiding Unintended Behavior: Identifying and correcting subtle ways AI might deviate from intended purposes, even without malicious intent.

Examples of Typical Techniques:

  • Mechanistic Interpretability (think: feature steering, sparse autoencoders, or dictionary learning; a toy sketch follows after this list)

  • Reinforcement Learning from Human Feedback (RLHF)

  • Constitutional AI

  • Scalable oversight mechanisms for “human-in-the-loop” mandates.
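
Since I lack the hands-on background, here is only a rough, illustrative Python sketch of what I understand the dictionary-learning / sparse-autoencoder idea to involve: training a small autoencoder to rewrite a model’s internal activations as a sparse combination of (hopefully interpretable) features. All dimensions and coefficients are arbitrary placeholders, and corrections from practitioners are very welcome.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder ("dictionary learning") over a model's internal activations."""
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # mostly-zero, hopefully interpretable features
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    """Reconstruct activations faithfully while penalising dense feature use (L1 sparsity)."""
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Illustrative usage on random "activations" standing in for a real model's hidden states.
acts = torch.randn(64, 512)                     # batch of 64 activation vectors, width 512
sae = SparseAutoencoder(d_act=512, d_dict=4096)  # an overcomplete "dictionary" of 4096 features
recon, features = sae(acts)
print(sae_loss(recon, acts, features))
```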

Human Role: Humans as beneficiaries. AI Safety ensures AI remains beneficial and protective of human interests.

Core Question:
“Will this AI system unintentionally or intentionally harm me or others?”

Consequences if Safety Fails: Direct human harm ranging from physical injuries, misinformation, emotional manipulation, to potentially catastrophic societal risks.

Real-world Examples:

  • Preventing medical AI from giving harmful advice.

  • Ensuring autonomous vehicles don’t endanger pedestrians.

  • Avoiding algorithmic amplification of extremist content.

  • Preventing chatbots from offering harmful mental health advice.

  • Avoiding deceptive or manipulative behavior in goal-directed agents.

2. AI Security: Protecting AI from Malicious Human Actors

Objective: Defending AI systems and their data against intentional attacks, unauthorized access, theft, manipulation, or exploitation.

What AI Security Protects: The integrity of AI systems, their data, and intellectual property (e.g., model weights and proprietary algorithms).

Core Concerns:

  • Cybersecurity for AI: Protecting AI infrastructure from external attacks.

  • Adversarial Robustness: Defending AI systems against attacks specifically designed to mislead or deceive models.

  • Confidentiality and Information Security: Using techniques such as secure enclaves, encryption, differential privacy, and secure multiparty computation to protect sensitive data (a toy differential-privacy sketch follows directly below).
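
Of these, differential privacy is the easiest to sketch briefly. Below is a toy illustration of the Laplace mechanism, as I understand it: clip each record to a known range, compute the statistic, and add noise calibrated so that no single individual’s record can noticeably change the released result. The epsilon value, bounds, and data are arbitrary placeholders.

```python
import numpy as np

def dp_mean(values, epsilon, lower, upper, seed=None):
    """Release the mean of a sensitive dataset with epsilon-differential privacy.

    Values are clipped to [lower, upper], so one record can shift the mean by at most
    (upper - lower) / n; Laplace noise calibrated to that sensitivity masks any
    individual's contribution."""
    rng = np.random.default_rng(seed)
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(values.mean() + noise)

# Illustrative usage: a "private" average age released with a placeholder epsilon of 1.0.
print(dp_mean([34, 29, 41, 57, 38], epsilon=1.0, lower=18, upper=90))
```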

Examples of Typical Techniques:

  • Adversarial Robustness Training (see the sketch after this list)

  • Model Watermarking

  • Input Validation and Sanitization

  • Differential Privacy, Secure Multiparty Computation, Homomorphic Encryption.
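
To make “Adversarial Robustness Training” slightly more concrete, here is a rough sketch of one training step using the fast gradient sign method (FGSM), which I understand to be the textbook example: perturb the inputs in the direction that most increases the loss, then train the model on those perturbed inputs. The toy model, data, and epsilon are placeholders; real robust-training pipelines are considerably more sophisticated.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))  # toy classifier
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def adversarial_training_step(x: torch.Tensor, y: torch.Tensor, epsilon: float = 0.1):
    """Craft FGSM-perturbed inputs that increase the loss, then train on them."""
    # 1. Find the attack direction: the sign of the loss gradient w.r.t. the inputs.
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # 2. Train the model on the perturbed batch so it learns to resist the attack.
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage on random data standing in for a real dataset.
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
print(adversarial_training_step(x, y))
```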

Human Role: Humans here act as potential attackers, adversaries, or malicious users of the AI system.

Core Question:
“Can someone intentionally exploit, manipulate, or steal information from this AI system?”

Consequences if Security Fails:

  • Misuse or weaponization of AI systems by adversaries.

  • Breaches of proprietary data leading to competitive losses, confidentiality breaches, or indirect societal harms.

Real-world Examples:

  • Preventing model theft via API scraping or reverse-engineering

  • Defending facial recognition systems from adversarial patches or spoofing attacks

  • Stopping autonomous agents from self-replicating or self-exfiltrating (copying out their own code or weights)

  • Preventing jailbreaks of AI safety guardrails via obfuscation tricks

Where Safety and Security Intersect

It’s essential for regulatory AI Governance to acknowledge the overlap here: a security failure, such as an adversarial attack tricking a self-driving car into not recognizing pedestrians, is not just a security concern; it becomes an immediate safety issue causing direct human harm.

Yet, despite this overlap, the fundamental intentions behind these two fields differ:

  • AI Safety primarily addresses direct human harm caused by AI’s internal behavior.

  • AI Security focuses on external threats exploiting AI systems, which can then indirectly cause harm.

Why This Distinction Matters for AI Governance

AI governance, particularly in regulatory contexts like the EU AI Act, explicitly aims at safeguarding individuals from AI-related harm.

The European AI Act defines its purpose in Art. 1 as:

“Ensuring a high level of protection of health, safety, fundamental rights […] against the harmful effects of AI systems.”

Given this objective, I believe that regulatory frameworks should explicitly incorporate and incentivize not only AI Security but also AI Safety research, including alignment, interpretability, and control.

In the past, I’ve described these three areas to others as follows, so feel free to interject:

  • Alignment ensures AI outputs genuinely reflect human intentions and ethical standards. Without alignment, even secure systems might produce harmful outcomes.

  • Interpretability helps us directly investigate how AI models reason internally, allowing us to audit and improve alignment, beyond merely documenting outputs.

  • Control helps prevent models from producing unintended harmful behaviors in the first place.

Connecting AI Safety to Specific AI Act Provisions

To ground this in existing regulatory language, I will list a few provisions of the EU AI Act where AI Safety (rather than “security”) needs to be kept in mind:

Art. 13, Art. 14, and Art. 15 of the European AI Act (relevant to AI Safety)

Art. 13 - Transparency Obligations:

“High-risk AI systems shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system’s output and use it appropriately.”

This provision isn’t just about documentation; it calls for meaningful transparency into model behavior.

While traditional “explainability” tools (e.g., SHAP, LIME) offer surface-level insights, mechanistic interpretability aims to go further: it investigates the internal reasoning structures of the model (circuits, attention heads, representations) to explain why a model behaved a certain way, not just what it did.

“Sufficient transparency” is currently undefined. Without standards that include interpretability research, this requirement risks being satisfied by shallow explainability: presenting outputs with plausible reasoning, without surfacing the actual mechanisms behind them.
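
To illustrate what I mean by surface-level explainability, here is a toy perturbation-style attribution sketch in the spirit of (but much simpler than) tools like SHAP or LIME: it reports how sensitive a prediction is to each input, while the model’s internal computation remains a black box. The stand-in linear model and inputs are placeholders.

```python
import numpy as np

def occlusion_attribution(predict, x: np.ndarray, baseline: float = 0.0) -> np.ndarray:
    """Post-hoc, black-box attribution: how much does the prediction change
    when each input feature is replaced by a baseline value?"""
    base_pred = predict(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_occluded = x.copy()
        x_occluded[i] = baseline
        scores[i] = base_pred - predict(x_occluded)
    return scores

# Stand-in "model": a fixed linear scorer (its internals stay a black box to the method).
weights = np.array([0.5, -2.0, 0.0, 1.5])
predict = lambda x: float(weights @ x)

print(occlusion_attribution(predict, np.array([1.0, 1.0, 1.0, 1.0])))
# Output recovers input sensitivities, not the mechanism that produced them.
```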

Art. 14 - Human Oversight:

“Human oversight shall aim to prevent or minimise the risks to health, safety or fundamental rights… in particular where such risks persist despite the application of other requirements.”

Human oversight is about preventing AI from causing harm despite compliance with other measures.

But for the underlying objective of minimising risks to health, safety, or fundamental rights, we still need model alignment: ensuring that systems produce outcomes consistent with human intent and values. It also hinges on interpretability, because oversight without insight is just observation.

“Such measures shall enable the oversight person to… correctly interpret the high-risk AI system’s output, taking into account, for example, the interpretation tools and methods available.”

Oversight is not just about who is watching, but how they’re empowered to understand and intervene. This refers directly to interpretability and control tools: methods that help humans not only interpret outputs, but intervene when the system behaves unexpectedly. Alignment research (e.g., RLHF, Constitutional AI) is foundational here, as are control techniques like steering via reward models, logit regularization / activation steering, or rejection sampling / output filtering (a toy sketch of the latter follows below).
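
As a very rough illustration of the last of those control techniques, here is a toy rejection-sampling / output-filtering sketch. The generate and safety_score functions are hypothetical stand-ins for a real model and a real safety classifier, and the threshold is an arbitrary placeholder; the point is only the control pattern: sample several candidates, discard those the scorer rejects, and escalate to a human when nothing passes.

```python
import random
from typing import Callable, Optional

def filtered_generate(generate: Callable[[str], str],
                      safety_score: Callable[[str], float],
                      prompt: str,
                      n_samples: int = 8,
                      threshold: float = 0.8) -> Optional[str]:
    """Rejection sampling / output filtering: draw several candidate answers,
    drop those the safety scorer rejects, and return the highest-scoring survivor.
    Returning None signals that the case should be escalated to a human overseer."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    scored = [(safety_score(c), c) for c in candidates]
    accepted = [sc for sc in scored if sc[0] >= threshold]
    return max(accepted)[1] if accepted else None

# Illustrative usage with hypothetical stand-ins for a model and a safety classifier.
fake_generate = lambda prompt: random.choice(["safe answer", "borderline answer", "unsafe answer"])
fake_safety_score = lambda text: {"safe answer": 0.95, "borderline answer": 0.6, "unsafe answer": 0.1}[text]

print(filtered_generate(fake_generate, fake_safety_score, "example prompt"))
```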

Art. 15 - Accuracy, Robustness, and Cybersecurity:

These are the core AI Security obligations: they explicitly cover adversarial robustness, cybersecurity, and data protection, all critical for preventing external exploitation or manipulation of AI systems.

Also relevant: Art. 55’s model evaluation obligations.

Why This Matters for Governance Translation

The distinctions I’ve outlined aren’t just semantics. I find them useful when thinking about governance frameworks and the allocation of responsibilities among stakeholders:

When looking at a given problem and possible solutions, are we emphasizing the intended, positive outcome that we expect for humans (Safety), or are we focusing on the integrity of the AI system and the potential confidentiality tradeoffs arising from model auditing (Security)?

Misunderstanding these terms leads to confusion and misplaced regulatory expectations, ultimately reducing the effectiveness of governance efforts.

Why This Distinction Might Feel Artificial or Limiting

As I present this simplified framework distinguishing AI Safety from AI Security, I anticipate (and welcome!) pushback, particularly from ML engineers and researchers.

Some valid critiques might include:

  • Overlapping Realities: In practical engineering, the distinction between safety and security often blurs. For instance, adversarial robustness (typically categorized as security) directly impacts safety, making strict categorization feel artificial or overly simplistic.

  • Operational Constraints: Engineers might argue that, in reality, teams work simultaneously on security and safety. For example, an engineer improving model robustness might concurrently address alignment concerns, challenging the notion of separate domains.

  • Risk of Silos: Creating rigid conceptual distinctions could inadvertently reinforce organizational silos, potentially hindering interdisciplinary collaboration that is crucial for addressing complex AI risks.

  • Terminological Confusion: Some may find that introducing yet another set of distinctions adds to confusion rather than resolving it, particularly given the diverse usage of these terms across academia, policy circles, and industry.

We know that the complexities of real-world engineering and research rarely fit neatly into conceptual categories. My intention isn’t to ignore these overlaps or nuances.

My goal is simply to provide clarity and structure that supports policy-makers, regulatory professionals, and enterprise risk experts in translating technical insights into effective governance.

So, I am asking engineers, researchers, and security specialists to please challenge, critique, and refine this framework.

  • How might your practical experience refine these conceptual boundaries in ways that are both accurate and actionable?

  • Are there better ways to frame these distinctions that more authentically capture your day-to-day realities while still meeting governance needs?

The last thing we need is for policy and AI governance to impose arbitrary distinctions.

But for governance to work, we need ways to bring key safety and security breakthroughs into policy, and that means understanding which research is relevant to which regulatory goal.

Your expertise is essential to get this right, so thank you in advance for any feedback you can provide!