For Policy’s Sake: Why We Must Distinguish AI Safety from AI Security in Regulatory Governance
Over the past few months, I’ve been conducting independent policy research focused on how AI safety concepts can be translated into AI regulation and policy governance.
I am a lawyer working in AI Governance for a large European corporation, and I am providing feedback on regulatory initiatives from the European Commission. I do not have an ML background, which is precisely why I am here, seeking technical expertise.
In the context of this work, I’ve noticed recurring confusion around terminology, especially between what we mean by “AI Safety” vs. “AI Security”.
These terms often get blurred together in policy discourse, which can lead to misunderstandings and misplaced regulatory requirements.
MAIN ASK: For those of you with a technical background in AI Safety (engineering or research): Please, give me your take on how you believe these terms should be used!
This post outlines a simplified but actionable framework that I’ve found useful when mapping these concepts onto regulatory efforts (e.g., the EU AI Act).
Here is how I would classify what belongs under “AI Safety” policy efforts vs what should be seen as “AI Security” measures:
1. AI Safety: Protecting Humans from AI-generated Harms
Objective: Ensuring that AI systems behave in ways that avoid causing harm or unintended consequences to humans, society, or the environment.
What AI Safety Seeks to Protect: Human well-being, societal values, fundamental rights, and environmental integrity.
Core Concerns:
Alignment: Ensuring AI systems’ objectives and behaviors are in line with human intentions, values, and ethics.
Interpretability: Understanding how and why AI models reach their decisions, particularly through research avenues like mechanistic interpretability.
Preventing Catastrophic Failures: Anticipating and mitigating scenarios where AI could inadvertently cause large-scale harm.
Avoiding Unintended Behavior: Identifying and correcting subtle ways AI might deviate from intended purposes, even without malicious intent.
Examples of Typical Techniques:
Mechanistic Interpretability (think: feature steering, sparse autoencoders, or dictionary learning)
Reinforcement Learning from Human Feedback (RLHF), sketched briefly after this list
Constitutional AI
Scalable oversight mechanisms for “human in the loop” mandates.
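For non-ML readers, here is a minimal, hedged sketch of the reward-modelling step at the heart of RLHF: a separate reward model is trained to score the response a human preferred above the one they rejected, and those scores then guide further fine-tuning of the main model. All names and numbers below are illustrative, not any particular library’s API.

```python
# Minimal sketch of the RLHF reward-modelling step (illustrative only).
# A reward model is trained so that human-preferred responses score higher
# than rejected ones; the resulting scores later guide policy fine-tuning.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen response's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a (hypothetical) reward model assigned to two prompt/response pairs.
chosen = torch.tensor([1.2, 0.3])    # scores for the human-preferred answers
rejected = torch.tensor([0.1, 0.5])  # scores for the rejected answers
print(preference_loss(chosen, rejected))  # lower loss = preferences better respected
```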
Human Role: Humans as beneficiaries. AI Safety ensures AI remains beneficial and protective of human interests.
Core Question:
“Will this AI system unintentionally or intentionally harm me or others?”
Consequences if Safety Fails: Direct human harm, ranging from physical injury, misinformation, and emotional manipulation to potentially catastrophic societal risks.
Real-world Examples:
Preventing medical AI from giving harmful advice.
Ensuring autonomous vehicles don’t endanger pedestrians.
Avoiding algorithmic amplification of extremist content.
Preventing chatbots from offering harmful mental health advice.
Avoiding deceptive or manipulative behavior in goal-directed agents.
2. AI Security: Protecting AI from Malicious Human Actors
Objective: Defending AI systems and their data against intentional attacks, unauthorized access, theft, manipulation, or exploitation.
What AI Security Protects: The integrity of AI systems, their data, and intellectual property (e.g., model weights and proprietary algorithms).
Core Concerns:
Cybersecurity for AI: Protecting AI infrastructure from external attacks.
Adversarial Robustness: Defending AI systems against attacks specifically designed to mislead or deceive models.
Confidentiality and information security: Using techniques such as secure enclaves, encryption, differential privacy, and secure multiparty computation to protect sensitive data.
Examples of Typical Techniques:
Adversarial Robustness Training (a minimal sketch follows this list)
Model Watermarking
Input Validation and Sanitization
Differential Privacy, Secure Multiparty Computation, Homomorphic Encryption.
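To make the first item above concrete: adversarial robustness training typically means crafting small, worst-case perturbations of the input during training and teaching the model to handle them correctly anyway. Below is a minimal, FGSM-style sketch in PyTorch; the model, data, and optimizer are placeholders, and production systems use considerably more sophisticated attacks and defenses.

```python
# Minimal sketch of adversarial robustness training (FGSM-style, illustrative only).
# `model`, `x`, `y`, and `optimizer` are assumed to exist; `epsilon` bounds the perturbation.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, epsilon=0.03):
    # 1. Compute the gradient of the loss with respect to the input itself.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()

    # 2. Build a worst-case perturbation within an epsilon budget (FGSM).
    x_adv = (x + epsilon * x.grad.sign()).detach()

    # 3. Train the model to classify the perturbed input correctly.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```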
Human Role: Humans here act as potential attackers, adversaries, or malicious users of the AI system.
Core Question:
“Can someone intentionally exploit, manipulate, or steal information from this AI system?”
Consequences if Security Fails:
Misuse or weaponization of AI systems by adversaries.
Breaches of proprietary data leading to competitive losses, confidentiality breaches, or indirect societal harms.
Real-world Examples:
Preventing model theft via API scraping or reverse-engineering
Defending facial recognition systems from adversarial patches or spoofing attacks
Stopping autonomous agents from self-replicating or exfiltrating their own code and weights
Preventing jailbreaks of AI safety guardrails via obfuscation tricks
Where Safety and Security Intersect
It’s essential for regulatory AI Governance to acknowledge the overlap here: a security failure, such as an adversarial attack tricking a self-driving car into not recognizing pedestrians, is not just a security concern; it can become an immediate safety issue causing direct human harm.
Yet, despite this overlap, the fundamental intentions behind these two fields differ:
AI Safety primarily addresses direct human harm caused by AI’s internal behavior.
AI Security focuses on external threats exploiting AI systems, which can then indirectly cause harm.
Why This Distinction Matters for AI Governance
AI governance, particularly in regulatory contexts like the EU AI Act, explicitly aims at safeguarding individuals from AI-related harm.
The European AI Act defines its purpose in Art. 1 as:
“Ensuring a high level of protection of health, safety, fundamental rights […] against the harmful effects of AI systems.”
Given this objective, I believe that regulatory frameworks should explicitly incorporate and incentivize not only AI Security but also AI Safety research, including alignment, interpretability, and control.
In the past, I’ve described these three areas to others as follows, so feel free to interject:
Alignment ensures AI outputs genuinely reflect human intentions and ethical standards. Without alignment, even secure systems might produce harmful outcomes.
Interpretability helps us directly investigate how AI models reason internally, allowing us to audit and improve alignment, beyond merely documenting outputs.
Control helps prevent models from producing unintended harmful behaviors in the first place.
Connecting AI Safety to Specific AI Act Provisions
To ground this in existing regulatory language, I will list a few provisions of the EU AI Act where AI Safety (rather than “security”) needs to be kept in mind:
Art. 13, Art. 14, and Art. 15 of the European AI Act (relevant to AI Safety)
Art. 13 - Transparency Obligations:
“High-risk AI systems shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system’s output and use it appropriately.”
This provision isn’t just about documentation; it calls for meaningful transparency into model behavior.
While traditional “explainability” tools (e.g., SHAP, LIME) offer surface-level insights, mechanistic interpretability aims to go further: it investigates the internal reasoning structures of the model (circuits, attention heads, representations) to explain why a model behaved a certain way, not just what it did.
“Sufficient transparency” is currently undefined. Without standards that include interpretability research, this requirement risks being satisfied by shallow explainability: presenting outputs with plausible reasoning, without surfacing the actual mechanisms behind them.
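For readers less familiar with the distinction: post-hoc explainability scores which input features mattered, while mechanistic interpretability tries to decompose the model’s internal activations into pieces a human can inspect. As a heavily simplified, purely illustrative sketch of one such tool, here is the core of a sparse autoencoder over a model’s hidden activations (the dimensions and names are assumptions, not any particular library’s API):

```python
# Heavily simplified sparse autoencoder over hidden activations (illustrative only).
# The idea: decompose an opaque activation vector into a sparse set of features
# that researchers can then try to label, audit, or steer.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim=768, dict_size=8192):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)  # activations -> feature space
        self.decoder = nn.Linear(dict_size, activation_dim)  # feature space -> reconstruction

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))     # sparse, non-negative features
        return features, self.decoder(features)

sae = SparseAutoencoder()
hidden = torch.randn(1, 768)            # stand-in for an internal model activation
features, reconstruction = sae(hidden)
# Training would minimise reconstruction error plus an L1 penalty on `features`
# to encourage sparsity; the resulting features are what interpretability work inspects.
```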
Art. 14 - Human Oversight:
“Human oversight shall aim to prevent or minimise the risks to health, safety or fundamental rights… in particular where such risks persist despite the application of other requirements.”
Human oversight is about preventing AI from causing harm despite compliance with other measures.
But for the underlying objective of minimising risks to health, safety or fundamental rights, we still need model alignment: ensuring that systems produce outcomes consistent with human intent and values. It also hinges on interpretability, because oversight without insight is just observation.
“Such measures shall enable the oversight person to… correctly interpret the high-risk AI system’s output, taking into account, for example, the interpretation tools and methods available.”
Oversight is not just about who is watching, but how they’re empowered to understand and intervene. This refers directly to interpretability and control tools: methods that help humans not only interpret outputs, but intervene when the system behaves unexpectedly. Alignment research (e.g., RLHF, Constitutional AI) is foundational here, as are control techniques like steering via reward models, logit regularization / activation steering, or rejection sampling / output filtering. A minimal sketch of the last of these follows.
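A deliberately simplified example of rejection sampling / output filtering as a control measure: generate several candidate responses, score them with a separate safety or reward model, and only release a candidate that clears a threshold. The generator and scorer below are placeholder functions standing in for real models, not real APIs.

```python
# Minimal sketch of rejection sampling / output filtering (illustrative only).
# `generate` and `safety_score` are placeholders for a language model and a
# separately trained safety/reward classifier.
import random

def generate(prompt: str) -> str:
    # Placeholder: a real system would sample from a language model here.
    return random.choice(["helpful answer", "borderline answer", "harmful answer"])

def safety_score(text: str) -> float:
    # Placeholder: a real system would use a trained safety or reward model.
    return {"helpful answer": 0.9, "borderline answer": 0.5, "harmful answer": 0.1}[text]

def filtered_response(prompt: str, n_candidates: int = 8, threshold: float = 0.8) -> str:
    candidates = [generate(prompt) for _ in range(n_candidates)]
    best = max(candidates, key=safety_score)
    # Release the best candidate only if it clears the safety threshold.
    return best if safety_score(best) >= threshold else "Sorry, I can't help with that."

print(filtered_response("example prompt"))
```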
Art. 15 - Accuracy, Robustness, and Cybersecurity:
Now, these are the core AI Security obligations explicitly covering adversarial robustness, cybersecurity, and data protection, critical for preventing external exploitation or manipulation of AI systems.
Also relevant: Art. 55’s model evaluation obligations.
Why This Matters for Governance Translation
The distinctions I’ve outlined aren’t just semantics. I find them useful when thinking about governance frameworks and the allocation of responsibilities among stakeholders:
When looking at a given problem and possible solutions, are we emphasizing the intended, positive outcome we expect for humans (Safety), or are we focusing on the integrity of the AI system and the potential confidentiality tradeoffs arising from model auditing (Security)?
Misunderstanding these terms leads to confusion and misplaced regulatory expectations, ultimately reducing the effectiveness of governance efforts.
Why This Distinction Might Feel Artificial or Limiting
As I present this simplified framework distinguishing AI Safety from AI Security, I anticipate (and welcome!) pushback, particularly from ML engineers and researchers.
Some valid critiques might include:
Overlapping Realities: In practical engineering, the distinction between safety and security often blurs. For instance, adversarial robustness (typically categorized as security) directly impacts safety, making strict categorization feel artificial or overly simplistic.
Operational Constraints: Engineers might argue that, in reality, teams work simultaneously on security and safety. For example, an engineer improving model robustness might concurrently address alignment concerns, challenging the notion of separate domains.
Risk of Silos: Creating rigid conceptual distinctions could inadvertently reinforce organizational silos, potentially hindering interdisciplinary collaboration that is crucial for addressing complex AI risks.
Terminological Confusion: Some may find that introducing yet another set of distinctions adds to confusion rather than resolving it, particularly given the diverse usage of these terms across academia, policy circles, and industry.
We know that the complexities of real-world engineering and research rarely fit neatly into conceptual categories. My intention with this isn’t to ignore these overlaps or nuances.
My goal is simply to provide clarity and structure that supports policy-makers, regulatory professionals, and enterprise risk experts in translating technical insights into effective governance.
So, I am asking engineers, researchers, and security specialists to please challenge, critique, and refine this framework.
How might your practical experience refine these conceptual boundaries in ways that are both accurate and actionable?
Are there better ways to frame these distinctions that more authentically capture your day-to-day realities while still meeting governance needs?
The last thing we need is for Policy and AI governance to impose arbitrary distinctions.
But for governance to work, we need ways to bring key safety and security breakthroughs into policy, and that means understanding which research is relevant to which regulatory goal.
Your expertise is essential to get this right, so thank you in advance for any feedback you can provide!
Alright, this is the second time now. What am I doing wrong, LessWrongers? :/
I will bite.
First of all, I appreciate the effort of trying to communicate better and to hammer down the neat borders of words and how they are used across domains, especially for words that are often used interchangeably and carelessly.
TL;DR: Sometimes posts just get unlucky! Also, your style is on the verbose side, and I am still somewhat confused about your value proposition.
It seems like your frustration comes from a lack of responses. Often, a lack of response is just down to luck and how the LW algorithm works (exponential time decay). Maybe you posted at a time when most forum users are asleep, and/or you are running against the headwind of literally one of the most popular articles of all time. Sometimes you just get unlucky! Twice, even!
However, I can also say that the content is not written very legibly. It took me a long time to get to the actual punch line and understand what you really want: “Can someone technical say something about how you want these terms to be used?”
In addition, the post is written very verbosely. It takes a long time to get to the point and is not clear about what it wants until the very end. It doesn’t say how you are going to do this, or whether it is even worth engaging with you, because it is not clear how you’d help delineate the word boundary from your position. I am still unsure about the exact value proposition: how would better delineating these words lead to a reduced p(doom)?
Keep trying!
Thanks so much for the thoughtful feedback! You’re absolutely right about the verbosity (part of the lawyer curse, I’m afraid) but that’s exactly why I’m here.
I really value input from people working closer to the technical foundations, and I’ll absolutely work on tightening the structure and making my core ask more legible.
You actually nailed the question I was trying to pose:
“Can someone technical clarify how they believe these terms should be used?”
As for why I’m asking: I work in AI Governance for a multinational, and I also contribute feedback to regulatory initiatives adjacent to the European Commission (as part of independent policy research).
One challenge I’ve repeatedly encountered is that regulators often lump safety and security into one conceptual bucket. This creates risks of misclassification, like treating adversarial testing purely as a security concern, when the intent may be safety-critical (e.g., avoiding human harm).
So, my goal here was to open a conversation that helps bridge technical intuitions from the AI safety community into actionable regulatory framing.
I don’t want to just map these concepts onto compliance checklists; I want to understand how to reflect technical nuance in policy language without oversimplifying or misleading.
I’ll revise the post to be more concise and frontload the value proposition. And if you’re open to it, I’d love your thoughts on how I could improve specific parts.
Thanks again, this kind of feedback is exactly what I was hoping for!
People on LessWrong aren’t driven by the motivation to do useful work, but by the craving to read something amusing or say something smart and witty.
You are asking them to do useful work by giving you important advice, which requires more self control than people here have.
Maybe instead of asking for feedback on your entire framework, it will be motivationally easier for them if you divide it into smaller, bite-sized Question Posts and ask one every few days?
You can always hide background information and context in collapsible sections.
You can use multiple collapsible sections, one for each background info topic, so people can skip the ones which they already know about and which bore them.
Anything you say outside a collapsible section should not refer to anything inside one; otherwise you force people to read the collapsible section, which defeats the purpose.
An alternative to collapsible sections is linking to your previous posts, but that only works if your previous posts fit well with your current post.
Typo
Maybe change “prevent” to “protect.” Grumpy old users discriminate against new users, and a mere typo near the start can “confirm their suspicions” that this is another low-quality post.
Writing like a lawyer
Lawyers and successful bloggers/authors have the opposite instincts: lawyers try to be as thorough as possible while successful bloggers try to convey a message in as few words as possible.
Thank you! This helps me a lot. I will hide the bits about the AI Act in collapsible sections, and I will correct this typo.
One thing I’ve noticed, though: most “successful” posts on LW are quite long and detailed, almost paper-length. I thought that by making my post shorter, I might lose nuance.
People’s attention spans vary dramatically when the topic is something cool and amusing, but my vague opinion is that important policy work is necessarily a little less cool.
I could be completely wrong. I haven’t succeeded in writing good posts either. So please don’t take my advice too seriously! I forgot to give this disclaimer last time.
Random note: LessWrong has its internal jargon, where they talk about “AI Notkilleveryoneism.”
The reason is that the words “AI safety” and “AI alignment” have been heavily abused by organizations doing Safetywashing. See some of the discussion here.[1]
I’m not saying you should adopt the term “AI Notkilleveryoneism,” since policymakers might laugh at it. But it doesn’t hurt to learn about this drama.
Policy work is 100% less cool XD. But it should be concerning for us all that the vast majority of policy makers I’ve talked to did not even know that such a thing as “mechanistic interpretability” exists, and think that alignment is some sort of security ideal…
So what I am doing here may be a necessary evil.
Hmm! If you talk to policymakers face to face, that is something you can leverage to get LessWrong folks interested in you!
How many have you talked to, how many will you talk to, and what level are they on?
You might make a short Question Post highlighting that you’re a lawyer talking to some policymakers. Even if it’s a small number and they are low level, a little bit is far better than nothing.
Then you might ask an open-ended question, “what should I say to them?”
And you can include what you are currently saying to them in collapsed sections, or link to previous posts. Maybe summaries of past conversations can be put in collapsed sections too.
I’m not sure. I’m definitely not an expert in good LW posts, but my intuition is this one might get a better response.
My feeling is that a ton of people on LessWrong are dying to make their message known to policymakers, but their messages fall on deaf ears. (Arguably, I am one of them: I once wrote this and got ignored when I cold-emailed policymakers.)
Someone who actually talks to policymakers (albeit European ones… haha) would be the most welcome.
“Albeit European ones” made me laugh so much hahaha. Sorry to disappoint XD. Yes, mainly EU and UK based: members of the European Commission’s expert panel (I am a member too, but I only joined very recently) and influential “think tanks” here in Europe that provide feedback on regulatory initiatives, like the GPAI Codes of Practice.
I will read your post, btw! I am sick of shallow AI Risk statements based on product safety legislation that does not account for the evolving, unpredictable nature of AI risk. Oh well.
I will gather more ideas and will post a Quick take as you’ve advised, that was a great idea, thank you!
:) that’s great.
I think you are very modest and have a tendency to undersell the influence you have. Don’t do that in your quick take or post; make it clear from the beginning what position you are in and who you get to interact with :D
I’m aware this “safety vs. security” distinction isn’t clean in real-world ML work (e.g., I understand that adversarial robustness spans both).
But it’s proven useful for communicating with policy teams who are trying to assign accountability across domains.
I’m not arguing against existential AI Safety framing, just using the regulatory lens where “safety” often maps to preventing tangible human harms, and “security” refers to model integrity and defense against malicious actors.
If you’ve found better framings or language that have worked across engineering/policy interfaces, I’d love to hear them.
Grateful for your thoughts; please tell me where this falls short of your technical experience.