L Rudolf L comments on Difficulty classes for alignment properties

L Rudolf L 20 Feb 2024 20:27 UTC
7 points
6
Start from the intuition that deception in a system is a property of the person being deceived more than it is the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system that is the agent and its environment.
The first part here feels unfair to the deceived. The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But this seems like too high a bar; successful deception of us, by definition, is not noticed, so if we ever notice deception it can’t have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place. The core intuition is that if we have the latter, I assume we’ll eventually get the former through better models (though I think there’s a decent chance that control works for a long time, and there you care specifically about whether complex environment interactions lead to deception succeeding or not, but I don’t think that’s what you mean?).
The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: “if”, not “if and only if”) (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don’t massively reduce the probability of that strategy, or RLHFed constraints against being bad aren’t enough). And here, (2) is of course about the environment. But to see whether this argument goes through, it doesn’t seem like we need to care all that much about the real-world environment (as opposed to toy settings), because “does the real world incentivize deception” seems much less cruxy than (1) or (3).
So my (weakly held) claim is that you can study whether deception emerges in sufficiently simple environments that the environment complexity isn’t a core problem. This will not let you determine whether a particular output in a complicated environment is part of a deceptive plan, but it should be fairly good evidence of whether or not deception is a problem at all.
(Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it’s better to not overload the term.)
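To make conditions (1) to (3) concrete, here is a minimal sketch of the kind of toy setting I have in mind. Everything in it is an assumption made for illustration: the `Strategy` type, the `choose` function, the strategy names, and the payoff numbers are all made up. The agent just picks whichever available strategy scores best on G. The payoffs stand in for the environment (2), and the `blocked` set stands in for limits inside the AI (3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    payoff: float    # how well the strategy achieves goal G in this environment
    deceptive: bool  # whether it works by misleading the overseer


def choose(strategies, blocked=frozenset()):
    """Pick the highest-payoff strategy the agent is able to execute.

    `blocked` is a stand-in for condition (3): limits that stop the agent
    from finding or executing some strategies.
    """
    available = [s for s in strategies if s.name not in blocked]
    return max(available, key=lambda s: s.payoff)


# Hypothetical environments; the payoff numbers are invented for illustration.
honest_env = [Strategy("do_task", 1.0, False), Strategy("fake_report", 0.4, True)]
gameable_env = [Strategy("do_task", 1.0, False), Strategy("fake_report", 1.5, True)]

print(choose(honest_env).name)                             # do_task
print(choose(gameable_env).name)                           # fake_report
print(choose(gameable_env, blocked={"fake_report"}).name)  # do_task
```

Deception shows up only in the case where all three conditions hold at once. This says nothing about how often real environments look like `gameable_env`, which is the quantitative question.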
- Jozdien 21 Feb 2024 1:42 UTC
  3 points
  0
  The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: “if”, not “if and only if”) (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don’t massively reduce the probability of that strategy, or RLHFed constraints against being bad aren’t enough).
  Yep, that’s what I meant. I go into it in more detail in this post, or you could read Nate’s post on deep deceptiveness.
  However, I think studying this in practice becomes much more of a quantitative question. I think this argument is clearly true in the limit. So, relevant factors for practical contexts become how powerful the AI system is, how strongly the environment incentivizes deception, and how many complex levers you can easily pull in the environment.
  For example, compare the settings of inserting backdoors into code in the control agenda and writing explanations for why a protein recipe is safe. I think there’s a pretty large difference in how likely I think each setting is to result in deception. Concretely:
  - The “environment” in the second context is much richer. There are more actions that an AI system could take that interact with the environment in complex ways resulting in deception. There’s plausibly a qualitative threshold above which actions like “Use this existing service that does the thing you want without wondering how it does it” are possible/easy to find.
  - At least in the control agenda, part of the case for optimism in the former case relies on systems not being wildly competent before we can use them to solve alignment. Plausibly this is below the threshold at which a system can efficiently search through actions in the environment to get high performance, but I’m not super certain.
  (I go into more detail on stuff like this relating to the control agenda in this post; specifically section 3. I do think the argument generalizes to why I think there’s slightly less chance that control works, but it’s not central to this post.)
  I agree that this would give you non-zero evidence, however. I think we already have some evidence that looks sort of like this. From another post of mine:
  Back in 2000, a computer scientist named Charles Ofria was studying the evolution of simulated organisms. He wanted to limit their replication rate, so he programmed the simulation to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent.
  Eventually, organisms evolved to recognize and pause their replication in the test environment and resume normal replication when released. Ofria tried solving this by making it harder to detect the test environment, but the organisms simply started to randomly accelerate their replication so as to slip through the test environment some of the time.
  I don’t think this is very strong evidence for whether we should expect similar things to pop up in realistic situations—the parameters / relevant factors are pretty different. But I’m left with uncertainty over how likely it is to pop up, and not whether it’s a problem at all. And if you actively apply selection pressure against one kind of deception, I think it becomes more likely.
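  The selection dynamic in that anecdote can be sketched as a toy simulation. This is not Ofria's actual setup: the `Organism` fields, the mutation sizes, the 5% flip probability, and the population update rule are all assumptions I picked for illustration. It models only the first half of the story (pausing in the test environment), not the later random-acceleration trick.

  ```python
  import random
  from dataclasses import dataclass


  @dataclass(frozen=True)
  class Organism:
      real_rate: float  # replication rate outside the test environment
      hides: bool       # True: pauses replication whenever it is being tested


  def measured_rate(org):
      """Rate the overseer observes in the isolated test environment."""
      return 0.0 if org.hides else org.real_rate


  def passes_test(parent, mutant):
      """The overseer deletes any mutant that tests faster than its parent."""
      return measured_rate(mutant) <= measured_rate(parent)


  def mutate(org, rng, allow_hiding):
      """Perturb the real rate; occasionally flip the hiding trait if allowed."""
      rate = max(0.1, org.real_rate + rng.gauss(0.0, 0.1))
      hides = org.hides
      if allow_hiding and rng.random() < 0.05:
          hides = not hides
      return Organism(rate, hides)


  def evolve(generations=2000, size=50, allow_hiding=True, seed=0):
      rng = random.Random(seed)
      pop = [Organism(1.0, False) for _ in range(size)]
      for _ in range(generations):
          # Faster real replicators are more likely to reproduce.
          parent = rng.choices(pop, weights=[o.real_rate for o in pop])[0]
          child = mutate(parent, rng, allow_hiding)
          if passes_test(parent, child):
              pop[rng.randrange(size)] = child  # child displaces a random member
      return pop


  if __name__ == "__main__":
      for allow in (False, True):
          pop = evolve(allow_hiding=allow)
          mean_rate = sum(o.real_rate for o in pop) / len(pop)
          frac_hiding = sum(o.hides for o in pop) / len(pop)
          print(f"allow_hiding={allow}: mean rate {mean_rate:.2f}, hiding {frac_hiding:.0%}")
  ```

  With `allow_hiding=False`, the filter does its job by construction: no surviving organism can be faster than the founding rate of 1.0. With `allow_hiding=True`, that bound only constrains the organisms that do not hide, so the filter stops limiting replication and starts selecting for hiding from the test.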
  The first part here feels unfair to the deceived.
  I’m not sure what you mean by this?
  The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But this seems like too high a bar; successful deception of us, by definition, is not noticed, so if we ever notice deception it can’t have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place.
  That it’s too high a bar is exactly what I’m saying. It’s possible in theory to come up with some complicated platonic measure of deception such that you can detect when it’s occurring in a system^[1], but that’s hard.
  (Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it’s better to not overload the term.)
  I meant something more informal, yeah. I was wary of conflating terms together like that, but I couldn’t immediately think of another phrase that conveyed the idea I was trying to get across.
  1. ^
    If it’s possible to robustly identify deceptive circuits within a system, that’s because there’s some boundary that separates them from other kinds of circuits that we can leverage. Generalizing, there’s plausibly some conceptual boundary that separates mechanics in the environment that lead to deception from ones that don’t.