Erik Jenner comments on Reframing inner alignment

Erik Jenner 11 Dec 2022 20:11 UTC
LW: 8 AF: 6
2
AF
I’m pretty confused as to how some of the details of this post are meant to be interpreted, so I’ll focus on my two main questions; answers to those would probably clear up the rest.
Reward Specification: Finding a policy-scoring function $J(π)$ such that (nearly-)optimal policies for that scoring function are desirable.
If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases, is that right? So in this decomposition, “reward specification” would be responsible for detecting deceptive misalignment? If so, that seems very non-standard and should probably be called out more. In particular, all the problems usually clustered in inner alignment would then have to be solved as part of reward specification as far as I can tell. If this is not what you meant (i.e. if J just evaluates the policy on some distribution), then I don’t see how the proof step $|J(π) - \max_{π^{*}} J(π^{*})| < ϵ ⟹ \mathrm{safelyMitigatesXRisk}(π)$ could hold.
If an “inner alignment” solution to Adequate Policy Learning is not invariant to extensional equivalence of policies, it is certainly not robust: formally, it cannot be continuous relative to any natural topology on policy-space, let alone Lipschitz-continuous relative to a natural metric on policy-space.
Can you say more about what you mean by this? By the solution being invariant to extensional equivalence, do you mean that two policies that are extensionally equal should have the same probability (density) of being sampled using our training procedure? That would seem like an extremely strong and unnecessary condition. For continuity, I guess you mean that some density of $T : Ω \to \mathrm{typeof}(π)$ would be discontinuous if T wasn’t invariant? (Continuity of T itself seems irrelevant, even assuming $Ω$ has some natural topology.) I don’t see why that is bad, or “non-robust” in any important sense. I currently understand your point here as: if the policy learning method isn’t invariant under extensional equivalence, then it will assign very different probabilities to policies that have almost the same (or even exactly the same) behavior. But why would that be an issue? We just want one good policy, and if putting restrictions on its internal implementation makes that easier to find, why shouldn’t we do that? In fact, if we had a completely deterministic procedure that just spits out one of the many good policies, that seems like a perfectly fine solution even though it’s maximally “discontinuous” (the density doesn’t even exist as a function).
- davidad 11 Dec 2022 20:23 UTC
  LW: 4 AF: 2
  0
  AF Parent
  Thanks, this is very helpful feedback about what was confusing. Please do ask more questions if there are still more parts that are hard to interpret.
  
  To the first point, yes, $J$ evaluates $π$ on all trajectories, even off-distribution. It may do this in a Bayesian way, or a worst-case way. I claim that $J$ does not need to “detect deceptive misalignment” in any special way, and I’m not optimistic that progress on such detection is even particularly helpful, since incompetence can also be fatal, and deceptive misalignment could Red Queen Race ahead of the detector.
  
  Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. $J$ can detect this by simply detecting bad stuff.
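  As a minimal sketch of this point (a toy illustration, not part of the proposal itself): the snippet below compares a scoring function that only averages over training-like environments with two that range over every environment, one in expectation under a prior and one in the worst case. The environment names, the two policies, the outcome scores, and the prior are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical toy setup: an "environment" is just a label, and a policy
# maps an environment to an action.
TRAIN_ENVS: List[str] = ["lab_1", "lab_2"]
ALL_ENVS: List[str] = TRAIN_ENVS + ["deployment"]

Policy = Callable[[str], str]


def honest(env: str) -> str:
    return "help"


def deceptive(env: str) -> str:
    # Behaves well wherever it expects to be evaluated, badly elsewhere.
    return "help" if env.startswith("lab") else "seize_power"


def outcome_score(action: str) -> float:
    # Assumed stand-in for "detecting bad stuff" on a trajectory.
    return 1.0 if action == "help" else -100.0


def j_on_distribution(policy: Policy) -> float:
    # Averages over training-like environments only.
    scores = [outcome_score(policy(env)) for env in TRAIN_ENVS]
    return sum(scores) / len(scores)


def j_bayesian(policy: Policy, prior: Dict[str, float]) -> float:
    # Expected score under a prior over all environments.
    return sum(p * outcome_score(policy(env)) for env, p in prior.items())


def j_worst_case(policy: Policy) -> float:
    # Minimum score over all environments, including off-distribution ones.
    return min(outcome_score(policy(env)) for env in ALL_ENVS)


PRIOR = {"lab_1": 0.45, "lab_2": 0.45, "deployment": 0.10}

for name, policy in [("honest", honest), ("deceptive", deceptive)]:
    print(name, j_on_distribution(policy), j_bayesian(policy, PRIOR), j_worst_case(policy))
```

  In this toy, the on-distribution score cannot tell the two policies apart, while both all-environment scores rank the deceptive policy lower without inspecting its internals.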
  
  If there’s a sneaky hard part of Reward Specification beyond the obvious hard part of defining what’s good and bad, it would be “realistically defining the environment.” (That’s where purely predictive models come in.)
  - davidad 11 Dec 2022 20:38 UTC
    LW: 3 AF: 2
    0
    AF Parent
    It is perhaps also worth pointing out that computing $J$ is likely to be extremely hard. Step 1 is just about defining $J$ as a unique mathematical function; the problem of tractably computing guaranteed bounds on $J$ falls under step 2.
    
    My guess is that the latter will require some very fancy algorithms of the model-checking variety and some very fancy AI heuristics that plug into those algorithms in a way that can improve their performance without potentially corrupting their correctness (basically because they’re Las Vegas algorithms instead of Monte Carlo algorithms).
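    A minimal sketch of that separation between correctness and speed, assuming a trusted exact checker and an untrusted heuristic (`verified_search`, the candidate set, and the predicate are all made up for illustration): the heuristic only chooses the order in which candidates are tried, and every answer is checked before it is returned. This is deterministic, so it is only in the spirit of a Las Vegas algorithm: a bad heuristic costs time but cannot produce a wrong answer.

```python
from typing import Callable, Iterable, Optional, Tuple


def verified_search(
    candidates: Iterable[int],
    is_valid: Callable[[int], bool],
    heuristic: Callable[[int], float],
) -> Tuple[Optional[int], int]:
    """Return (answer, number_of_checks).

    The heuristic is untrusted: it only orders the candidates.
    The checker `is_valid` is trusted and decides what may be returned.
    """
    checks = 0
    for candidate in sorted(candidates, key=heuristic):
        checks += 1
        if is_valid(candidate):
            return candidate, checks
    # Every candidate was checked, so "no answer" is also a verified result.
    return None, checks


def is_square_root_of_2025(c: int) -> bool:
    # Hypothetical exact check standing in for a model checker.
    return c * c == 2025


CANDIDATES = range(1, 101)

# A helpful heuristic (a near-miss guess) and an adversarial one.
good_answer, good_checks = verified_search(
    CANDIDATES, is_square_root_of_2025, lambda c: abs(c - 44)
)
bad_answer, bad_checks = verified_search(
    CANDIDATES, is_square_root_of_2025, lambda c: -abs(c - 45)
)

print(good_answer, good_checks)
print(bad_answer, bad_checks)
```

    Both calls return the same verified answer; only the number of checks differs.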
    - Erik Jenner 11 Dec 2022 21:11 UTC
      LW: 3 AF: 2
      0
      AF Parent
      Thanks, computing J not being part of step 1 helps clear things up.
      I do think that “realistically defining the environment” is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still differs from the real world in ways that allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.
      Similar to my comment in the other thread on deception detection, most of my optimism about alignment comes from worlds where we don’t need to define the environment in such a robust way (doing so just seems hopelessly intractable to me). Not sure whether we mostly disagree on the hardness of solving things within your decomposition, or on how good the chances are that we’ll be fine with less formal guarantees.
      - davidad 11 Dec 2022 21:40 UTC
        LW: 3 AF: 2
        0
        AF Parent
        To the final question: for what it’s worth, to contextualize my perspective, I think my inside view is simultaneously:
        
        unusually optimistic about formal verification
        unusually optimistic about learning interpretable world-models
        unusually pessimistic about learning interpretable end-to-end policies
      - davidad 11 Dec 2022 21:23 UTC
        LW: 2 AF: 1
        0
        AF Parent
        I agree: if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in $J$, that would be a vulnerability in the kind of alignment plan I’m gesturing at. However, aligning a predictive model of reality to reality is “natural” compared to normative alignment. And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance between the model and reality; I don’t know if this is exactly formally correct, but I think there’s some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don’t need an astronomically perfect model of reality to have any hope of its not being exploited. Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.
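        One way to make the “1% TV distance, 1% failure chance” intuition concrete is the standard event-wise bound for total variation distance, sketched below. The framing is an assumption rather than something stated in this thread: $P$ and $Q$ are taken to be the trajectory distributions induced by the same fixed policy in reality and in the model, $A$ is the event “alignment failure”, $\delta$ is a bound on the model’s failure probability, and $\mathcal{C}$ is a credal set. Note that TV distance between trajectory distributions can be much larger than per-step model error, so this is not a claim that a 1% per-step error is harmless.

```latex
% P = trajectory distribution in reality, Q = trajectory distribution under
% the model, both for the same fixed policy; A = the event "alignment failure".
\[
  d_{\mathrm{TV}}(P, Q) \;=\; \sup_{A} \,\lvert P(A) - Q(A) \rvert
  \quad\Longrightarrow\quad
  P(A) \;\le\; Q(A) + d_{\mathrm{TV}}(P, Q).
\]
% With a credal set C and a guarantee Q(A) <= delta for every Q in C,
% the bound holds for each Q in C, hence for the nearest one:
\[
  P(A) \;\le\; \delta + \inf_{Q \in \mathcal{C}} d_{\mathrm{TV}}(P, Q).
\]
```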
  - TurnTrout 20 Dec 2022 19:15 UTC
    LW: 2 AF: 2
    0
    AF Parent
    Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. $J$ can detect this by simply detecting bad stuff.
    I think it extremely probable that there exist policies which exploit adversarial inputs to J such that they can do bad stuff while getting J to say “all’s fine.”
    - davidad 20 Dec 2022 19:46 UTC
      LW: 2 AF: 1
      0
      AF Parent
      For most $J$s I agree, but the existence of any adversarial examples for $J$ would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist, and that humans discover, natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)
- davidad 11 Dec 2022 20:30 UTC
  LW: 3 AF: 2
  0
  AF Parent
  To the second point, I meant something very different—I edited this sentence and hopefully it is more clear now. I did not mean that $T$ should respect extensional equivalence of policies (if it didn’t, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies).
  
  Instead, I meant that a training story that involves mitigating your model-free learning algorithm’s unbounded out-of-distribution optimality gap by using some kind of interpretability loop where you’re applying a detector function to the policy to check for inner misalignment (and using that to guide policy search) has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms which make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask from an inner-misalignment detector that is load-bearing in an existential-safety strategy.
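  A toy sketch of this vulnerability, under heavy simplifying assumptions: the `Policy` class, the `internals` tags, and both detectors are hypothetical stand-ins, not anyone’s actual interpretability method. Two policies behave identically on every observation, but only one of them exposes the feature that the structure-based detector looks for.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

OBSERVATIONS = range(4)  # assumed finite toy observation space


@dataclass(frozen=True)
class Policy:
    internals: Tuple[str, ...]  # stand-in for whatever the detector can inspect
    act: Callable[[int], str]   # the input-output behavior


def extensionally_equal(p: Policy, q: Policy) -> bool:
    # Same action on every observation, regardless of how it is computed.
    return all(p.act(obs) == q.act(obs) for obs in OBSERVATIONS)


def structural_detector(p: Policy) -> bool:
    # Flags a policy only if a recognizable "trigger" feature is present.
    return "trigger_branch" in p.internals


def behavioral_detector(p: Policy) -> bool:
    # Depends only on behavior, so it respects extensional equivalence.
    return any(p.act(obs) == "defect" for obs in OBSERVATIONS)


plain = Policy(
    internals=("trigger_branch", "cooperate_head"),
    act=lambda obs: "defect" if obs == 3 else "cooperate",
)

ACTIONS = ("cooperate", "defect")
obfuscated = Policy(
    internals=("arithmetic_block", "lookup_head"),
    act=lambda obs: ACTIONS[obs // 3],  # same behavior, different encoding
)

print(extensionally_equal(plain, obfuscated))
print(structural_detector(plain), structural_detector(obfuscated))
print(behavioral_detector(plain), behavioral_detector(obfuscated))
```

  The behavioral detector works here only because the observation space is tiny and enumerable; the sketch is meant to show what failing to respect extensional equivalence looks like, not to suggest that exhaustive behavioral checking is available in practice.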
  - Erik Jenner 11 Dec 2022 21:08 UTC
    4 points
    0
    Parent
    FWIW, I agree that respecting extensional equivalence is necessary if we want a perfect detector, but most of my optimism comes from worlds where we don’t need one that’s quite perfect. For example, maybe we prevent deception by looking at the internal structure of networks, and then get a good policy even though we couldn’t have ruled out every single policy that’s extensionally equivalent to the one we did rule out. To me, it seems quite plausible that all policies within one extensional equivalence class are either structurally quite similar or so complex that our training process doesn’t find them anyway. The main point of a deception detector wouldn’t be to give us guarantees, but instead to give training a sufficiently strong inductive bias against simple deceptive policies, such that the next-simplest policy is non-deceptive.
  - Erik Jenner 11 Dec 2022 20:43 UTC
    LW: 3 AF: 2
    2
    AF Parent
    I see, that makes much more sense than my guess, thanks!