If I could magically ensure that plan-space only contains plans that are not hypertuned-fooling-plans (they might try, but will most likely be figured out), would you say that grader-optimization then gives us an aligned AI? Or are there other failure modes that you see?
No, there are other failure modes due to unnaturality. Here’s something I said in private communication:
Some of my unease comes from putting myself in the shoes of the grader.
>Be me
>time-limited simulation forced to grade how happy the world will be under some plan proposed by this monomaniacal "actor" AI that only wants my number to come out high
>tfw
>ok, whatever
>time to get to work
>...
>looks like this plan is pretty good
>helping at a soup kitchen
>holding doors open
>working with kids
>notbad
>what number do I give it?
>Uh... 20?
>What if my other invocations got anchored on a higher number right before this process
>what's the Schelling procedure
>...
>Maybe my credence that this is the most kind plan the actor can come up with
>hell, man, I don't know.
>.1?
>(The variance and path-dependence on the Schelling procedure are going to be crazy.)
This seems like another way the grader gets “outmaneuvered”, where a similarly sophisticated actor can abstractly peer down many execution paths and select for unwisdom and upwards errors.
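A toy way to see the “select for upwards errors” dynamic (a minimal sketch under the assumption that each grader invocation returns the plan’s true value plus independent noise; all the specific numbers are arbitrary):

```python
import random

random.seed(0)

N_PLANS = 100_000      # how many plans the actor can get graded
GRADER_NOISE = 5.0     # per-invocation error in the grader's estimate

# True goodness of each plan: most are mediocre, none are spectacular.
true_values = [random.gauss(0, 1) for _ in range(N_PLANS)]

# Each grader invocation sees one plan in isolation and returns a noisy estimate.
grades = [v + random.gauss(0, GRADER_NOISE) for v in true_values]

# The actor argmaxes over the grades...
best_idx = max(range(N_PLANS), key=lambda i: grades[i])

# ...which systematically selects plans whose *error* is large and positive,
# not plans that are actually good.
print("grade of selected plan:      ", round(grades[best_idx], 2))
print("true value of selected plan: ", round(true_values[best_idx], 2))
print("true value of best real plan:", round(max(true_values), 2))
```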
On the cost argument: you seem to be saying that it will need to pay a cost in compute for both generating and pruning the adversarial examples, which will make it cost more than alternative architectures. Why do you think that this cost isn’t compensated by the fact that you’re searching over more plans and so have access to more “good options” too?
So, clarification: if I (not a grader-optimizer) wanted to become a grader-optimizer while pursuing my current goals, I’d need to harden my own evaluation procedures to keep up with my plan-search now being directed towards adversarial plan generation.
Furthermore, for a given designer-intended task (e.g. “make diamonds”), to achieve that with grader-optimization, the designer pays in the extra effort they need to harden the grader relative to just… not evaluating adversarial plans to begin with.
Given an already pointed-to/specified grader, the hardening is already baked into that grader, and so both evaluation- and values-child should come out about the same in terms of compute usage.
I think that a values-executing AGI can also search over as many plans as actually make sense; I don’t think its options are limited or anything. But it’ll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. “don’t think about basilisks”).
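To gesture at that contrast in toy terms (a sketch under my own assumptions, e.g. that 1% of unrestricted plans exploit the grader and that an exploit inflates the grade by a fixed amount; this is an illustration, not anyone’s actual proposal):

```python
import random

random.seed(1)

def make_plan(adversarial: bool):
    """Toy plan as (true_value, grade the grader assigns)."""
    true_value = random.gauss(0, 1)
    # Adversarial plans exploit the grader: graded far above their true value.
    grade = true_value + (10.0 if adversarial else random.gauss(0, 0.1))
    return true_value, grade

# Grader-optimization: argmax the grade over *all* reachable plans,
# including the ones crafted to fool the evaluation.
all_plans = [make_plan(adversarial=random.random() < 0.01) for _ in range(10_000)]
picked_by_grader_opt = max(all_plans, key=lambda p: p[1])

# Values-execution (as modeled here): the generator itself is restricted, so
# grader-fooling plans never get produced or evaluated in the first place.
restricted_plans = [make_plan(adversarial=False) for _ in range(10_000)]
picked_by_values = max(restricted_plans, key=lambda p: p[1])

print("grader-optimizer's pick, true value:", round(picked_by_grader_opt[0], 2))
print("values-executor's pick, true value: ", round(picked_by_values[0], 2))
```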
You’re making strong claims about us needing to avoid, as much as possible, going down the route of grader optimization. Why do you expect that there is no clean/clear-cut characterization of the set of adversarial plans (or a superset of it) that we could just forbid, and then go on our merry way building grader optimizers?
I don’t see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide. EDIT: I’m most confident in this point if you want your AI to propose plans which you can’t generate but can maybe verify.
See the last 10+ years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.
Thanks for taking time to answer my questions in detail!
About your example for other failure modes
Is it meant to point at the actor’s ability to make the plan more confusing/harder to evaluate? That is, are you pointing at the actor’s ability to “obfuscate” its plan in order to get a high reward?
If so, it’s not clear to me why this would be valuable for the actor to do. How is it supposed to get a better reward from confusion alone? If it had another agenda (making paperclips instead of diamonds, for example), then obfuscation would clearly be valuable, letting it work on its main goal. But here its goal is to make the evaluation come out high, and so confusion doesn’t seem like it helps.
About the cost/competitiveness argument
I think that a values-executing AGI can also search over as many plans as actually make sense; I don’t think its options are limited or anything. But it’ll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. “don’t think about basilisks”).
This is the part where I currently cannot reconstruct your model (maybe because I haven’t read shard theory in a while). Because from an abstract perspective, the tricky part of starting from significantly more limited plans is how to expand the range of plans without allowing adversarial ones. And I don’t have a model of what mechanism you think makes it easier to go from the bottom up (from few plans accepted to more plans accepted) safely rather than from the top down (from all plans accepted to fewer plans accepted) safely.
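A crude way to phrase the asymmetry I have in mind (my framing, with made-up plan names; not a claim about either architecture): a top-down filter forbids only the adversarial patterns it can name, so anything unrecognized slips through, while a bottom-up generator only emits what has been vetted so far and has to be deliberately expanded:

```python
ALL_PLANS = [
    "volunteer at a soup kitchen",
    "hold doors open",
    "work with kids",
    "unrecognized plan that happens to exploit the grader",
]

BLACKLIST = {"basilisk"}                                        # adversarial patterns we know to name
WHITELIST = {"volunteer at a soup kitchen", "hold doors open"}  # plans vetted so far

# Top-down: accept everything the rule doesn't catch -- the unnamed exploit survives.
top_down = [p for p in ALL_PLANS if not any(bad in p for bad in BLACKLIST)]

# Bottom-up: accept only what's already vetted -- narrow, but nothing unvetted gets through.
bottom_up = [p for p in ALL_PLANS if p in WHITELIST]

print(top_down)
print(bottom_up)
```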
About the “negative result” of not even trying to do grader optimization
I don’t see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide.
See the last 10+ years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.
But why don’t you see that it should exist? What are the parts of your model that make this prediction? Because by definition, no innovation or piece of progress was something that most people expected to exist beforehand; that’s what it means to figure out something new. So you need an actual reason for why you’re not falling into that trap of only being able to imagine what already exists.
Here are some guesses:
Specific scientific claim that there should be no true name/clean concept of what is adversarial for a large class of cognitive architectures (in opposition to the Natural Abstraction Hypothesis, I guess?)
Human limitation claim that even though there might be such a true name/clear concept, we have neither the time nor the ability to get it right under the current conditions
Security mindset claim that because the actor is allowed to get arbitrarily smart, you should expect that if it optimizes against you, it wins.
For the second point, what do you make of the claim that ridiculously few researchers have even attempted to work on the problem in the last 10 years, compared to almost all easier fields of knowledge? Do you think that this should not alter the evidence of absence you get from the absence of evidence?
Is it meant to point at the actor’s ability to make the plan more confusing/harder to evaluate? That is, are you pointing at the actor’s ability to “obfuscate” its plan in order to get a high reward?
No, the point is that the grader can only grade the current plan; it doesn’t automatically know what its counterfactual branches output. The grader is scope-limited to its current invocation. This makes consistent grading harder (e.g. for the soup-kitchen plan vs. political activism, neither invocation knows what number the other call to the grader would give, so they can’t trivially agree on a consistent scale).
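A minimal sketch of that consistency problem, under the toy assumption that each scope-limited call settles on its own arbitrary anchor and scale for the number it outputs:

```python
import random

random.seed(2)

def grader_invocation(plan_true_value: float) -> float:
    """One scope-limited call: it sees only its own plan and (in this toy model)
    picks its own arbitrary anchor and scale for the output number."""
    anchor = random.uniform(0, 50)    # no shared Schelling point across calls
    scale = random.uniform(0.5, 2.0)
    return anchor + scale * plan_true_value

# Stipulate that activism is the genuinely better plan here.
soup_kitchen_true, activism_true = 3.0, 5.0

# Two independent invocations; neither knows what number the other produced.
soup_kitchen_grade = grader_invocation(soup_kitchen_true)
activism_grade = grader_invocation(activism_true)

print("soup kitchen graded:", round(soup_kitchen_grade, 1))
print("activism graded:    ", round(activism_grade, 1))
# Which grade comes out higher mostly tracks the arbitrary anchors, so an actor
# comparing the two numbers is comparing noise, not the plans' true values.
```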
Really appreciate the good questions!