where these people feel the need to express their objections even before reading the full paper itself
I’d very much like to flag that my comment isn’t meant to judge the contributions of your full paper. My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!
(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn’t totally off-base, and then spent the rest of the time composing a reply.)
Specifically, neither of you is objecting to the actual contents of the paper; you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.
Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:
“That said, the rest of this comment addresses your paper as if it’s proving claims about intuitive-corrigibility.”
Curious. So here is some speculation.
Perhaps I could have flagged this so you would have realized it wasn’t meant as a “rhetorical move”: it’s returning to my initial reaction as I read the abstract, which was that this paper is about intuitive-corrigibility. From the abstract:
A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI).
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started… [I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
This does not parse like a normal-strength mathematical claim. This is a claim about the de facto post-deployment safety properties of “arbitrarily advanced utility maximizing agents.”
Again, I’m not saying that your paper doesn’t have any good contributions. I can’t judge that without further reading. But I am standing by my statement that this is a non-standard claim which I’m skeptical of and which makes me hesitate to read the rest of the paper.
We know very well ‘how to do this’ for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in the world models used by today’s actually existing (non-AGI) AI agents, whether those world models are hard-coded or learned, and there is no big mystery about how this is achieved.
Yes, we know how to do it for existing AI agents. But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
And even that is an unrealistically structured scenario, since it seems like prosaic AGI is quite plausible. Prosaic AGI would be way messier than AIXI, since it wouldn’t be doing anything as clean as Bayes-updating the simplicity prior to optimize an explicit utility function.
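To make the grounding worry concrete, here is a toy sketch of the AIXI-style case (the hypothesis names, priors, and observation stream are invented for illustration): two world-model hypotheses that predict exactly the same readings on the binary off-switch channel stay exactly tied under Bayes-updating, so that channel alone never singles out ‘a human pressed the button’ as a distinct event.

```python
# Toy illustration (hypothetical): a Bayesian learner that only sees a binary
# off-switch channel cannot separate world-hypotheses that produce identical
# observations on that channel.

from fractions import Fraction

# Two hypotheses about what is behind the button press. Prior weights are made up.
HYPOTHESES = {
    "human_pressed_button": Fraction(1, 2),
    "rock_fell_on_button":  Fraction(1, 2),
}

def likelihood(hypothesis: str, observation: int) -> Fraction:
    """Probability that the binary channel reads `observation` (0 or 1).

    Both hypotheses predict the channel reads '1' once the button is down,
    so their likelihoods are identical and carry no distinguishing signal.
    """
    return Fraction(1) if observation == 1 else Fraction(0)

def bayes_update(prior: dict, observation: int) -> dict:
    """One Bayes update over the hypothesis set."""
    unnormalized = {h: p * likelihood(h, observation) for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

posterior = HYPOTHESES
for obs in [1, 1, 1]:          # the channel keeps reading '1'
    posterior = bayes_update(posterior, obs)

print(posterior)
# {'human_pressed_button': Fraction(1, 2), 'rock_fell_on_button': Fraction(1, 2)}
# The posterior never moves: the binary channel alone cannot ground
# "an authorized human pressed the stop button" as a distinct event.
```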
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.
This is not going to happen for AGI, since we might not survive testing on-distribution, and how would we ensure that the environment “stays on-distribution”? Is that like, pausing the world forever?
You seem to be looking for ‘not very narrow sense’ corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self-improvement, where it rebuilds its entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don’t build a system like that if you want to be safe.
I’m not just talking about that; the above shows how symbol grounding is tough even for seemingly well-defined events like “is the off-switch being pressed?”, without any fancy self-improvement.
My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!
OK, that clarifies your stance. Your feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.
(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn’t totally off-base, and then spent the rest of the time composing a reply.)
That you wrote it quickly, in half an hour, also explains a lot about how it reads.
it’s returning to my initial reaction as I read the abstract, which was that this paper is about intuitive-corrigibility.
I guess we have established by now that the paper is not about your version of intuitive-corrigibility.
For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.
You interpret the abstract as follows:
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.
Yes, that is what I am saying in the abstract. Your light rephrasing, where you add [significantly increases the probability that], indeed expresses the message I intended to convey.
[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
The phrasing ‘works as intended’ in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of ‘intuitive corrigibility’.
So I am guessing you did not pick up on that when reading the abstract.
OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, and towards more technical matters:
But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
In the paper I don’t try to make the agent’s world model distinguish between ‘humans are pressing the button’ and ‘a rock fell on the button’. The ‘works as intended’ is that any button press, for whatever reason, is supposed to stop the agent.

So this is a design tradeoff I am making. Specifically, it is driven by a lack of trust that any ML-based ‘human versus rock’ discrimination system will work reliably enough in every possible future emergency. The tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.
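As a purely illustrative sketch of that tradeoff (the names here are hypothetical, and this is not the actual construction from the paper): the stop behavior keys off the raw button signal itself, and deliberately puts no ‘human versus rock’ classifier in the loop.

```python
# Illustrative sketch only (hypothetical names, not the paper's construction):
# the safety layer reacts to the raw button signal itself, deliberately
# ignoring *why* the button went down.

from dataclasses import dataclass

@dataclass
class Percept:
    button_down: bool      # the binary off-switch channel
    camera_frame: bytes    # other sensor data the layer deliberately ignores

class SafetyLayer:
    """Wraps an arbitrary policy; stops the agent on any button press."""

    def __init__(self, policy):
        self.policy = policy
        self.stopped = False

    def act(self, percept: Percept):
        # Design tradeoff: no ML 'human vs. rock' discriminator in the loop.
        # Any press, whatever its cause, switches the agent to a null action.
        if percept.button_down:
            self.stopped = True
        if self.stopped:
            return "no_op"            # stop / shut-down behavior
        return self.policy(percept)   # normal utility-maximizing behavior

# Usage: a rock falling on the button stops the agent just as a human would.
agent = SafetyLayer(policy=lambda p: "make_paperclips")
print(agent.act(Percept(button_down=False, camera_frame=b"")))  # make_paperclips
print(agent.act(Percept(button_down=True,  camera_frame=b"")))  # no_op
print(agent.act(Percept(button_down=False, camera_frame=b"")))  # no_op (stays stopped)
```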
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

This is not going to happen for AGI, since we might not survive testing on-distribution,
You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.

What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.
and how would we ensure that the environment “stays on-distribution”? Is that like, pausing the world forever?
Here is a paperclip optimizer example of ‘ensuring that the agent environment never goes off-distribution’. Your design goals are 1) to keep your paperclip-optimizing AGI agent inside your paperclip factory, and 2) to make sure that crazy people and/or radically new technologies do not manage to get into the factory.

In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.

Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.
For goal 1), there are many techniques that come to mind, and you probably want to use them all:
1a) physical barriers that the agent cannot breach with the resources it has at hand,
1b) active agent oversight, where oversight has a kill switch and is not afraid to use it,
1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory,
1d) automatic safety interlocks (a minimal illustrative sketch of such an interlock follows below).
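For 1d), here is a minimal hypothetical sketch of one possible automatic interlock (the monitored quantities and their bounds are invented for the example): it trips the kill switch whenever the observed factory environment drifts outside the ranges that were covered during on-distribution testing, independently of anything the agent itself wants.

```python
# Hypothetical sketch of an automatic safety interlock (1d): trip the kill
# switch whenever monitored environment readings leave the range that was
# covered during on-distribution testing. Feature names and bounds are
# invented for illustration.

TESTED_RANGES = {
    "conveyor_speed_m_s": (0.0, 2.5),
    "people_in_factory":  (0, 40),
    "wire_feed_temp_c":   (10.0, 60.0),
}

def off_distribution(readings: dict) -> bool:
    """True if any monitored reading is missing or outside its tested range."""
    for name, (low, high) in TESTED_RANGES.items():
        value = readings.get(name)
        if value is None or not (low <= value <= high):
            return True
    return False

def interlock_step(readings: dict, kill_switch) -> None:
    """Called every control cycle, independently of the agent's own reasoning."""
    if off_distribution(readings):
        kill_switch()   # hard stop; do not ask the agent for permission

# Usage:
interlock_step(
    {"conveyor_speed_m_s": 1.2, "people_in_factory": 12, "wire_feed_temp_c": 35.0},
    kill_switch=lambda: print("agent halted"),
)  # prints nothing: readings are on-distribution
interlock_step(
    {"conveyor_speed_m_s": 9.9, "people_in_factory": 12, "wire_feed_temp_c": 35.0},
    kill_switch=lambda: print("agent halted"),
)  # prints "agent halted": readings are off-distribution
```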
I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.