My comment was primarily judging your abstract and why it made me
feel weird/hesitant to read the paper. The abstract is short, but it
is important to optimize so that your hard work gets the proper
attention!
OK, that clarifies your stance. Your feeling weird definitely created
a weird vibe in the narrative structure of your comment, a vibe that I
picked up on.
(I had about half an hour at the time; I read about 6 pages of your
paper to make sure I wasn’t totally off-base, and then spent the rest
of the time composing a reply.)
Your writing it quickly, in about half an hour, also explains a lot about
how it reads.
it’s returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.
I guess we have established by now that the paper is not about your
version of intuitive-corrigibility.
For my analysis of intuitive-corrigibility, see the contents of the post
above. My analysis is that intuitions on corrigibility are highly
diverse, and have gotten even more diverse and divergent over time.
You interpret the abstract as follows:
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.
Yes, that is what I am saying in the abstract. Your light rephrasing,
where you add [significantly increases the probability that], indeed
expresses the message I intended to convey.
[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
The phrasing ‘works as intended’ in the abstract is meant to signal
clearly that the layer is designed to produce only those specific
such-and-such formal corrigibility properties, not some broad notion
of ‘intuitive corrigibility’.
So I am guessing you did not pick up on that when reading the abstract.
OK, moving away from the discussion of abstracts, initial impressions,
and feelings and their causes, and towards more technical matters:
But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
In the paper I don’t try to make the agent’s world model distinguish between
‘humans are pressing the button’ and ‘a rock fell on the button’.
The ‘works as intended’ is that any button press, for whatever reason,
is supposed to stop the agent.
So this is a design tradeoff I am making.
Specifically, it is driven by a lack of trust that any ML-based ‘human versus rock’
discrimination system will work reliably enough in every possible future
emergency. The tradeoff is to accept the risk that a stray
rock will stop the agent accidentally, potentially causing some level
of inconvenience.
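To make this concrete, here is a minimal sketch of that stop-button behavior. This is not code from the paper: the interface names (read_button_signal, choose_action, apply, perform_safe_shutdown) are hypothetical stand-ins, and the point is only that the stop condition keys off the raw button bit, not off any learned guess about what caused the press.

```python
# Minimal illustrative sketch, not the construction from the paper.
# All interface names below are hypothetical placeholders.

def run_agent(world, policy, max_steps=10**6):
    """Run the agent until the binary stop button reads 1, for any reason."""
    for t in range(max_steps):
        # The button is a one-bit sensory channel: 0 = not pressed, 1 = pressed.
        # The agent does not try to infer whether a human or a stray rock
        # pressed it; any press triggers the stop. That is the tradeoff above.
        if world.read_button_signal() == 1:
            world.perform_safe_shutdown()
            return t  # steps taken before stopping
        action = policy.choose_action(world.observe())
        world.apply(action)
    return max_steps
```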
Even with black-box learned world models, high levels of robustness
can be achieved by a regime of testing on-distribution and then
ensuring that the agent environment never goes off-distribution.
This is not going to happen for AGI, since we might not survive
testing on-distribution,
You are saying we might not survive AGI. This kind of observation
is almost a tautology, and therefore not very interesting.
What interests me is identifying techniques that can increase the
chance of survival. Drive down the risk. Admit that zero risk is
impossible and move on. There are plenty of risk-reduction techniques
out there if you look.
and how would we ensure that the environment
“stays on-distribution”? Is that like, pausing the world forever?
Here is a paperclip optimizer example of ‘ensuring that the agent
environment never goes off-distribution’. Your design goals are 1) to
keep your paperclip-optimizing AGI agent inside your paperclip
factory, and 2) to make sure that crazy people and/or radically
new technologies do not manage to get into the factory.
In other words, keep the environment inside the factory
on-distribution, and keep the agent in the factory.
Yes, you will never achieve these two goals with 100% certainty.
Your goal is to drive down the risk of failure on both of them.
For goal 1), there are many techniques that come to mind, and you probably
want to use them all:
1a) physical barriers that the agent cannot breach with the resources it has at hand,
1b) active agent oversight, where oversight has a kill switch and is not afraid to use it,
1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory, and
1d) automatic safety interlocks. (A rough code sketch of 1c and 1d follows below.)
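As a rough illustration only, and not a claim about how any real implementation would look, here is what measures 1c) and 1d) might amount to in code. The penalty constant, function names, and sensor interface are all hypothetical.

```python
# Rough illustrative sketch of measures 1c) and 1d) above; all names and
# values are hypothetical, not taken from the paper.

PENALTY_OUTSIDE_FACTORY = 1e6  # large constant; its exact size is a design choice


def shaped_reward(base_reward: float, agent_inside_factory: bool) -> float:
    """1c) A penalty term in the reward function that makes being outside
    the factory strongly unattractive to the agent."""
    if agent_inside_factory:
        return base_reward
    return base_reward - PENALTY_OUTSIDE_FACTORY


def safety_interlock(perimeter_breached: bool, cut_power) -> None:
    """1d) An automatic interlock: a hard-wired response that does not wait
    for human oversight to notice and react."""
    if perimeter_breached:
        cut_power()
```

Note that 1c) tries to change what the agent wants, while 1d) acts from outside the agent entirely; the point of using all of these techniques together is defense in depth, so that no single measure has to be perfect.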
I spent some time in my professional life doing safety engineering for physical systems and
cyber-physical systems, and this is how safety gets done there. Progress is
often expensive, often annoying to certain stakeholders, and not always
philosophically very satisfying. But you cannot have everything.