My comment was primarily judging your abstract and why it made me
feel weird/hesitant to read the paper. The abstract is short, but it
is important to optimize so that your hard work gets the proper
attention!
OK, that clarifies your stance. Your feeling weird definitely created
a weird vibe in the narrative structure of your comment, a vibe that I
picked up on.
(I had about half an hour at the time; I read about 6 pages of your
paper to make sure I wasn’t totally off-base, and then spent the rest
of the time composing a reply.)
Your writing it quickly, in about half an hour, also explains a lot about
how it reads.
it’s returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.
I guess we have established by now that the paper is not about your
version of intuitive-corrigibility.
For my analysis of intuitive-corrigibility, see the contents of the post
above. My analysis is that intuitions on corrigibility are highly
diverse, and have gotten even more diverse and divergent over time.
You interpret the abstract as follows:
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.
Yes, that is what I am saying in the abstract. Your light rephrasing,
where you add [significantly increases the probability that], indeed
expresses the message I intended to convey.
[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
The phrasing ‘works as intended’ in the abstract is meant to signal
clearly that the layer is designed to produce only those specific
such-and-such formal corrigibility properties, not some broad notion
of ‘intuitive corrigibility’.
So I am guessing you did not pick up on that when reading the abstract.
OK, moving away from the discussion of abstracts, initial impressions,
and feelings and their causes, and towards more technical matters:
But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
In the paper I don’t try to make the agent’s world model distinguish between
‘humans are pressing the button’ and ‘a rock fell on the button’.
The ‘works as intended’ is that any button press, for whatever reason,
is supposed to stop the agent.
So this is a design tradeoff I am making.
Specifically, it is driven by a lack of trust that any ML-based ‘human versus rock’
discrimination system will work reliably enough in every possible future
emergency. The tradeoff is to accept the risk that a stray
rock will stop the agent accidentally, potentially causing some level
of inconvenience.
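To make this concrete, here is a minimal sketch of that stop-button behavior. This is not code from the paper: the interface names (read_button_signal, choose_action, apply, perform_safe_shutdown) are hypothetical stand-ins, and the point is only that the stop condition keys off the raw button bit, not off any learned guess about what caused the press.

```python
# Minimal illustrative sketch, not the construction from the paper.
# All interface names below are hypothetical placeholders.

def run_agent(world, policy, max_steps=10**6):
    """Run the agent until the binary stop button reads 1, for any reason."""
    for t in range(max_steps):
        # The button is a one-bit sensory channel: 0 = not pressed, 1 = pressed.
        # The agent does not try to infer whether a human or a stray rock
        # pressed it; any press triggers the stop. That is the tradeoff above.
        if world.read_button_signal() == 1:
            world.perform_safe_shutdown()
            return t  # steps taken before stopping
        action = policy.choose_action(world.observe())
        world.apply(action)
    return max_steps
```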
Even with black-box learned world models, high levels of robustness
can be achieved by a regime of testing on-distribution and then
ensuring that the agent environment never goes off-distribution.
This is not going to happen for AGI, since we might not survive
testing on-distribution,
You are saying we might not survive AGI. This kind of observation
is almost a tautology, and therefore not very interesting.
What interests me is identifying techniques that can increase the
chance of survival. Drive down the risk. Admit that zero risk is
impossible and move on. There are plenty of risk-reduction techniques
out there if you look.
and how would we ensure that the environment
“stays on-distribution”? Is that like, pausing the world forever?
Here is a paperclip optimizer example of ‘ensuring that the agent
environment never goes off-distribution’. Your design goals are 1) to
keep your paperclip-optimizing AGI agent inside your paperclip
factory, and 2) to make sure that crazy people and/or radically
new technologies do not manage to get into the factory.
In other words, keep the environment inside the factory
on-distribution, and keep the agent in the factory.
Yes, you will never achieve these two goals with 100% certainty.
Your goal is to drive down the risk of failure on both of them.
For goal 1), there are many techniques that come to mind, and you probably
want to use them all:
1a) physical barriers that the agent cannot breach with the resources it has at hand,
1b) active agent oversight, where oversight has a kill switch and is not afraid to use it,
1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory, and
1d) automatic safety interlocks. (A rough code sketch of 1c and 1d follows below.)
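As a rough illustration only, and not a claim about how any real implementation would look, here is what measures 1c) and 1d) might amount to in code. The penalty constant, function names, and sensor interface are all hypothetical.

```python
# Rough illustrative sketch of measures 1c) and 1d) above; all names and
# values are hypothetical, not taken from the paper.

PENALTY_OUTSIDE_FACTORY = 1e6  # large constant; its exact size is a design choice


def shaped_reward(base_reward: float, agent_inside_factory: bool) -> float:
    """1c) A penalty term in the reward function that makes being outside
    the factory strongly unattractive to the agent."""
    if agent_inside_factory:
        return base_reward
    return base_reward - PENALTY_OUTSIDE_FACTORY


def safety_interlock(perimeter_breached: bool, cut_power) -> None:
    """1d) An automatic interlock: a hard-wired response that does not wait
    for human oversight to notice and react."""
    if perimeter_breached:
        cut_power()
```

Note that 1c) tries to change what the agent wants, while 1d) acts from outside the agent entirely; the point of using all of these techniques together is defense in depth, so that no single measure has to be perfect.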
I spent some time in my professional life doing safety engineering for physical systems and
cyber-physical systems, and this is how safety gets done there. Progress is
often expensive, often annoying to certain stakeholders, and not always
philosophically very satisfying. But you cannot have everything.