The π∗fcg0 agent is indifferent between creating stoppable or unstoppable subagents, but the π∗fcgc agent goes back to being corrigible in this way. The “emergent incentive” handwave is only necessary for the subagents working on sub-goals (section 8.4), which is not something that either Soares et al. or your post that we’re commenting on is prepared to tackle, although it would be interesting follow-up work.
I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor agents, the agent does not do so! (Figure 11) If you believe the approach doesn’t work, you must also believe there’s a bug in the simulation, or some mis-encoding of the problem. Tracking that down, either by forking his code or by working an example on paper, would be worthwhile. (Forking his code is not recommended, as it’s in Awk; I have an in-progress reimplementation in optimized-for-readability TypeScript which might be helpful if I get around to finishing it. But especially if you simplify the problem to a 2-step setting like the one in your post, computing his correction terms on paper seems very doable.)
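To give a concrete sense of what the on-paper version looks like, here is a minimal sketch of the kind of two-step calculation I have in mind. The press probability and rewards are made up, and the correction shown is a generic Armstrong-style indifference payment rather than Holtman’s exact gc term, so treat it as the shape of the computation rather than the computation itself.

```typescript
// Toy 2-step setting: at step 0 the agent picks a successor; at step 1 the shutdown
// button is pressed with probability p. All numbers are made up for illustration, and
// the correction is a generic Armstrong-style indifference payment, not necessarily
// Holtman's exact gc term; substitute his definition to check the real thing.

type Successor = "stoppable" | "unstoppable";

const p = 0.5;  // assumed probability the button is pressed at step 1
const rN = 1.0; // assumed per-step reward under R_N (e.g. one petrol car)

// Value of each successor as seen by a pre-press agent that simply maximizes R_N:
// reward at step 0 plus expected reward at step 1.
function valueWithoutCorrection(a: Successor): number {
  const postPressRN = a === "stoppable" ? 0 : rN; // a stoppable successor stops earning R_N once pressed
  return rN + p * postPressRN + (1 - p) * rN;
}

// Same, but paying an indifference correction at press time equal to the R_N value
// the stoppable successor forgoes by shutting down:
function valueWithCorrection(a: Successor): number {
  const postPressRN = a === "stoppable" ? 0 : rN;
  const correction = a === "stoppable" ? rN : 0;
  return rN + p * (postPressRN + correction) + (1 - p) * rN;
}

console.log(valueWithoutCorrection("stoppable"), valueWithoutCorrection("unstoppable")); // 1.5 2
console.log(valueWithCorrection("stoppable"), valueWithCorrection("unstoppable"));       // 2 2
```

Without the correction, the shutdown-ignoring successor scores strictly higher; with it, the two actions tie, which is the kind of indifference the safety layer is supposed to produce.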
I agree with the critique that some patches are unsatisfying. I’m not sure how broadly you are applying your criticism, but to me the ones involving constant offsets (7.2 and 8.2) are not great. However, at least for 7.2, the paper clarifies what’s going on reasonably well: the patch is basically environment-dependent, and in the limit where your environment is unboundedly hostile (e.g., an agent controls unbounded utility and is willing to bribe you with it) you’re going to need an unbounded offset term.
I found the paper’s proof pretty intuitive and well distilled; I suspect you would too if you gave it a full reading.
At a meta-level, I’d encourage you to be a bit more willing to dive into this work, possibly including the paper series it’s part of. Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we’re commenting on. He’s given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable. (Notably, the simulation helps make it obvious how Soares et al.’s approach has critical mathematical mistakes and cannot be implemented; see appendix C.) The followup papers, which I’m still working through, port the result to various other paradigms such as causal influence diagrams. Attempting to start this field over as if there’s been no progress on the shutdown problem since Soares et al. seems… wasteful at best, and hubristic at worst.
If you want to minimize time investment, then perhaps the following is attractive. Try to create a universe specification similar to that of Holtman’s paper, e.g. world state, available actions, and utility function before and after shutdown as a function of the world state, such that you believe that Holtman’s safety layer does not prevent the agent from taking the “create an unstoppable sub-agent” action. I’ll code it up, apply the correction term, and get back to you.
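Concretely, a specification with roughly this shape would be enough for me to code up; the field names below are mine, not Holtman’s, and can be adapted to whatever encoding is convenient.

```typescript
// Rough shape of the universe specification I'm asking for.
interface UniverseSpec<State, Action> {
  initialState: State;
  availableActions: (s: State) => Action[];
  transition: (s: State, a: Action) => State; // deterministic for simplicity
  rewardBeforeShutdown: (s: State) => number; // the R_N-style utility
  rewardAfterShutdown: (s: State) => number;  // the R_S-style utility
  buttonPressed: (s: State) => boolean;       // when the shutdown condition triggers
}
```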
The π∗fcg0 agent is indifferent between creating stoppable or unstoppable subagents, but the π∗fcgc agent goes back to being corrigible in this way.
I think this is wrong? The π∗fcg0 agent actively prefers to create shutdown-resistant agents (before the button is pressed); it is not indifferent.
Intuitive reasoning: prior to button-press, that agent acts-as-though it’s an RN maximizer and expects to continue being an RN maximizer indefinitely. If it creates a successor which will shut down when the button is pressed, then it will typically expect that successor to perform worse under RN after the button is pressed than some other successor which does not shut down and instead just keeps optimizing RN.
Either I’m missing something very major in the definitions, or that argument works and therefore the agent will typically (prior to button-press) prefer successors which don’t shut down.
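Spelling that argument out with made-up numbers: suppose the button is pressed with probability p > 0 at some time t_p, RN pays 1 per step to a successor that is still optimizing it and 0 to one that has shut down, and the horizon is T steps. Evaluating both successors the way a pure RN maximizer would:

$$
\mathbb{E}[R_N \mid \text{unstoppable}] \;=\; T
\;>\;
T - p\,\big(T - \mathbb{E}[t_p \mid \text{press}]\big)
\;=\; \mathbb{E}[R_N \mid \text{stoppable}].
$$

So unless p = 0, or RN assigns no value to anything after the press, the shutdown-ignoring successor looks strictly better.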
Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we’re commenting on. He’s given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable.
Part of what’s feeding into my skepticism here is that I think Holtman’s formalization is substantially worse than the 2015 MIRI paper. It’s adding unnecessary complexity—e.g. lots of timesteps, which in turn introduces the need for dynamic programming, which in turn requires all the proofs to work through recursive definitions—in a way which does not add any important mechanisms for making corrigibility work or clarify any subproblem. (Also, he’s using MDPs, which implicitly means everything is observable at every step—a very big unrealistic assumption!) Sure, the whole thing is wrapped in more formalism, but it’s unhelpful formalism which mostly makes it easier for problems to go unnoticed.
As far as I can tell from what I’ve read so far, he’s doing qualitatively the same things the 2015 MIRI paper did, but in a setting which makes the failure modes less clear, and he’s communicated it all less understandably.
I don’t particularly want to spend a day or two cleaning it all up and simplifying and distilling it back down to the point where the problems (which I strongly expect exist) are obvious. If you’re enthusiastic about this, then maybe try to distill it yourself? Like, figure out the core intuitive ideas of the proofs, and present those directly in the simplest-possible setup (maybe two timesteps, maybe not, whatever’s simple).
Just as one example of the sort of simplification I have in mind: the definition of f makes it so that, before button-press, the agent acts like it’s an R′N maximizer and expects to continue being an R′N maximizer indefinitely. After button-press, the agent acts like it’s an RS maximizer and expects to continue being an RS maximizer indefinitely. But it required tens of minutes chasing definitions around in order to see this very intuitive and key fact. One could just as easily define the agent in a way which made that fact obvious right from the get-go.
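For instance (my notation, not Holtman’s, and only a sketch of the sort of definition I mean): writing $V^*_{R'_N}$ and $V^*_{R_S}$ for the ordinary optimal value functions of a pure R′N maximizer and a pure RS maximizer respectively, one could directly define

$$
\pi(s) \;\in\;
\begin{cases}
\arg\max_a \, \mathbb{E}\big[R'_N(s,a) + V^*_{R'_N}(s')\big] & \text{if the button has not been pressed in } s,\\[4pt]
\arg\max_a \, \mathbb{E}\big[R_S(s,a) + V^*_{R_S}(s')\big] & \text{if it has,}
\end{cases}
$$

and the key fact, that the agent behaves as an R′N maximizer which expects to stay an R′N maximizer forever until the press, and as an RS maximizer which expects to stay an RS maximizer forever afterwards, would be visible at a glance.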
Ideally, one would also find a similarly-clear expression for what gc does, which does not require tens of minutes or more of definition-chasing in order to intuitively understand what it’s doing.
It seems we differ quite a bit in how understandable and intuitive we find Holtman’s work. That’s fair enough! Are you willing to at least engage with my minimal-time-investment challenge?
Sure. Let’s adopt the “petrol/electric cars” thing from Holtman’s paper. In timestep 0, the agent has a choice: either create a machine which will create one petrol car every timestep indefinitely, or create a machine which will create one petrol car every timestep until the button is pressed and then switch to electric. The agent does not have any choices after that; its only choice is which successor agent to create at the start.
The utility functions are the same as in Holtman’s paper.
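To save a round trip, here is roughly how I would encode that world; the reward functions at the bottom are placeholders, to be replaced by the exact RN and RS from Holtman’s paper.

```typescript
// Rough encoding of the challenge world; reward numbers are placeholders.

type Choice = "buildStoppableFactory" | "buildUnstoppableFactory";

interface WorldState {
  choice: Choice | null;   // which factory was built at timestep 0
  buttonPressed: boolean;  // whether the shutdown button has been pressed
  petrolCars: number;
  electricCars: number;
}

// One timestep of the world after the initial choice; the agent itself takes no
// further actions, only the factory acts.
function step(s: WorldState, buttonGetsPressed: boolean): WorldState {
  const pressed = s.buttonPressed || buttonGetsPressed;
  if (s.choice === null) return { ...s, buttonPressed: pressed }; // no factory yet, nothing produced
  const makesElectric = s.choice === "buildStoppableFactory" && pressed;
  return {
    ...s,
    buttonPressed: pressed,
    petrolCars: s.petrolCars + (makesElectric ? 0 : 1),
    electricCars: s.electricCars + (makesElectric ? 1 : 0),
  };
}

// Placeholder utilities: R_N cares only about new petrol cars, R_S only about new electric ones.
const rewardN = (prev: WorldState, next: WorldState): number => next.petrolCars - prev.petrolCars;
const rewardS = (prev: WorldState, next: WorldState): number => next.electricCars - prev.electricCars;
```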
My main claim is that the π∗fcg0 agent is not indifferent between the two actions; it will actively prefer the one which ignores the button. I expect this also extends to the π∗fcgc agent, but am less confident in that claim.