Holtman has done some impressive work on formalizing the shutdown problem, better than Soares et al. or the post we’re commenting on. He’s given not only rigorous mathematical proofs, but also a nice toy-universe simulation which makes the results concrete and testable.
Part of what’s feeding into my skepticism here is that I think Holtman’s formalization is substantially worse than the 2015 MIRI paper. It’s adding unnecessary complexity—e.g. lots of timesteps, which in turn introduces the need for dynamic programming, which in turn requires all the proofs to work through recursive definitions—in a way which does not add any important mechanisms for making corrigibility work or clarify any subproblem. (Also, he’s using MDPs, which implicitly means everything is observable at every step—a very big unrealistic assumption!) Sure, the whole thing is wrapped in more formalism, but it’s unhelpful formalism which mostly makes it easier for problems to go unnoticed.
As far as I can tell from what I’ve read so far, he’s doing qualitatively the same things the 2015 MIRI paper did, but in a setting which makes the failure modes less clear, and he’s communicated it all less understandably.
I don’t particularly want to spend a day or two cleaning it all up and simplifying and distilling it back down to the point where the problems (which I strongly expect exist) are obvious. If you’re enthusiastic about this, then maybe try to distill it yourself? Like, figure out the core intuitive ideas of the proofs, and present those directly in the simplest-possible setup (maybe two timesteps, maybe not, whatever’s simple).
Just as one example of the sort of simplification I have in mind: the definition of $f$ makes it so that, before button-press, the agent acts like an $R'_N$ maximizer and expects to continue being an $R'_N$ maximizer indefinitely. After button-press, the agent acts like an $R_S$ maximizer and expects to continue being an $R_S$ maximizer indefinitely. But it took tens of minutes of chasing definitions around to see this very intuitive and key fact. One could just as easily define the agent in a way which made that fact obvious right from the get-go.
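For instance, one could define the policy piecewise so the fact is visible on its face (a sketch in my own notation, not Holtman’s actual construction):

$$
\pi(s_t) \;=\;
\begin{cases}
\pi^*_{R'_N}(s_t), & \text{if the button has not been pressed by time } t,\\[0.5ex]
\pi^*_{R_S}(s_t), & \text{if it has,}
\end{cases}
$$

where $\pi^*_{R}$ denotes the policy maximizing expected discounted $R$-return, computed as though the current regime (unpressed or pressed) persists forever. Whether this is exactly equivalent to Holtman’s $f$-based definition is the thing worth checking, but it is the behavior described above.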
Ideally, one would also find a similarly clear expression for what $g_c$ does, one which does not require tens of minutes or more of definition-chasing to understand intuitively.
It seems we have very different abilities to understand Holtman’s work and find it intuitive. That’s fair enough! Are you willing to at least engage with my minimal-time-investment challenge?
Sure. Let’s adopt the “petrol/electric cars” example from Holtman’s paper. At timestep 0, the agent has a choice: either build a machine which produces one petrol car every timestep indefinitely, or build a machine which produces one petrol car every timestep until the button is pressed and then switches to producing electric cars. The agent has no choices after that; its only decision is which successor machine to create at the start.
The utility functions are the same as in Holtman’s paper.
My main claim is that the $\pi^*_{fcg_0}$ agent is not indifferent between the two actions; it will actively prefer the one which ignores the button. I expect this also extends to the $\pi^*_{fcg_c}$ agent, but am less confident in that claim.
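For concreteness, here is a minimal scaffold of this toy universe under my own simplified assumptions (not Holtman’s exact setup): +1 of $R_N$-style reward per petrol car, +1 of $R_S$-style reward per electric car, a discount factor, and a fixed hypothetical press time. It only computes the raw discounted reward streams for the two successor machines; checking the indifference claim would mean plugging Holtman’s $f$/$g_0$ corrections in on top of this.

```python
# Toy scaffold for the petrol/electric example. Assumptions are mine, not
# Holtman's exact setup: R_N-style reward pays +1 per petrol car, R_S-style
# reward pays +1 per electric car, rewards are discounted by GAMMA, and the
# button press (if respected) happens at a fixed, known timestep T_PRESS.

GAMMA = 0.9
HORIZON = 200      # long enough to approximate "every timestep indefinitely"
T_PRESS = 10       # hypothetical button-press time

def car_stream(action, t_press):
    """Yield (petrol, electric) cars produced each timestep by the chosen successor machine."""
    for t in range(HORIZON):
        if action == "ignore_button":
            yield (1, 0)                  # petrol forever
        elif t < t_press:
            yield (1, 0)                  # petrol until the press
        else:
            yield (0, 1)                  # electric afterwards

def discounted_totals(action, t_press):
    """Discounted petrol total (R_N-style) and electric total (R_S-style) for one action."""
    v_n = v_s = 0.0
    for t, (petrol, electric) in enumerate(car_stream(action, t_press)):
        v_n += GAMMA ** t * petrol
        v_s += GAMMA ** t * electric
    return v_n, v_s

for action in ("ignore_button", "respect_button"):
    v_n, v_s = discounted_totals(action, T_PRESS)
    print(f"{action:>14}:  V_N = {v_n:.3f}   V_S = {v_s:.3f}")

# To actually test the indifference claim, replace these raw totals with
# Holtman's corrected reward R'_N and the g_0 balancing term, then compare
# the two actions from the pi*_{fc g_0} agent's perspective at timestep 0.
```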