My very general concern is that strategies that maximize AUP reward might be very… let’s say creative, and your claims are mostly relying on intuitive arguments for why those strategies won’t be bad for humans.
My argument hinges on CCC being true. If CCC is true, and if we can actually penalize the agent for accumulating power, then an agent that doesn’t want to accumulate power isn’t incentivized to screw us over. I feel like this is a pretty good intuitive argument, and it’s one I dedicated the first two-thirds of the sequence to explaining. You’re right that it’s intuitive, of course.
I guess our broader disagreement may be “what would an actual solution to impact measurement have going for it at this moment in time?”. It’s not clear that I’d expect to have formal arguments to that effect yet / I don’t know how to meet this demand for rigor.
[ETA: I should note that some of my most fruitful work over the last year came from formalizing some of my claims. People were skeptical that slowly decreasing the penalty aggressiveness would work, so I hashed out the math in How Low Should Fruit Hang Before We Pick It?. People were uneasy that the original AUP design relied on instrumental convergence being a thing (eq. 5 doesn’t make that assumption) when maybe it actually isn’t. So I formalized instrumental convergence in Seeking Power is Instrumentally Convergent in MDPs and proved conditions under which it exists, at least to some extent.
There’s probably more work to be done like this.]
I don’t really buy the claim that if you’ve been able to patch each specific problem, we’ll soon reach a version with no problems—the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.
The claim also rests on “we know conceptually how to solve impact measurement / what we want to implement, and it’s a simple and natural idea, so it’s plausible there’s a clean implementation of it”. I think learning “no, there isn’t a clean way to penalize the agent for becoming more able to achieve its own goal” would be quite surprising, but not implausible – I in fact think there’s a significant chance Stuart is right. More on that next post.
Also, you could argue against any approach to AI alignment by pointing out that there are still things to improve and fix, or that there were problems pointed out in the past which were fixed, but now people have found a few more problems. The thing that makes me think the patches might not be endless here is that, as I’ve argued earlier, I think AUP is conceptually correct.
This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.
It all depends on whether we can get a buffer between catastrophes and reasonable plans here (reasonable plans show up for much less aggressive settings of λ), and I think we can. Now, this particular problem (with huge reward) might not show up because we can bound the reward to [0,1], and I generally think there exist reasonable plans where the agent gets at least 20% or so of its maximal return (suppose it thinks there’s a 40% chance we let it get 95% of its maximal per-timestep reward each timestep in exchange for it doing what we want).
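To spell out the arithmetic behind that parenthetical (my numbers are purely illustrative): if the agent assigns a 40% chance to us letting it collect 95% of its maximal per-timestep reward each timestep, and pessimistically expects roughly nothing otherwise, its expected fraction of maximal return is at least

$$0.4 \times 0.95 = 0.38 > 0.2,$$

which clears the 20% figure with room to spare.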
[ETA: actually, if the “reasonable” reward is really, really low in expectation, it’s not clear what happens. This might happen if catastrophe befalls us by default.]
You’re right we should inspect the equation for weird incentives, but to a limited extent, this is also something we can test experimentally. We don’t necessarily have to rely on intuition in all cases.
The hope is we can get to a formula that’s simple enough that all of its incentives are thoroughly understood. I think you’ll agree eq. 5 is far better in this respect than the original AUP formulation!
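To make “test experimentally” a little more concrete, here is a minimal sketch of the kind of penalized reward you could drop into a gridworld setup and probe for weird incentives. This is an illustrative AUP-style penalty in the spirit of eq. 5, not the exact formula: the function name, the max-with-zero, and the scaling by the no-op value are my choices for the sketch, and I’m assuming it’s the agent’s own attainable value that gets penalized (per “becoming more able to achieve its own goal”).

```python
# Illustrative sketch only -- an AUP-style penalty, not the exact eq. 5.
# The idea: subtract a scaled measure of how much the action increases the
# agent's ability to achieve its own goal, relative to doing nothing.

def aup_reward(reward: float, q_action: float, q_noop: float, lam: float) -> float:
    """Penalized reward for taking one action in one state.

    reward:   primary reward R(s, a), assumed bounded in [0, 1]
    q_action: attainable value Q(s, a) for the agent's own goal after acting
    q_noop:   attainable value Q(s, noop) if the agent had done nothing
    lam:      penalty aggressiveness (the lambda in the text)
    """
    power_gain = max(q_action - q_noop, 0.0)        # only penalize *gaining* ability
    penalty = lam * power_gain / max(q_noop, 1e-8)  # scale against the no-op baseline
    return reward - penalty


# A mild capability gain costs little; a power grab swamps the primary reward.
print(aup_reward(reward=1.0, q_action=10.5, q_noop=10.0, lam=1.0))   # 0.95
print(aup_reward(reward=1.0, q_action=100.0, q_noop=10.0, lam=1.0))  # -8.0
```

Sweeping λ in a setup like this is exactly the kind of check on the buffer between catastrophes and reasonable plans that I mentioned above.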