[Pasting in some of my responses from elsewhere in case they serve as a useful seed for discussion]
I would say that 3 is one of the core problems in reinforcement learning, but doesn’t especially depend on 1 and 2.
To me, these 3 points together imply that, in principle, it’s impossible to design an AI that doesn’t game our goals (because we can’t know with perfect precision everything that matters to achieve these goals).
How about a thermostat? It’s an awfully simple AI (and not usually created via RL, though it could be), but I can specify my goals in a way that won’t be gamed, without having to know all the future temperatures of the room, because I can cleanly describe the things I care about.
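For concreteness, here's a minimal sketch (in Python, with arbitrary setpoint and deadband values, so it's just an illustration rather than a claim about how real thermostats are built) of the kind of cleanly specified objective I mean:

```python
# Minimal thermostat sketch: the goal is fully specified up front
# ("keep the room near SETPOINT"), and the controller only needs the
# current reading, not a forecast of future temperatures.
# SETPOINT and DEADBAND are arbitrary illustrative values.

SETPOINT = 20.0   # desired temperature in degrees C
DEADBAND = 0.5    # hysteresis band to avoid rapid on/off switching

def heater_should_be_on(current_temp: float, heater_on: bool) -> bool:
    """Decide the heater state for the next step from the current reading."""
    if current_temp < SETPOINT - DEADBAND:
        return True        # too cold: switch (or keep) the heater on
    if current_temp > SETPOINT + DEADBAND:
        return False       # too warm: switch (or keep) the heater off
    return heater_on       # within the band: keep the current state

# Example: a cold room with the heater currently off -> turn it on.
print(heater_should_be_on(18.2, heater_on=False))  # True
```

The objective here refers to exactly the one variable I care about, which is why there's nothing for the system to game.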
To be clear, I definitely think that as loss functions become more complex, and become less perfect proxies for what you really want, you absolutely start running into problems with specification gaming and goal misgeneralization. My goal with the thermostat example is just to point out that that isn’t (as far as I can see) because of a fundamental limit in how precisely you can predict the future.
Alejandro: Fair point! I think I should've specified that I meant this line of reasoning mainly as a counterargument concerning advanced general-purpose AIs.
I suspect the same argument holds there, that the problems with 3 aren't based on 1 and 2, although it's harder to demonstrate with more advanced systems.

Here's maybe one view on it: suppose for a moment that we could perfectly forecast the behavior of physical systems into the future (with the caveat that we couldn't use that to perfectly predict an AI's behavior, since otherwise we've assumed the problem away). I claim that we would still have the same kinds of risks from advanced RL-based AI that we have now, because we don't have a reliable way to clearly specify our complete preferences and have the AI correctly internalize them.

(Unless the caveated point is exactly what you're trying to get at, but I don't think anyone out there would say that advanced AI is safe because we can perfectly predict the physical systems that include them, since in practice we can't even remotely do that.)
[...] without having to know all the future temperatures of the room, because I can cleanly describe the things I care about. [...] My goal with the thermostat example is just to point out that that isn't (as far as I can see) because of a fundamental limit in how precisely you can predict the future.
I think there was a gap in my reasoning; let me put it this way. As you said, only when you can cleanly describe the things you care about can you design a system that doesn't game your goals (the thermostat). However, my reasoning suggests that one way you may fail to cleanly describe the things you care about (the predictive variables) is the inaccuracy attribution degeneracy that I mention in the post. In other words, you don't (and possibly can't) know whether the variable you're interested in predicting is being forecasted inaccurately because of relevant things left unspecified (the most common case) or because of misspecified initial conditions of all the relevant variables.
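To illustrate the kind of degeneracy I have in mind, here's a toy sketch (the logistic map just stands in for any chaotic system, and the specific numbers are arbitrary): the same variable can be forecasted badly either because its initial condition was slightly off or because something relevant was left out of the dynamics, and from the errors alone you can't tell which.

```python
import numpy as np

# Toy sketch of the attribution degeneracy: the "true" system is a logistic
# map. We forecast it twice, once with a tiny error in the initial condition
# (but correct dynamics), and once with a tiny error in the dynamics
# (standing in for a relevant factor that was misspecified or left out).
# Both forecasts diverge from the truth, and the errors alone don't say
# which kind of misspecification caused them. All numbers are arbitrary.

def logistic_trajectory(x0: float, r: float, steps: int) -> np.ndarray:
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return np.array(xs)

steps = 30
truth          = logistic_trajectory(0.400000, r=3.9,    steps=steps)
wrong_initial  = logistic_trajectory(0.400001, r=3.9,    steps=steps)  # initial condition slightly off
wrong_dynamics = logistic_trajectory(0.400000, r=3.9001, steps=steps)  # dynamics slightly off

print("error, wrong initial condition:", np.abs(truth - wrong_initial)[-3:])
print("error, wrong dynamics:         ", np.abs(truth - wrong_dynamics)[-3:])
```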
I claim that we would still have the same kinds of risks from advanced RL-based AI that we have now, because we don’t have a reliable way to clearly specify our complete preferences and have the AI correctly internalize them.
I partially agree: I’d say that, in that hypothetical case, you’ve solved one layer of complexity and this other one you’re mentioning still remains! I don’t claim that solving the issues raised by chaotic unpredictability solve goal gaming, but I do claim that without solving the former you cannot solve the latter (i.e., solving chaos is a necessary but not sufficient condition).