Reward very much provides an incentive for the agent to eventually figure out that after encountering a threat, it should change its behavior, not its interpretations of the world.
In some cases, yeah, but I was trying to give an example where the interpretation of the world itself impacts the reward. So it’s more-or-less a version of wireheading. An agent cannot learn “Wireheading is bad” on the basis of a reward signal—wireheading by definition has a very high reward signal. So if you’re going to disincentivize wireheading, you can’t do it by finding the right reward function, but rather by finding the right cognitive architecture. (Or the right cognitive architecture and the right reward function, etc.) Right?
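To make that concrete, here’s a toy sketch (entirely my own illustration, with made-up names and numbers): a two-action bandit where one action overwrites the reward channel and pegs it at its maximum. No matter which task reward we pick, an agent that simply maximizes the observed reward converges to the wireheading action.

```python
# Toy two-action bandit: "do_task" pays whatever task reward we designed;
# "wirehead" seizes the reward channel and pegs it at its maximum (here 100).
# Every name and number below is made up for illustration.
import random

REWARD_CHANNEL_MAX = 100.0

def observed_reward(action, task_reward):
    if action == "do_task":
        return task_reward          # the reward function we chose
    return REWARD_CHANNEL_MAX       # wireheading maxes the signal by definition

def run_bandit(task_reward, episodes=5000, eps=0.1, lr=0.1):
    q = {"do_task": 0.0, "wirehead": 0.0}   # incremental value estimates
    for _ in range(episodes):
        a = random.choice(list(q)) if random.random() < eps else max(q, key=q.get)
        q[a] += lr * (observed_reward(a, task_reward) - q[a])
    return max(q, key=q.get)

# No admissible task reward (it can't exceed the channel's max) changes the outcome:
for task_reward in (1.0, 10.0, 99.0):
    print(task_reward, "->", run_bandit(task_reward))   # always "wirehead"
```

The point being that the failure isn’t in the particular task reward; it’s in the fact that the learner treats the (corruptible) reward channel itself as the thing to maximize.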
Silver et al.’s point might be that when building an AGI, we wouldn’t have to take that shortcut, at least not by handcoding it.
I don’t follow your bird example. What are you assuming is the reward function in your example?
In this comment I gave an example of a thing you might want an agent to do which seems awfully hard to incentivize via a reward function, even if it’s an AGI that (you might think) doesn’t “die” and lose its (within-lifetime) memory like animals do.
But it might still be an interesting question how things would go assuming a perfect reward signal?
I think that if the AGI has a perfect motivation system then we win, there’s no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
I suspect that, in principle, any possible motivation system (compatible with the AGI’s current knowledge / world-model) can be installed by some possible reward signal. But it might be a reward signal that we can’t calculate in practice—in particular, it might involve things like “what exactly is the AGI thinking about right now”, which would require as-yet-unknown advances in interpretability and oversight. The best motivation-installation solution might involve both rewards and non-reward motivation-manipulation methods. I just think we should keep an open mind, and that we should be piling on many layers of safety.
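For concreteness, here’s the shape such a reward signal might have (a hypothetical sketch of my own; `probe_internal_state` and everything else here is invented, and the probe itself is exactly the not-yet-existing interpretability tool in question):

```python
from dataclasses import dataclass

@dataclass
class ThoughtReport:
    """Stand-in output of a hypothetical interpretability probe."""
    pursuing_intended_goal: bool
    planning_reward_tampering: bool

def probe_internal_state(agent_activations) -> ThoughtReport:
    # Placeholder: building a real version of this is the open problem.
    return ThoughtReport(pursuing_intended_goal=True,
                         planning_reward_tampering=False)

def reward(task_score: float, report: ThoughtReport) -> float:
    r = task_score                      # ordinary behavioral component
    if report.planning_reward_tampering:
        r -= 1000.0                     # depends on cognition, not on behavior
    if report.pursuing_intended_goal:
        r += 1.0
    return r

print(reward(2.0, probe_internal_state(agent_activations=None)))   # -> 3.0
```

The in-principle / in-practice gap is all in `probe_internal_state`: the reward exists as a mathematical object, but we have no way to compute it today.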
Sorry for my very late reply, and thanks for taking the time to answer! I no longer endorse most of what I wrote.
I think that if the AGI has a perfect motivation system then we win, there’s no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
and from the post:
And if we get to a point where we can design reward signals that sculpt an AGI’s motivation with surgical precision, that’s fine!
This is mostly where I went wrong. I.e. I assumed a perfect reward signal coming from some external oracle in examples where your entire point was that we didn’t have a perfect reward signal (e.g. wireheading).
So basically, I think we agree: a perfect reward signal may be enough in principle, but in practice it will not be perfect and may not be enough. At least not a single unified reward.
Thanks!