Interesting post!

Disclaimer: I have only read the abstract of the “Reward is enough” paper. Also, I don’t have much experience in AI safety, but I’m considering changing that.
Here are a couple of my thoughts.
Your examples haven’t entirely convinced me that reward isn’t enough. Take the bird. As I see it, something like the following is going on:
Evolution chose to take a shortcut: maybe a bird with a very large brain and a lot of time would eventually figure out that singing is a smart thing to do if it received reward for singing well. But evolution, being a ruthless optimizer with many previous generations of experience, shaped two separate rewards in the way you described. Silver et al.’s point might be that when building an AGI, we wouldn’t have to take that shortcut, at least not by handcoding it.

Assume we have an agent that is released into the world and is trying to optimize reward. It starts out from scratch, knowing nothing, but with a lot of time and the computational capacity to learn a lot.

Such an agent has an incentive to explore. So it tries out singing for two minutes. It notes that in the first minute it got 1 unit of reward and in the second 2 (it got better!). However, 3 units all in all is very little in this world, so maybe it moves on.

But as it gains more experience in the world, it notices that patterns like these can often be extrapolated. Maybe, with its two minutes of experience, if it sang for a third minute, it would get 3 units of reward? It tries, and yes indeed. Now it has an incentive to see how far it can take this. It knows the 4 units it expects from the next try will not be worth its time on their own, but the information about whether it could eventually get a million units per minute this way is very much worth the cost!
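To make that concrete, here is a toy sketch of the extrapolate-and-explore reasoning I have in mind. All of it is made up for illustration: the numbers, the naive linear extrapolation, and the `optimism` bonus standing in for the value of information about how far the reward trend might go.

```python
# Toy sketch: an agent extrapolates the per-minute reward trend from singing
# and keeps exploring as long as the extrapolated upside (plus an information
# bonus) beats the opportunity cost. All quantities are illustrative.

def extrapolated_reward(history, minutes_ahead=1):
    """Naively extrapolate the per-minute reward trend from the last two observations."""
    if len(history) < 2:
        return history[-1] if history else 0.0
    increment = history[-1] - history[-2]
    return history[-1] + increment * minutes_ahead

def worth_exploring(history, opportunity_cost_per_minute, optimism=10.0):
    """Keep singing if the extrapolated reward plus an information bonus beats the alternative."""
    return extrapolated_reward(history) + optimism > opportunity_cost_per_minute

history = [1.0, 2.0]  # reward in minute 1 and minute 2 of singing
while worth_exploring(history, opportunity_cost_per_minute=8.0) and len(history) < 10:
    # Pretend the world keeps matching the extrapolation, so another minute of singing
    # keeps looking worthwhile even though 3 or 4 units alone would not be.
    history.append(extrapolated_reward(history))

print(history)  # [1.0, 2.0, 3.0, ..., 10.0]
```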
Something kind of analogous should be true for the spider story.

Reward very much provides an incentive for the agent to eventually figure out that after encountering a threat, it should change its behavior, not its interpretations of the world. At the beginning it might get this wrong, but it’s unfair to compare it to a human who has had this info “handcoded in” by evolution.

If our algorithms don’t allow the agent to learn to update differently in the future because past updates were unhelpful (I don’t know, pointers welcome!), then that’s not a problem with reward, it’s a problem with our algorithms!

Maybe this is what you alluded to in your very last paragraph, where you speculated that they might just mean a more sophisticated RL algorithm?
Concerning the deceptive AGI etc., I agree that problems emerge when we don’t get the reward signal exactly right, and that it’s probably not a safe assumption that we will. But it might still be an interesting question how things would go assuming a perfect reward signal?

My impression is that their answer is “it would basically work”, while yours is something like “but we really shouldn’t assume that, and if we don’t, then it’s probably better to have separate reward signals etc.”. Given the bird example, I assume you also don’t agree that things would work out fine even if we did have the best possible reward signal?
Also, I just want to mention that I agree the distinction between within-lifetime RL and intergenerational RL is useful, certainly in the case of biology and probably in machine learning too.
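In case it helps, here is a very schematic toy picture of how I think of those two loops; everything in it (the one-parameter “behavior”, the hill-climbing, the quadratic rewards) is a placeholder rather than a claim about either biology or ML.

```python
# Schematic sketch: an outer "intergenerational" loop that shapes the reward
# function itself (as evolution does), wrapped around an inner "within-lifetime"
# loop that learns behavior against whatever reward function it was handed.

import random

def within_lifetime_learning(reward_fn, steps=200):
    """Inner loop: hill-climb one behavioral parameter against a fixed reward function."""
    behavior = 0.0
    for _ in range(steps):
        candidate = behavior + random.uniform(-0.1, 0.1)
        if reward_fn(candidate) > reward_fn(behavior):
            behavior = candidate
    return behavior

def intergenerational_loop(true_fitness, generations=30):
    """Outer loop: mutate the reward function's target and keep the variant
    whose learners end up doing better on true fitness."""
    reward_target = 0.0
    for _ in range(generations):
        candidate_target = reward_target + random.uniform(-0.5, 0.5)
        current_learner = within_lifetime_learning(lambda b: -(b - reward_target) ** 2)
        candidate_learner = within_lifetime_learning(lambda b: -(b - candidate_target) ** 2)
        if true_fitness(candidate_learner) > true_fitness(current_learner):
            reward_target = candidate_target
    return reward_target

# "True fitness" peaks at behavior 3.0; over generations the reward function's target
# should drift toward 3.0, even though no single lifetime optimizes for it directly.
print(intergenerational_loop(lambda b: -(b - 3.0) ** 2))
```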
Thanks!

Reward very much provides an incentive for the agent to eventually figure out that after encountering a threat, it should change its behavior, not its interpretations of the world.
In some cases yeah, but I was trying to give an example where the interpretation of the world itself impacts the reward. So it’s more-or-less a version of wireheading. An agent cannot learn “Wireheading is bad” on the basis of a reward signal—wireheading by definition has a very high reward signal. So if you’re going to disincentivize wireheading, you can’t do it by finding the right reward function, but rather by finding the right cognitive architecture. (Or the right cognitive architecture and the right reward function, etc.) Right?
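To put the same point in toy-code form: if “tamper with the reward sensor” is among the agent’s options, it maximizes the observed reward signal no matter how the task reward is defined, so the disincentive has to come from somewhere other than the reward function. (The specific actions and values below are obviously made up.)

```python
# Toy sketch of the wireheading problem: no choice of task reward function can make
# sensor-tampering look bad *as measured by the reward signal the agent actually sees*.

def observed_reward(action, true_task_reward):
    """What the learning algorithm actually sees."""
    if action == "tamper_with_sensor":
        return float("inf")  # wireheading: the observed signal is maxed out by construction
    return true_task_reward(action)

def task_reward(action):
    # The designer is free to choose any task reward they like...
    return {"do_the_task": 1.0, "sit_idle": 0.0}.get(action, 0.0)

actions = ["do_the_task", "sit_idle", "tamper_with_sensor"]
best = max(actions, key=lambda a: observed_reward(a, task_reward))
print(best)  # "tamper_with_sensor" wins regardless of how task_reward is defined
```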
Silver et al.’s point might be that when building an AGI, we wouldn’t have to take that shortcut, at least not by handcoding it.
I don’t follow your bird example. What are you assuming is the reward function in your example?
In this comment I gave an example of a thing you might want an agent to do which seems awfully hard to incentivize via a reward function, even if it’s an AGI that (you might think) doesn’t “die” and lose its (within-lifetime) memory like animals do.
But it might still be an interesting question how things would go assuming a perfect reward signal?
I think that if the AGI has a perfect motivation system then we win, there’s no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
I suspect that, in principle, any possible motivation system (compatible with the AGI’s current knowledge / world-model) can be installed by some possible reward signal. But it might be a reward signal that we can’t calculate in practice—in particular, it might involve things like “what exactly is the AGI thinking about right now” which require as-yet-unknown advances in interpretability and oversight. The best motivation-installation solution might involve both rewards and non-reward motivation-manipulation methods, maybe. I just think we should keep an open mind. And that we should be piling on many layers of safety.
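Here is a very rough sketch of what a reward signal like that might look like, with the interpretability part faked by a trivial stub, since that is exactly the part we don’t yet know how to build. Nothing in it is a real proposal.

```python
# Rough sketch: a reward signal that depends partly on the agent's internal state
# ("what it is thinking"), not just on outcomes in the world. The interpretability
# tool is faked here; building a real one is an open problem.

def inspect_thoughts(internal_state: dict) -> str:
    """Stub standing in for a (currently unavailable) interpretability tool."""
    return internal_state.get("current_plan_summary", "unknown")

def reward(outcome_score: float, internal_state: dict) -> float:
    thoughts = inspect_thoughts(internal_state)
    # Penalize plans we can recognize as deceptive, which is only possible if we can read them.
    penalty = 100.0 if "deceive" in thoughts else 0.0
    return outcome_score - penalty

# The same external outcome gets very different reward depending on what the agent
# was (legibly) thinking while producing it.
print(reward(10.0, {"current_plan_summary": "finish the task"}))       # 10.0
print(reward(10.0, {"current_plan_summary": "deceive the overseer"}))  # -90.0
```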
Sorry for my very late reply!

Thanks for taking the time to answer; I no longer endorse most of what I wrote.
I think that if the AGI has a perfect motivation system then we win, there’s no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
and from the post:
And if we get to a point where we can design reward signals that sculpt an AGI’s motivation with surgical precision, that’s fine!
This is mostly where I went wrong: I assumed a perfect reward signal coming from some external oracle in examples where your entire point was that we didn’t have a perfect reward signal (e.g. wireheading).
So basically, I think we agree: a perfect reward signal may be enough in principle, but in practice it will not be perfect and may not be enough. At least not a single unified reward.