> The value function might be different from the reward function.
Surely this isn’t relevant! We don’t by any means want the value function to equal the reward function. What we want (at least in standard RL) is for the value function to be the solution to the dynamic programming problem set up by the reward function and world model (or, more idealistically, the reward function and the actual world).
Hmm. I guess I have this ambiguous thing where I’m not specifying whether the value function is “valuing” world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!)
Like, here’s one extreme: imagine that the “planner” does arbitrarily-long-horizon rollouts of possible action sequences and their consequences in the world, and then the “value function” is looking at that whole future rollout and somehow encoding how good it is, and then you can choose the best rollout. In this case we do want the value function to converge to be (for all intents and purposes) a clone of the reward function.
On the opposite extreme, when you’re not doing rollouts at all, and instead the value function is judging particular states or actions, then I guess it should be less like the reward function and more like “expected upcoming reward assuming the current policy”, which I think is what you’re saying.
Incidentally, I think the brain does both. Like, maybe I’m putting on my shoes because I know that this is the first step of a plan where I’ll go to the candy store and buy candy and eat it. I’m motivated to put on my shoes by the image in my head where, a mere 10 minutes from now, I’ll be back at home eating yummy candy. In this case, the value function is hopefully approximating the reward function, and specifically approximating what the reward function will do at the moment where I will eat candy. But maybe eventually, after many such trips to the candy store, it becomes an ingrained habit. And then I’m motivated to put on my shoes because my brain has cached the idea that good things are going to happen as a result—i.e., I’m motivated even if I don’t explicitly visualize myself eating candy soon.
I guess I spend more time thinking about the former (the value function is evaluating the eventual consequences of a plan) than the latter (the value function is tracking the value of immediate world-states and actions), because the former is the component that presents most of the x-risk. So that’s what was in my head when I wrote that.
(It’s not either/or; I think there’s a continuum between those two poles. Like I can consequentialist-plan to get into a future state that has a high cached value but no immediate reward.)
As for prosaic RL systems, they’re set up in different ways I guess, and I’m not an expert on the literature. In Human Compatible, if I recall, Stuart Russell said that he thinks the ability to do flexible hierarchical consequentialist planning is something that prosaic AI doesn’t have yet, but that future AGI will need. If that’s right, then maybe this is an area where I should expect AGI to be different from prosaic AI, and where I shouldn’t get overly worried about being insufficiently prosaic. I dunno :-P
Well anyway, your point is well taken. Maybe I’ll change it to “the value function might be misaligned with the reward function”, or “incompatible”, or something...
> Hmm. I guess I have this ambiguous thing where I’m not specifying whether the value function is “valuing” world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!)
Sure, but given most reasonable choices, there will be an analogous variant of my claim, right? IE, for most reasonable model-based RL setups, the type of the reward function will be different from the type of the value function, but there will be a “solution concept” saying what it means for the value function to be correct with respect to a set reward function and world-model. This will be your notion of alignment, not “are the two equal”.
> Like, here’s one extreme: imagine that the “planner” does arbitrarily-long-horizon rollouts of possible action sequences and their consequences in the world, and then the “value function” is looking at that whole future rollout and somehow encoding how good it is, and then you can choose the best rollout. In this case we do want the value function to converge to be (for all intents and purposes) a clone of the reward function.
Well, there’s still a type distinction. The reward function gives a value at each time step in the long rollout, while the value function just gives an overall value. So maybe you mean that the ideal value function would be precisely the sum of rewards.
But if so, this isn’t really what RL people typically call a value function. The point of a value function is to capture the potential future rewards associated with a state. For example, if your reward function rewards being high up, then the value of being near the top of a slide is very low (because you’ll soon be at the bottom), even though it’s still generating high reward right now (because you’re currently high up).
So the value of a history (even a long rollout of the future) should incorporate anticipated rewards after the end of the history, not just the value observed within the history itself.
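Concretely, in the usual discounted notation, I’d write the value of a length-$T$ rollout as something like

$$
V(h_{0:T}) \;=\; \sum_{t=0}^{T-1} \gamma^t\, r_t \;+\; \gamma^T\, V(s_T),
$$

where $\gamma$ is a discount factor (possibly 1) and the $\gamma^T V(s_T)$ term is exactly the part that isn’t represented inside the history itself.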
In the rollout architecture you describe, there wouldn’t really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).
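As a toy sketch of the kind of architecture you’re describing — with `world_model` and `reward_fn` as hypothetical stand-ins for a deterministic model and a directly accessible reward function — ranking plans is just summing rewards along each rollout:

```python
import itertools

def plan_by_rollout(world_model, reward_fn, state, actions, horizon, gamma=1.0):
    """Score every candidate action sequence by its (discounted) sum of predicted rewards.

    `world_model(state, action) -> next_state` and `reward_fn(state, action) -> float`
    are hypothetical stand-ins. With the reward function directly in hand, ranking
    plans needs no separately learned value function.
    """
    best_plan, best_score = None, float("-inf")
    for plan in itertools.product(actions, repeat=horizon):
        s, score = state, 0.0
        for t, a in enumerate(plan):
            score += (gamma ** t) * reward_fn(s, a)
            s = world_model(s, a)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan
```

Exhaustive enumeration obviously doesn’t scale to long horizons; the point is just where the reward function slots in, and that nothing value-function-shaped appears.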
> On the opposite extreme, when you’re not doing rollouts at all, and instead the value function is judging particular states or actions, then I guess it should be less like the reward function and more like “expected upcoming reward assuming the current policy”, which I think is what you’re saying.
It doesn’t seem to me like there is any “more/less like reward” spectrum here. The value function is just different from the reward function. In an architecture where you have a “value function” which operates like a reward function, I would just call it the “estimated reward function” or something along those lines, because RL people invented the value/reward distinction to point at something important (namely the difference between immediate reward and cumulative expected reward), and I don’t want to use the terms in a way which gets rid of that distinction.
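In symbols, the distinction I want to preserve is roughly: reward scores a single moment, while value is a cumulative expectation over the future under the current policy,

$$
r_t = R(s_t, a_t), \qquad V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_t \,\middle|\, s_0 = s\right],
$$

with $\gamma$ a discount factor.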
> Like, maybe I’m putting on my shoes because I know that this is the first step of a plan where I’ll go to the candy store and buy candy and eat it. I’m motivated to put on my shoes by the image in my head where, a mere 10 minutes from now, I’ll be back at home eating yummy candy. In this case, the value function is hopefully approximating the reward function, and specifically approximating what the reward function will do at the moment where I will eat candy.
How is this “approximating the reward function”?? Again, if you feed both the value and reward function the same thing (the imagined history of going to the store and coming back and eating candy), you hope that they produce very different results (the reward function produces a sequence of individual rewards for each moment, including a high reward when you’re eating the candy; the value function produces one big number accounting for the positives and negatives of the plan, including estimated future value of the post-candy-eating crash, even though that’s not represented inside the history).
> Well anyway, your point is well taken. Maybe I’ll change it to “the value function might be misaligned with the reward function”, or “incompatible”, or something...
I continue to feel like you’re not seeing that there is a precise formal notion of “the value function is aligned with the reward function”, namely, that the value function is the solution to the value iteration equation (the Bellman equation) wrt a given reward function and world model.
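For concreteness, in the standard discounted setting with world model $P$ and reward function $R$, that solution concept is

$$
V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^{*}(s')\bigr],
$$

and “aligned with the reward function” means satisfying this equation, not equaling $R$.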
> So maybe you mean that the ideal value function would be precisely the sum of rewards.
Yes, thanks, that’s what I should have said.
> In the rollout architecture you describe, there wouldn’t really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).
For “access to the reward function”, we need to predict what the reward function will do (which may involve hard-to-predict things like “the human will be pleased with what I’ve done”). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a “reward function model”, and the thing-that-predicts-summed-rewards the “value function”, and then to change “the value function may be different from the reward function” to “the value function may be different from the expected sum of rewards”. Something like that?
If so, I agree, you’re right, I was wrong, I shouldn’t be carelessly going back and forth between those things, and I’ll change it.
> For “access to the reward function”, we need to predict what the reward function will do (which may involve hard-to-predict things like “the human will be pleased with what I’ve done”). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a “reward function model”, and the thing-that-predicts-summed-rewards the “value function”, and then to change “the value function may be different from the reward function” to “the value function may be different from the expected sum of rewards”. Something like that?
Ah, that wasn’t quite my intention, but I take it as an acceptable interpretation.
My true intention was that the “reward function calculator” should indeed be directly accessible rather than indirectly learned via reward-function-model. I consider this normative (not predictive) due to the considerations about observation-utility agents discussed in Robust Delegation (and more formally in Daniel Dewey’s paper). Learning the reward function is asking for trouble.
Of course, hard-coding the reward function is also asking for trouble, so… *shrug*
Hi again, I finally got around to reading those links, thanks!
I think what you’re saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called “the easy problem of wireheading”.
So then the context was:
First you said, *If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function.*
Then I replied, *We don’t have access to the reward function, because we can’t perfectly predict what will happen in a complicated world.*
Then you said, *That’s bad because that means we’re not in the observation-utility paradigm.*
But I don’t think that’s right, or at least not in the way I was thinking of it. We’re using the current value function to decide which rollouts are good vs bad, and therefore to decide which action to take. So my “value function” is kinda playing the role of a utility function (albeit messier), and my “reward function” is kinda playing the role of “an external entity that swoops in from time to time and edits the utility function”. Like, if the agent is doing terrible things, then some credit-assignment subroutine goes into the value function, looks at what is currently motivating the agent, and sets that thing to not be motivating in the future.
The closest utility function analogy would be: you’re trying to make an agent with a complicated opaque utility function (because it’s a complicated world). You can’t write the utility function down. So instead you code up an automated utility-function-editing subroutine. The way the subroutine works is that sometimes the agent does something which we recognize as bad / good, and then the subroutine edits the utility function to assign lower / higher utility to “things like that” in the future. After many such edits, maybe we’ll get the right utility function, except not really because of all the problems discussed in this post, e.g. the incentive to subvert the utility-function-editing subroutine.
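Here’s a deliberately crude toy sketch of the kind of subroutine I have in mind — every name is hypothetical, and real credit assignment would be far messier:

```python
def edit_value_function(value_table, motivating_features, verdict, step=0.1):
    """Nudge the cached values of whatever was motivating the agent.

    `value_table` maps feature descriptions to cached values, `motivating_features`
    is whatever the credit-assignment step identifies as having driven the recent
    behavior, and `verdict` is +1 / -1 when we recognize that behavior as good / bad.
    All hypothetical stand-ins -- a cartoon, not a claim about how any real system
    does it.
    """
    for feature in motivating_features:
        value_table[feature] = value_table.get(feature, 0.0) + step * verdict
    return value_table
```

After many such edits you hope the cached values come to track what we actually want — modulo the problems discussed in the post, like the incentive to subvert the editing subroutine itself.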
So it’s still in the observation-utility paradigm I think, or at least it seems to me that it doesn’t have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn’t have to. In the human example, some people are hedonists, but others aren’t.
Sorry if I’m misunderstanding what you were saying.
> So it’s still in the observation-utility paradigm I think, or at least it seems to me that it doesn’t have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn’t have to. In the human example, some people are hedonists, but others aren’t.
All sounds perfectly reasonable. I just hope you recognize that it’s all a big mess (because it’s difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you’re aware, I think we need stuff from my ‘learning normativity’ agenda to dodge these bullets.
In particular, I would hesitate to commit to the idea that rewards are the only type of feedback we submit.
FWIW, I’m now thinking of your “value function” as expected utility in Jeffrey-Bolker terms. We need not assume a utility function to speak of expected utility. This perspective is nice in that it’s a generalization of what RL people mean by “value function” anyway: the value function is exactly the expected utility of the event “I wind up in this specific situation” (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events.
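The Jeffrey-Bolker flavor I have in mind: value attaches directly to events, constrained by the desirability axiom — for disjoint events $A$ and $B$ with $P(A) + P(B) > 0$,

$$
V(A \cup B) \;=\; \frac{V(A)\,P(A) + V(B)\,P(B)}{P(A) + P(B)},
$$

and the RL value function is just the special case where the event is “I wind up in this specific situation”.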
So let’s see if we can pop up the conversational stack.
I guess the larger topic at hand was: how do we define whether a value function is “aligned” (in an inner sense, so, when compared to an outer objective which is being used for training it)?
Well, I think it boils down to whether the current value function makes “reliably good predictions” about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with very low probability, in some appropriate sense).
If we think of the true value function as V*(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, V(x) is closer to V*(x) than that modification. (OK that’s a bit lame, but hopefully you get the general direction I’m trying to point in.)
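One stab at writing that down, with $d$ an unspecified distance metric and “catastrophic” left as a placeholder predicate:

$$
\text{for every catastrophic modification } \tilde{V} \text{ of } V^{*}: \qquad d(V, V^{*}) \;<\; d(\tilde{V}, V^{*}),
$$

i.e., the learned $V$ should be strictly closer to the true $V^{*}$ than any of the catastrophic ways of being wrong.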
Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not “Building this kind of AGI is a great idea”, but rather “This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I’m correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there’s some chance they’ll succeed whether we like it or not.”
> FWIW, I’m now thinking of your “value function” as expected utility in Jeffrey-Bolker terms.
Thanks, that’s helpful.
> how do we define whether a value function is “aligned” (in an inner sense, so, when compared to an outer objective which is being used for training it)?
One way I think I would frame the problem differently than you here is: I’m happy to talk about outer and inner alignment for pedagogical purposes, but I think it’s overly constraining as a framework for solving the problem. For example, (Paul-style) corrigibility is I think an attempt to cut through outer and inner alignment simultaneously, as is interpretability perhaps. And like you say, rewards don’t need to be the only type of feedback.
We can also set up the AGI to NOOP when the expected value of some action is <0, rather than having it always take the least bad action. (...And then don’t use it in time-sensitive situations! But that’s fine for working with humans to build better-aligned AGIs.) So then the goal would be something like “every catastrophic action has expected value <0 as assessed by the AGI (and also, the AGI will not be motivated to self-modify or create successors, at least not in a way that undermines that property) (and also, the AGI is sufficiently capable that it can do alignment research etc., as opposed to it sitting around NOOPing all day)”.
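A minimal sketch of that decision rule, with `expected_value` as a hypothetical stand-in for the AGI’s own assessment:

```python
NOOP = None  # the do-nothing action

def choose_action(candidate_actions, expected_value):
    """Act only if the best candidate clears zero; otherwise NOOP.

    `expected_value(action) -> float` is a hypothetical stand-in for the AGI's
    learned value estimate. Unlike a plain argmax, this never settles for the
    "least bad" option when every option looks net-negative.
    """
    best = max(candidate_actions, key=expected_value)
    return best if expected_value(best) >= 0 else NOOP
```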
So then this could look like a pretty weirdly misaligned AGI but it has a really effective “may-lead-to-catastrophe (directly or indirectly) predictor circuit” attached. (The circuit asks “Does it pattern-match to murder? Does it pattern-match to deception? Does it pattern-match to ‘things that might upset lots of people’? Does it pattern-match to ‘things that respectable people don’t normally do’?...”) And the circuit magically never has any false-negatives. Anyway, in that case the framework of “how well are we approximating the intended value function?” isn’t quite the right framing, I think.
> I think we need stuff from my ‘learning normativity’ agenda to dodge these bullets.
Yeah I’m very sympathetic to the spirit of that. I’m a bit stumped on how those ideas could be implemented, but it’s certainly in the space of things that I continue to brainstorm about...
Thanks!!