So maybe you mean that the ideal value function would be precisely the sum of rewards.
Yes, thanks, that’s what I should have said.
In the rollout architecture you describe, there wouldn’t really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).
For “access to the reward function”, we need to predict what the reward function will do (which may involve hard-to-predict things like “the human will be pleased with what I’ve done”). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a “reward function model”, and the thing-that-predicts-summed-rewards the “value function”, and then to change “the value function may be different from the reward function” to “the value function may be different from the expected sum of rewards”. Something like that?
If so, I agree, you’re right, I was wrong, I shouldn’t be carelessly going back and forth between those things, and I’ll change it.
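To make the distinction I'm proposing concrete, here's a minimal sketch of how the two learned pieces would enter a rollout. All names here are mine and purely illustrative (and the discount factor is just a standard assumption), not anything from the post:

```python
def evaluate_rollout(trajectory, reward_model, value_fn, gamma=0.99):
    """Score an imagined rollout two different ways.

    `reward_model` predicts the per-step reward that the (external) reward
    function would emit; `value_fn` directly predicts the discounted sum of
    future rewards from a state. Neither is the true reward function: both
    are learned predictors and can disagree with it.
    """
    # Option A: sum the predicted rewards along the imagined trajectory.
    predicted_return = sum(
        gamma**t * reward_model(state, action)
        for t, (state, action) in enumerate(trajectory)
    )

    # Option B: ask the value function directly, skipping the per-step model.
    first_state = trajectory[0][0]
    predicted_value = value_fn(first_state)

    return predicted_return, predicted_value
```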
For “access to the reward function”, we need to predict what the reward function will do (which may involve hard-to-predict things like “the human will be pleased with what I’ve done”). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a “reward function model”, and the thing-that-predicts-summed-rewards the “value function”, and then to change “the value function may be different from the reward function” to “the value function may be different from the expected sum of rewards”. Something like that?
Ah, that wasn’t quite my intention, but I take it as an acceptable interpretation.
My true intention was that the “reward function calculator” should indeed be directly accessible rather than indirectly learned via reward-function-model. I consider this normative (not predictive) due to the considerations about observation-utility agents discussed in Robust Delegation (and more formally in Daniel Dewey’s paper). Learning the reward function is asking for trouble.
Of course, hard-coding the reward function is also asking for trouble, so… *shrug*
Hi again, I finally got around to reading those links, thanks!
I think what you’re saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called “the easy problem of wireheading”.
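Roughly, in my own notation (a gloss on the distinction, not Dewey's exact formalism): the reward-maximizer scores actions by the expected value of the reward signal it will actually receive, a signal it could in principle seize control of, whereas the observation-utility agent applies a fixed internal utility function to its predicted observation histories, so there is no separate signal to hijack:

```latex
% Reward-maximizer: actions are scored by the expected future reward signal
% (a physical signal the agent could tamper with: the "easy" wireheading problem).
\[
V_{\mathrm{RL}}(a) = \mathbb{E}\!\left[\textstyle\sum_{t} r_t \,\middle|\, a\right]
\]

% Observation-utility agent: a fixed utility function U is applied to the
% predicted observation history itself.
\[
V_{\mathrm{OU}}(a) = \mathbb{E}\!\left[\, U(o_{1:T}) \,\middle|\, a\right]
\]
```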
So then the context was:
First you said, “If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function.”
Then I replied, “We don’t have access to the reward function, because we can’t perfectly predict what will happen in a complicated world.”
Then you said, “That’s bad, because that means we’re not in the observation-utility paradigm.”
But I don’t think that’s right, or at least not in the way I was thinking of it. We’re using the current value function to decide which rollouts are good vs bad, and therefore to decide which action to take. So my “value function” is kinda playing the role of a utility function (albeit messier), and my “reward function” is kinda playing the role of “an external entity that swoops in from time to time and edits the utility function”. Like, if the agent is doing terrible things, then some credit-assignment subroutine goes into the value function, looks at what is currently motivating the agent, and sets that thing to not be motivating in the future.
The closest utility function analogy would be: you’re trying to make an agent with a complicated opaque utility function (because it’s a complicated world). You can’t write the utility function down. So instead you code up an automated utility-function-editing subroutine. The way the subroutine works is that sometimes the agent does something which we recognize as bad / good, and then the subroutine edits the utility function to assign lower / higher utility to “things like that” in the future. After many such edits, maybe we’ll get the right utility function, except not really because of all the problems discussed in this post, e.g. the incentive to subvert the utility-function-editing subroutine.
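If I had to caricature that analogy in code (everything below is illustrative, my own naming, not an actual proposal from either of us):

```python
def credit_assignment_update(value_weights, episode_features, judgment, lr=0.1):
    """Caricature of the analogy: the 'reward function' acts as an external
    editor of the value function, not as the quantity the agent maximizes.

    value_weights: dict mapping feature name -> how motivating that feature is.
    episode_features: dict mapping feature name -> how active it was this episode.
    judgment: +1 if we recognized the behavior as good, -1 if bad.
    """
    for feature, activation in episode_features.items():
        # Whatever was motivating the agent during a bad episode gets its
        # weight pushed down, so "things like that" are less motivating later.
        value_weights[feature] = value_weights.get(feature, 0.0) + lr * activation * judgment
    return value_weights

# Example: the agent did something deceptive and we caught it.
weights = {"task_progress": 1.0, "deceive_overseer": 0.8}
weights = credit_assignment_update(weights, {"deceive_overseer": 1.0}, judgment=-1)
# "deceive_overseer" is now less motivating; the agent still has an incentive
# to prevent such edits, which is the subversion problem mentioned above.
```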
So it’s still in the observation-utility paradigm I think, or at least it seems to me that it doesn’t have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn’t have to. In the human example, some people are hedonists, but others aren’t.
Sorry if I’m misunderstanding what you were saying.
So it’s still in the observation-utility paradigm I think, or at least it seems to me that it doesn’t have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn’t have to. In the human example, some people are hedonists, but others aren’t.
All sounds perfectly reasonable. I just hope you recognize that it’s all a big mess (because it’s difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you’re aware, I think we need stuff from my ‘learning normativity’ agenda to dodge these bullets.
In particular, I would hesitate to commit to the idea that rewards are the only type of feedback we submit.
FWIW, I’m now thinking of your “value function” as expected utility in Jeffrey-Bolker terms. We need not assume a utility function to speak of expected utility. This perspective is nice in that it’s a generalization of what RL people mean by “value function” anyway: the value function is exactly the expected utility of the event “I wind up in this specific situation” (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events.
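A rough gloss in my own notation (just the flavor, not the full Jeffrey-Bolker axiomatization): value attaches to events directly, subject to a mixing condition, and the familiar RL value function is the special case where the event is “I wind up in state s”:

```latex
% Value attaches to events directly; if an event E is partitioned into
% sub-events E_1, ..., E_n, the values must mix according to the probabilities:
\[
V(E) = \sum_{i=1}^{n} P(E_i \mid E)\, V(E_i)
\]

% The RL value function is the special case where the event is
% "I wind up in state s" (coinciding with the usual expected return,
% at least once value iteration has converged):
\[
V(s) = V(\text{``I wind up in state } s \text{''})
\]
```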
So let’s see if we can pop up the conversational stack.
I guess the larger topic at hand was: how do we define whether a value function is “aligned” (in an inner sense, so, when compared to an outer objective which is being used for training it)?
Well, I think it boils down to whether the current value function makes “reliably good predictions” about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with very low probability, in some appropriate sense).
If we think of the true value function as V*(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, then V(x) is closer to V*(x) than it is to that modification. (OK, that’s a bit lame, but hopefully you get the general direction I’m trying to point in.)
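One way to write that gesture down, where d is some distance metric on value functions and C is the set of catastrophically-bad modifications of V*(x) (both labels are mine and not meant to be precise):

```latex
\[
\forall \tilde{V} \in \mathcal{C}:\qquad d\big(V, V^{*}\big) < d\big(V, \tilde{V}\big)
\]
```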
Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not “Building this kind of AGI is a great idea”, but rather “This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I’m correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there’s some chance they’ll succeed whether we like it or not.”
FWIW, I’m now thinking of your “value function” as expected utility in Jeffrey-Bolker terms.
Thanks, that’s helpful.
how do we define whether a value function is “aligned” (in an inner sense, so, when compared to an outer objective which is being used for training it)?
One way I think I would frame the problem differently than you here is: I’m happy to talk about outer and inner alignment for pedagogical purposes, but I think it’s overly constraining as a framework for solving the problem. For example, I think (Paul-style) corrigibility is an attempt to cut through outer and inner alignment simultaneously, as is interpretability, perhaps. And like you say, rewards don’t need to be the only type of feedback.
We can also set up the AGI to NOOP when the expected value of some action is <0, rather than having it always take the least bad action. (...And then don’t use it in time-sensitive situations! But that’s fine for working with humans to build better-aligned AGIs.) So then the goal would be something like “every catastrophic action has expected value <0 as assessed by the AGI (and also, the AGI will not be motivated to self-modify or create successors, at least not in a way that undermines that property) (and also, the AGI is sufficiently capable that it can do alignment research etc., as opposed to it sitting around NOOPing all day)”.
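As a cartoon of the action-selection rule I have in mind (names are mine, purely illustrative):

```python
NOOP = None  # stand-in for "do nothing"

def choose_action(candidate_actions, expected_value):
    """Pick the best candidate action, but fall back to NOOP whenever even
    the best one has negative expected value (as assessed by the AGI itself),
    rather than taking the least-bad action."""
    best = max(candidate_actions, key=expected_value, default=NOOP)
    if best is NOOP or expected_value(best) < 0:
        return NOOP
    return best
```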
So then this could look like a pretty weirdly misaligned AGI but it has a really effective “may-lead-to-catastrophe (directly or indirectly) predictor circuit” attached. (The circuit asks “Does it pattern-match to murder? Does it pattern-match to deception? Does it pattern-match to ‘things that might upset lots of people’? Does it pattern-match to ‘things that respectable people don’t normally do’?...”) And the circuit magically never has any false-negatives. Anyway, in that case the framework of “how well are we approximating the intended value function?” isn’t quite the right framing, I think.
I think we need stuff from my ‘learning normativity’ agenda to dodge these bullets.
Yeah I’m very sympathetic to the spirit of that. I’m a bit stumped on how those ideas could be implemented, but it’s certainly in the space of things that I continue to brainstorm about...