I claim that most reward functions lead to agents with strong convergent instrumental goals
I think this depends a lot on how you model the agent developing. If you start off with a highly intelligent agent which has the ability to make long-term plans, but doesn’t yet have any goals, and then you train it on a random reward function—then yes, it probably will develop strong convergent instrumental goals.
On the other hand, if you start off with a randomly initialised neural network, and then train it on a random reward function, then probably it will get stuck in a local optimum pretty quickly, and never learn to even conceptualise these things called “goals”.
I claim that when people think about reward functions, they think too much about the former case, and not enough about the latter. Because while it’s true that we’re eventually going to get highly intelligent agents which can make long-term plans, it’s also important that we get to control what reward functions they’re trained on up to that point. And so plausibly we can develop intelligent agents that, in some respects, are still stuck in “local optima” in the way they think about convergent instrumental goals—i.e. they’re missing whatever cognitive functionality is required for being ambitious on a large scale.
Agreed – I should have clarified. I’ve been mostly discussing instrumental convergence with respect to optimal policies. The path through policy space is also important.
Makes sense. For what it’s worth, I’d also argue that thinking about optimal policies at all is misguided (e.g. what’s the optimal policy for humans—the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we’d be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).
(I now think that you were very right in saying “thinking about optimal policies at all is misguided”, and I was very wrong to disagree. I’ve thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology—not about optimal policies for a reward function.)
I disagree.
We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1] Even if we don’t expect the trained agents to reach the optimal policy in the real world, we should still understand what behavior is like at optimum. If you think your proposal is not aligned at optimum but is aligned for realistic training paths, you should have a strong story for why.
Formal theorizing about instrumental convergence with respect to optimal behavior is strictly easier than theorizing about ϵ-optimal behavior, which I think is what you want for a more realistic treatment of instrumental convergence for real agents. Even if you want to think about sub-optimal policies, if you don’t understand optimal policies… good luck! Therefore, we also have an instrumental (...) interest in studying the behavior at optimum.
At least, the tabular algorithms are proven, but no one uses those for real stuff. I’m not sure what the results are for function approximators, but I think you get my point.
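A minimal sketch of the tabular case being referenced: ϵ-greedy tabular Q-learning on a made-up 3-state chain MDP, where the learned policy matches the optimal one after enough exploration. The environment, hyperparameters, and run length are illustrative assumptions, not anything from the comment itself.

```python
# Illustrative only: tabular Q-learning on a toy 3-state chain MDP.
# Action 1 moves right, action 0 moves left; reward 1 for reaching the rightmost state.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

Q = np.zeros((n_states, n_actions))
s = 0
for t in range(1, 200_000):
    # epsilon-greedy exploration
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next, r = step(s, a)
    alpha = 1.0 / (1.0 + 0.001 * t)  # decaying learning rate
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q.argmax(axis=1))  # recovers the optimal "always move right" policy: [1 1 1]
```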
1. I think it’s more accurate to say that, because approximately none of the non-trivial theoretical results hold for function approximation, approximately none of our non-trivial agents are proven to eventually converge to the optimal policy. (Also, given the choice between an algorithm without convergence proofs that works in practice, and an algorithm with convergence proofs that doesn’t work in practice, everyone will use the former). But we shouldn’t pay any attention to optimal policies anyway, because the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute.
2. I think theorizing about ϵ-optimal behavior is more useful than theorizing about optimal behaviour by roughly ϵ, for roughly the same reasons. But in general, clearly I can understand things about suboptimal policies without understanding optimal policies. I know almost nothing about the optimal policy in StarCraft, but I can still make useful claims about AlphaStar (for example: it’s not going to take over the world).
Again, let’s try to cash this out. I give you a human—or, say, the emulation of a human, running in a simulation of the ancestral environment. Is this safe? How do you make it safer? What happens if you keep selecting for intelligence? I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
I’m confused, because I don’t disagree with any specific point you make—just the conclusion. Here’s my attempt at a disagreement which feels analogous to me:
TurnTrout: here’s how spherical cows roll downhill!
ricraz: real cows aren’t spheres.
My response in this “debate” is: if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
I don’t understand why you think that. If you can have a good understanding of instrumental convergence and power-seeking for optimal agents, then you can consider whether any of those same reasons apply for suboptimal humans.
Considering power-seeking for optimal agents is a relaxed problem. Yes, ideally, we would instantly jump to the theory that formally describes power-seeking for suboptimal agents with realistic goals in all kinds of environments. But before you do that, a first step is understanding power-seeking in MDPs. Then, you can take formal insights from this first step and use them to update your pre-theoretic intuitions where appropriate.
Thanks for engaging despite the opacity of the disagreement. I’ll try to make my position here much more explicit (and apologies if that makes it sound brusque). The fact that your model is a simplified abstract model is not sufficient to make it useful. Some abstract models are useful. Some are misleading and will cause people who spend time studying them to understand the underlying phenomenon less well than they did before. From my perspective, I haven’t seen you give arguments that your models are in the former category not the latter. Presumably you think they are in fact useful abstractions—why? (A few examples of the latter: behaviourism, statistical learning theory, recapitulation theory, Gettier-style analysis of knowledge).
My argument for why they’re overall misleading: when I say that “the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute”, or that safety researchers shouldn’t think about AIXI, I’m not just saying that these are inaccurate models. I’m saying that they are modelling fundamentally different phenomena than the ones you’re trying to apply them to. AIXI is not “intelligence”, it is brute force search, which is a totally different thing that happens to look the same in the infinite limit. Optimal tabular policies are not skill at a task, they are a cheat sheet, but they happen to look similar in very simple cases.
Probably the best example of what I’m complaining about is Ned Block trying to use Blockhead to draw conclusions about intelligence. I think almost everyone around here would roll their eyes hard at that. But then people turn around and use abstractions that are just as unmoored from reality as Blockhead, often in a very analogous way. (This is less a specific criticism of you, TurnTrout, and more a general criticism of the field).
if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
Forgive me a little poetic license. The analogy in my mind is that you were trying to model the cow as a sphere, but you didn’t know how to do so without setting its weight as infinite, and what looked to you like your model predicting the cow would roll downhill was actually your model predicting that the cow would swallow up the nearby fabric of spacetime and the bottom of the hill would fall into its event horizon. At which point, yes, you would be better off just saying “nobody should think about spherical cows”.
Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions.
I haven’t seen you give arguments that your models [of instrumental convergence] are [useful for realistic agents]
Falsifying claims and “breaking” proposals is a classic element of AI alignment discourse and debate. Since we’re talking about superintelligent agents, we can’t predict exactly what a proposal would do. However, if I make a claim (“a superintelligent paperclip maximizer would keep us around because of gains from trade”), you can falsify this by showing that my claimed policy is dominated by another class of policies (“we would likely be comically resource-inefficient in comparison; GFT arguments don’t model dynamics which allow killing other agents and appropriating their resources”).
Even we can come up with this dominant policy class, so the posited superintelligence wouldn’t miss it either. We don’t know what the superintelligent policy will be, but we know what it won’t be (see also Formalizing convergent instrumental goals). Even though I don’t know how Gary Kasparov will open the game, I confidently predict that he won’t let me checkmate him in two moves.
Non-optimal power and instrumental convergence
Instead of thinking about optimal policies, let’s consider the performance of a given algorithm A. A(M,R) takes a rewardless MDP M and a reward function R as input, and outputs a policy.
Definition. Let reward functions R be drawn from a continuous distribution with CDF F. The average return achieved by algorithm A at state s and discount rate γ is
$$\int V^{A(M,R)}_{R}(s,\gamma)\, dF(R).$$
Instrumental convergence with respect to A’s policies can be defined similarly (“what is the R-measure of a given trajectory under A?”). The theory I’ve laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called “instrumental convergence”.
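To make the definition concrete, here is a small Monte Carlo sketch. The toy 4-state MDP, the uniform distribution over reward functions, and the choice of A as exact value iteration are all assumptions made up for illustration; the snippet estimates the average return of A at a state, plus a crude instrumental-convergence-style statistic (how often A’s policy moves “right” at the start state).

```python
# Illustrative sketch of the definition above, with made-up choices throughout.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# P[a, s, s']: deterministic toy transition tensor.
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0               # action 0: move "left"
    P[1, s, min(s + 1, n_states - 1)] = 1.0    # action 1: move "right"

def A(R):
    """The algorithm A(M, R): here, exact value iteration returning a deterministic policy."""
    V = np.zeros(n_states)
    for _ in range(500):
        Q = R[None, :] + gamma * (P @ V)       # Q[a, s]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)                    # action chosen in each state

def value(policy, R, s0):
    """Exact discounted return of `policy` from state s0 under state reward R."""
    P_pi = np.stack([P[policy[s], s] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    return V[s0]

# Sample reward functions from the distribution (here uniform on [0,1]^states).
samples = [rng.uniform(size=n_states) for _ in range(2000)]
avg_return = np.mean([value(A(R), R, s0=0) for R in samples])
# How often does A's policy take the "right" action at the start state?
move_right_measure = np.mean([A(R)[0] == 1 for R in samples])
print(avg_return, move_right_measure)
```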
Here’s bad reasoning, which implies that the cow tears a hole in spacetime:
Suppose the laws of physics bestow godhood upon an agent executing some convoluted series of actions; in particular, this allows avoiding heat death. Clearly, it is optimal for the vast majority of agents to instantly become god.
The problem is that it’s impractical to predict what a smarter agent will do, or what specific kinds of action will be instrumentally convergent for A; nor will the real agent be infinitely smart. Just because it’s smart doesn’t mean it’s omniscient, as you rightly point out.
Here’s better reasoning:
Suppose that the MDP modeling the real world represents shutdown as a single terminal state. Most optimal agents don’t allow themselves to be shut down. Furthermore, since we can see that most goals offer better reward at non-shutdown states, superintelligent A can as well.[1] While I don’t know exactly what A will tend to do, I predict that policies generated by A will tend to resist shutdown.
It might seem like I’m assuming the consequent here. This is not so – the work is first done by the theorems on optimal behavior, which do imply that most goals achieve greater return by avoiding shutdown. The question is whether reasonably intelligent suboptimal agents realize this fact. Given a uniformly drawn reward function, we can usually come up with a better policy than dying, so the argument is that A can as well.
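To show the shape of this argument rather than the theorems themselves, here is a toy calculation with entirely made-up assumptions: a two-action choice between an absorbing “alive” loop and a shutdown terminal state, with rewards drawn uniformly. It checks how often the optimal choice avoids shutdown.

```python
# Toy illustration: for most uniformly drawn reward functions, the optimal
# choice avoids the shutdown terminal state, because staying alive keeps
# collecting discounted reward while shutdown pays out at most once.
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
trials = 10_000
avoid = 0
for _ in range(trials):
    r_loop, r_shutdown = rng.uniform(size=2)
    v_alive = r_loop / (1 - gamma)   # collect the loop state's reward forever
    v_shutdown = r_shutdown          # one-off reward, then nothing
    avoid += v_alive > v_shutdown
print(avoid / trials)  # roughly 0.95 with these numbers; tends to 1 as gamma -> 1
```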
I’m afraid I’m mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework.
If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it’d be something like:
1. Instrumental convergence isn’t training-time behaviour, it’s test-time behaviour. It isn’t about increasing reward, it’s about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it’s the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g., if it’s just really, really difficult for gradient descent to get a proper planning module working). Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment).
I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment
I was actually going to respond, “that’s a good point, but (IMO) a different concern than the one you initially raised”. I see you making two main critiques.
(paraphrased) “A won’t produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn’t useful” – I replied to this critique above.
“The space of goals that agents might learn is very different from the space of reward functions.” I agree this is an important part of the story. I think the reasonable takeaway is “current theorems on instrumental convergence help us understand what superintelligent A won’t do, assuming no reward-result gap. Since we can’t assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced.”
I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.
We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1]
At least, the tabular algorithms are proven, but no one uses those for real stuff. I’m not sure what the results are for function approximators, but I think you get my point.
Is the point that people try to use algorithms which they think will eventually converge to the optimal policy? (Assuming there is one.)
Something like that, yeah.