Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions.
I haven’t seen you give arguments that your models [of instrumental convergence] are [useful for realistic agents]
Falsifying claims and “breaking” proposals is a classic element of AI alignment discourse and debate. Since we’re talking about superintelligent agents, we can’t predict exactly what a proposal would do. However, if I make a claim (“a superintelligent paperclip maximizer would keep us around because of gains from trade”), you can falsify this by showing that my claimed policy is dominated by another class of policies (“we would likely be comically resource-inefficient in comparison; GFT arguments don’t model dynamics which allow killing other agents and appropriating their resources”).
If even we can come up with this dominant policy class, the posited superintelligence won't miss it either. We don't know what the superintelligent policy will be, but we know what it won't be (see also Formalizing convergent instrumental goals). Even though I don't know how Garry Kasparov will open the game, I confidently predict that he won't let me checkmate him in two moves.
Non-optimal power and instrumental convergence
Instead of thinking about optimal policies, let’s consider the performance of a given algorithm A. A(M,R) takes a rewardless MDP M and a reward function R as input, and outputs a policy.
Definition. Let reward functions R be drawn from a continuous distribution with CDF F. The average return achieved by algorithm A at state s and discount rate γ is
$$\int V^{A(M,R)}_R(s, \gamma) \, dF(R).$$
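To make this concrete, here is a minimal Monte Carlo sketch of the definition (my own illustration, not from the original discussion; the tabular MDP encoding, the uniform reward distribution, and the use of value iteration as the algorithm A are all assumptions chosen for the example):

```python
# Minimal sketch: estimate the average return of an algorithm A over a
# distribution of reward functions on a small tabular MDP.
# Assumptions (mine, for illustration): rewards are state-based, the reward
# distribution F is uniform on [0, 1]^|S|, and A is plain value iteration.
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """A(M, R): P is a transition tensor of shape (|A|, |S|, |S|), R a state-reward
    vector. Returns the (numerically converged) optimal state-value function."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = R[None, :] + gamma * P @ V   # Q[a, s] = R(s) + gamma * E[V(s') | s, a]
        V = Q.max(axis=0)
    return V

def average_return(P, s, gamma, n_samples=10_000, seed=0):
    """Monte Carlo estimate of ∫ V^{A(M,R)}_R(s, γ) dF(R)."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[1]
    returns = []
    for _ in range(n_samples):
        R = rng.uniform(0.0, 1.0, size=n_states)   # draw a reward function from F
        returns.append(value_iteration(P, R, gamma)[s])
    return float(np.mean(returns))
```

Swapping value_iteration for any other policy-producing procedure gives the corresponding average return for that choice of A.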
Instrumental convergence with respect to A’s policies can be defined similarly (“what is the R-measure of a given trajectory under A?”). The theory I’ve laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called “instrumental convergence”.
Here’s bad reasoning, which implies that the cow tears a hole in spacetime:
Suppose the laws of physics bestow godhood upon an agent executing some convoluted series of actions; in particular, this allows avoiding heat death. Clearly, it is optimal for the vast majority of agents to instantly become god.
The problem is that it's impractical to predict exactly what a smarter agent will do, or which specific kinds of action will be instrumentally convergent for A, and it's unreasonable to assume the real agent is infinitely smart. Just because it's smart doesn't mean it's omniscient, as you rightly point out.
Here’s better reasoning:
Suppose that the MDP modeling the real world represents shutdown as a single terminal state. Most optimal agents don't allow themselves to be shut down. Furthermore, since we can see that most goals offer better reward at non-shutdown states, superintelligent A can see this as well.[1] While I don't know exactly what A will tend to do, I predict that policies generated by A will tend to resist shutdown.
It might seem like I’m assuming the consequent here. This is not so – the work is first done by the theorems on optimal behavior, which do imply that most goals achieve greater return by avoiding shutdown. The question is whether reasonably intelligent suboptimal agents realize this fact. Given a uniformly drawn reward function, we can usually come up with a better policy than dying, so the argument is that A can as well.
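As a toy check of the "most goals achieve greater return by avoiding shutdown" step (my own construction, not the theorems themselves): take a four-state MDP in which one absorbing state represents shutdown and staying alive keeps two other states reachable. The fraction of uniformly drawn reward functions whose optimal policy avoids immediate shutdown comes out around 2/3:

```python
# Toy check (my construction): state 3 is an absorbing "shutdown" state; staying
# alive keeps two "rooms" (states 1 and 2) reachable. For reward functions drawn
# uniformly from [0, 1]^4, the optimal action at the start state avoids shutdown
# whenever max(R[1], R[2]) > R[3], i.e. for about 2/3 of draws.
import numpy as np

gamma = 0.9
n_states, n_actions = 4, 3      # actions: 0 = go to room 1, 1 = go to room 2, 2 = shut down
P = np.zeros((n_actions, n_states, n_states))
P[0, 0, 1] = P[0, 1, 1] = P[0, 2, 1] = 1.0   # "go to room 1"
P[1, 0, 2] = P[1, 1, 2] = P[1, 2, 2] = 1.0   # "go to room 2"
P[2, :, 3] = 1.0                              # "shut down" from anywhere
P[:, 3, 3] = 1.0                              # shutdown is absorbing under every action

def avoids_shutdown(R, iters=500):
    V = np.zeros(n_states)
    for _ in range(iters):                    # value iteration stands in for A(M, R)
        V = (R[None, :] + gamma * P @ V).max(axis=0)
    Q0 = R[0] + gamma * P[:, 0] @ V           # action values at the start state 0
    return Q0.argmax() != 2                   # is the optimal action anything but shutdown?

rng = np.random.default_rng(0)
frac = np.mean([avoids_shutdown(rng.uniform(size=n_states)) for _ in range(5000)])
print(f"fraction of reward functions whose optimal policy avoids shutdown: {frac:.2f}")
```

Under this construction, adding more reachable non-shutdown rooms pushes the fraction toward 1; it is also one concrete way to read the "measure of a given trajectory under A" question above.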
I’m afraid I’m mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework.
If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it’d be something like:
1. Instrumental convergence isn’t training-time behaviour, it’s test-time behaviour. It isn’t about increasing reward, it’s about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it's the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g., if it's just really, really difficult for gradient descent to get a proper planning module working). Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won't get you to optimality. But I would say that talking about "optimality" here ruins the abstraction, for reasons outlined in my previous comment).
I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment
I was actually going to respond, “that’s a good point, but (IMO) a different concern than the one you initially raised”. I see you making two main critiques.
(paraphrased) “A won’t produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn’t useful” – I replied to this critique above.
“The space of goals that agents might learn is very different from the space of reward functions.” I agree this is an important part of the story. I think the reasonable takeaway is “current theorems on instrumental convergence help us understand what superintelligent A won’t do, assuming no reward-result gap. Since we can’t assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced.”
I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.