I’m confused, because I don’t disagree with any specific point you make—just the conclusion. Here’s my attempt at a disagreement which feels analogous to me:
TurnTrout: here’s how spherical cows roll downhill!
ricraz: real cows aren’t spheres.
My response in this “debate” is: if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
I don’t understand why you think that. If you have a good understanding of instrumental convergence and power-seeking for optimal agents, then you can consider whether any of those same reasons apply to suboptimal humans.
Considering power-seeking for optimal agents is a relaxed problem. Yes, ideally, we would instantly jump to the theory that formally describes power-seeking for suboptimal agents with realistic goals in all kinds of environments. But before you do that, a first step is understanding power-seeking in MDPs. Then, you can take formal insights from this first step and use them to update your pre-theoretic intuitions where appropriate.
Thanks for engaging despite the opacity of the disagreement. I’ll try to make my position here much more explicit (and apologies if that makes it sound brusque). The fact that your model is a simplified abstract model is not sufficient to make it useful. Some abstract models are useful. Some are misleading and will cause people who spend time studying them to understand the underlying phenomenon less well than they did before. From my perspective, I haven’t seen you give arguments that your models are in the former category rather than the latter. Presumably you think they are in fact useful abstractions—why? (A few examples of the latter: behaviourism, statistical learning theory, recapitulation theory, Gettier-style analysis of knowledge).
My argument for why they’re overall misleading: when I say that “the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute”, or that safety researchers shouldn’t think about AIXI, I’m not just saying that these are inaccurate models. I’m saying that they are modelling fundamentally different phenomena than the ones you’re trying to apply them to. AIXI is not “intelligence”, it is brute force search, which is a totally different thing that happens to look the same in the infinite limit. Optimal tabular policies are not skill at a task, they are a cheat sheet, but they happen to look similar in very simple cases.
Probably the best example of what I’m complaining about is Ned Block trying to use Blockhead to draw conclusions about intelligence. I think almost everyone around here would roll their eyes hard at that. But then people turn around and use abstractions that are just as unmoored from reality as Blockhead, often in a very analogous way. (This is less a specific criticism of you, TurnTrout, and more a general criticism of the field).
if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
Forgive me a little poetic license. The analogy in my mind is that you were trying to model the cow as a sphere, but you didn’t know how to do so without setting its weight as infinite, and what looked to you like your model predicting the cow would roll downhill was actually your model predicting that the cow would swallow up the nearby fabric of spacetime and the bottom of the hill would fall into its event horizon. At which point, yes, you would be better off just saying “nobody should think about spherical cows”.
Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions.
I haven’t seen you give arguments that your models [of instrumental convergence] are [useful for realistic agents]
Falsifying claims and “breaking” proposals is a classic element of AI alignment discourse and debate. Since we’re talking about superintelligent agents, we can’t predict exactly what a proposal would do. However, if I make a claim (“a superintelligent paperclip maximizer would keep us around because of gains from trade”), you can falsify this by showing that my claimed policy is dominated by another class of policies (“we would likely be comically resource-inefficient in comparison; GFT arguments don’t model dynamics which allow killing other agents and appropriating their resources”).
Even we can come up with this dominant policy class, so the posited superintelligence wouldn’t miss it either. We don’t know what the superintelligent policy will be, but we know what it won’t be (see also Formalizing convergent instrumental goals). Even though I don’t know how Gary Kasparov will open the game, I confidently predict that he won’t let me checkmate him in two moves.
Non-optimal power and instrumental convergence
Instead of thinking about optimal policies, let’s consider the performance of a given algorithm A. A(M,R) takes a rewardless MDP M and a reward function R as input, and outputs a policy.
Definition. Let R be a continuous distribution over reward functions with CDF F. The average return achieved by algorithm A at state s and discount rate γ is
$\int_{\mathcal{R}} V^{A(M,R)}_R(s, \gamma) \, dF(R).$
Instrumental convergence with respect to A’s policies can be defined similarly (“what is the R-measure of a given trajectory under A?”). The theory I’ve laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called “instrumental convergence”.
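To make this concrete, here’s a minimal numerical sketch of the definition above (the toy tabular setup, the uniform choice of F, and every function name are my own illustration, not part of the formal results): sample reward functions from F, run A on each, and evaluate the policy A returns under the reward function that produced it.

```python
import numpy as np

def policy_return(P, R, policy, s, gamma):
    """Exact return of a deterministic policy from state s.

    P: transition tensor, shape (n_states, n_actions, n_states); R: state rewards.
    Solves the policy-evaluation equation v = R + gamma * P_pi v.
    """
    P_pi = np.array([P[i, policy[i]] for i in range(len(R))])
    v = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
    return v[s]

def average_return(algorithm, P, s, gamma, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the average return of `algorithm` at state s.

    `algorithm(P, R)` plays the role of A(M, R): it returns a policy as an
    array mapping states to actions. F is taken to be iid uniform state
    rewards -- an assumption made purely for illustration.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        R = rng.uniform(size=P.shape[0])
        total += policy_return(P, R, algorithm(P, R), s, gamma)
    return total / n_samples
```

Instrumental convergence with respect to A could be estimated the same way: instead of averaging return, count the F-measure of sampled reward functions for which A’s policy produces a given trajectory.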
Here’s bad reasoning, which implies that the cow tears a hole in spacetime:
Suppose the laws of physics bestow godhood upon an agent executing some convoluted series of actions; in particular, this allows avoiding heat death. Clearly, it is optimal for the vast majority of agents to instantly become god.
The problem is that it’s impractical to predict exactly what a smarter agent will do, or which specific kinds of action will be instrumentally convergent for A, and we can’t assume the real agent is infinitely smart. Just because it’s smart doesn’t mean it’s omniscient, as you rightly point out.
Here’s better reasoning:
Suppose that the MDP modeling the real world represents shutdown as a single terminal state. Most optimal agents don’t allow themselves to be shut down. Furthermore, since we can see that most goals offer better reward at non-shutdown states, superintelligent A can as well.[1] While I don’t know exactly what A will tend to do, I predict that policies generated by A will tend to resist shutdown.
It might seem like I’m assuming the consequent here. This is not so – the work is first done by the theorems on optimal behavior, which do imply that most goals achieve greater return by avoiding shutdown. The question is whether reasonably intelligent suboptimal agents realize this fact. Given a uniformly drawn reward function, we can usually come up with a better policy than dying, so the argument is that A can as well.
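To illustrate the shape of that argument with a toy example (the three-state MDP, the uniform reward distribution, and γ = 0.9 below are my own construction, not one of the theorems): draw reward functions at random, solve for the optimal policy, and check how often it avoids the absorbing shutdown state.

```python
import numpy as np

# Toy MDP: states 0 and 1 are "alive", state 2 is "shutdown" (absorbing).
# Actions: 0 = stay put, 1 = move to the other alive state, 2 = shut down.
N_STATES, N_ACTIONS, GAMMA = 3, 3, 0.9
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # switch alive states
P[0, 2, 2] = P[1, 2, 2] = 1.0   # shut down
P[2, :, 2] = 1.0                # shutdown absorbs regardless of action

def optimal_policy(P, R, gamma, iters=1000):
    """Value iteration with state-based rewards; returns the greedy policy."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        q = R[:, None] + gamma * (P @ v)   # Q(s, a), shape (S, A)
        v = q.max(axis=1)
    return q.argmax(axis=1)

rng = np.random.default_rng(0)
n_samples, avoids = 20_000, 0
for _ in range(n_samples):
    R = rng.uniform(size=N_STATES)                 # reward function drawn uniformly
    avoids += optimal_policy(P, R, GAMMA)[0] != 2  # optimal action at state 0 isn't shutdown
print(f"fraction of optimal policies that avoid shutdown: {avoids / n_samples:.3f}")
```

With uniform rewards this comes out to about 2/3 in this toy: exactly the draws where the best alive state is more rewarding than the shutdown state. The argument above is that a reasonably competent suboptimal A should also find those better-than-dying policies.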
I’m afraid I’m mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework.
If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it’d be something like:
1. Instrumental convergence isn’t training-time behaviour, it’s test-time behaviour. It isn’t about increasing reward, it’s about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it’s the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g., if it’s just really, really difficult for gradient descent to get a proper planning module working). Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment).
I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment
I was actually going to respond, “that’s a good point, but (IMO) a different concern than the one you initially raised”. I see you making two main critiques.
(paraphrased) “A won’t produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn’t useful” – I replied to this critique above.
“The space of goals that agents might learn is very different from the space of reward functions.” I agree this is an important part of the story. I think the reasonable takeaway is “current theorems on instrumental convergence help us understand what superintelligent A won’t do, assuming no reward-result gap. Since we can’t assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced.”
I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.