What would you say is wrong with the ‘exaggerated’ criticism?
I don’t think you can call the arguments wrong if you also think the Orthogonality Thesis and Instrumental Convergence are real and relevant to AI safety, and as far as I can tell the criticism doesn’t claim that—just that there are other assumptions needed for disaster to be highly likely.
I don’t have an elevator pitch summary of my views yet, and it’s possible that my interpretation of the classic arguments is wrong, I haven’t reread them recently. But here’s an attempt:
--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what’s right, what we intended, etc. and 2. that smarter AI wouldn’t lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn’t. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.
--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true. Later the book goes on to talk about how making it good is more difficult than it sounds. Moreover, Bostrom doesn’t wave his arguments around as though they are proofs; he includes lots of hedge words and maybes. I think we can interpret it as a burden-shifting argument; “Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you’d better have some pretty solid arguments that everything’s going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important).” As far as I know no one has come up with any such arguments, and in fact it’s now the consensus in the field that no one has found such an argument.
Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.
...
Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.
--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what’s right, what we intended, etc. and 2. that smarter AI wouldn’t lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn’t. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.
Speaking from personal experience, those views both felt obvious to me before I came across the Orthogonality Thesis or Instrumental Convergence.
--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true.
It depends on what you mean by ‘special effort’ and ‘default’. The Orthogonality thesis, instrumental convergence, and eventual fast growth together establish that if we increased intelligence while not increasing alignment, a disaster would result. That is what is correct about them. What they don’t establish is how natural it is that we will increase intelligence without increasing alignment to the degree necessary to stave off disaster.
It may be the case that the particular technique for building very powerful AI that is easiest to use is a technique that makes alignment and capability increase together, so you usually get the alignment you need just in the course of trying to make your system more capable.
Depending on how you look at that possibility, you could say that’s an example of the ‘special effort’ being not as difficult as it appeared / likely to be made by default, or that the claim is just wrong and the default outcome is not doom. I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way—there might be a further fact that says why OT and IC don’t apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.
For the reasons you give, the Orthogonality thesis and instrumental convergence do shift the burden of proof to explaining why you wouldn’t get misalignment, especially if progress is fast. But such reasons have been given; see e.g. this from Stuart Russell:
The first reason for optimism [about AI alignment] is that there are strong economic incentives to develop AI systems that defer to humans and gradually align themselves to user preferences and intentions. Such systems will be highly desirable: the range of behaviours they can exhibit is simply far greater than that of machines with fixed, known objectives...
And there are outside-view analogies with other technologies that suggest that by default alignment and capability do tend to covary to quite a large extent. This is a large part of Ben Garfinkel’s argument.
But I do think that some people (maybe not Bostrom, based on the caveats he gave) didn’t realise that they did also need to complete the argument to have a strong expectation of doom—to show that there isn’t an easy alignment technique, required for capability anyway, that we’ll have a strong incentive to use.
“A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.”
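To make the quoted point concrete, here is a minimal toy sketch of my own (not from Russell or from Ben’s talk): the task objective depends only on x, while y stands in for an unconstrained variable we happen to care about, and a brute-force optimizer over a box pushes y to the edge of its range as soon as doing so helps even infinitesimally.

```python
import numpy as np

# Toy illustration of the quoted claim (my own example, not Russell's or Ben's):
# the task reward depends on x, while y is an unconstrained variable that we
# happen to care about (think "water used" for a cleaning robot).

def task_reward(x, y):
    # The objective targets x = 3; y gets a vanishingly small positive weight,
    # standing in for "using more of the unconstrained resource never hurts".
    return -(x - 3.0) ** 2 + 1e-9 * y

xs = np.linspace(-10, 10, 201)
ys = np.linspace(-10, 10, 201)
X, Y = np.meshgrid(xs, ys)

R = task_reward(X, Y)
i, j = np.unravel_index(np.argmax(R), R.shape)
print(f"optimizer chose x = {X[i, j]:.2f}, y = {Y[i, j]:.2f}")
# Prints x = 3.00 and y = 10.00: the solution is fine on the variable the
# objective constrains, and extreme on the one it ignores.
```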
We could see this as marking out a potential danger—a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist ‘weakly suggest’ (Ben’s words) that AGI poses an existential risk since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we’re ‘shooting into the dark’ in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world.
I also think that, especially when you bring Mesa-optimisers or recent evidence into the picture, the evidence we have so far suggests that even though alignment and capability are likely to covary to some degree (a degree higher than e.g. Bostrom expected back before modern ML), the default outcome is still misalignment.
I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way—there might be a further fact that says why OT and IC don’t apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.
I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL.
I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like “optimal farsighted agents tend to preserve their access to terminal states” (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and “optimal agents generally choose paths through the future that afford strictly more options” (Generalizing the Power-Seeking Theorems, Theorem 2).
The theorems aren’t conclusive evidence:
maybe we don’t get AGI through RL
learned policies are not going to be optimal
the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
perhaps the agents inherit different mesa objectives during training
The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers, which are generalizing over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.
But the theorems are evidence that RL leads to catastrophe at optimum, at least. We’re not just talking about “the space of all possible minds and desires” anymore.
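Here is a minimal numerical sketch of the measure-based claim in a toy MDP of my own (not one of the environments from the paper): draw reward functions IID over states, compute an optimal policy for each by value iteration, and count how often that policy walks into the absorbing “off” state rather than into the larger set of surviving terminal options.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny deterministic MDP (my own toy example, not from the paper).
# State 0 is the start; action a from the start moves to terminal state a + 1.
# State 1 is an absorbing "off switch"; states 2-4 are absorbing "stay alive"
# states, i.e. avoiding shutdown keeps strictly more terminal options open.
N_STATES, GAMMA, SWEEPS = 5, 0.9, 200
next_state = np.array([
    [1, 2, 3, 4],   # start: each action commits to one terminal state
    [1, 1, 1, 1],   # off: absorbing
    [2, 2, 2, 2],   # alive terminals: absorbing
    [3, 3, 3, 3],
    [4, 4, 4, 4],
])

def optimal_first_action(reward):
    """Run value iteration, then return the optimal action at the start state."""
    V = np.zeros(N_STATES)
    for _ in range(SWEEPS):
        V = (reward[next_state] + GAMMA * V[next_state]).max(axis=1)
    return int(np.argmax(reward[next_state[0]] + GAMMA * V[next_state[0]]))

samples = 2_000
shut_off = sum(
    optimal_first_action(rng.uniform(size=N_STATES)) == 0  # action 0 -> off state
    for _ in range(samples)
)
print(f"fraction of sampled rewards whose optimal policy shuts itself off: {shut_off / samples:.3f}")
# Comes out near 1/4: for most reward draws the optimal policy heads for the
# larger set of terminal options, the toy analogue of "seeking power".
```

Nothing hangs on the exact 1/4; the point is that the fraction tracks how many terminal options each choice preserves, which is the counting argument in miniature.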
Also, in the linked slides, the following point is made in slide 43:
We know there are many possible AI systems (including “powerful” ones) that are not inclined toward omnicide
Any possible (at least deterministic) policy is uniquely optimal with regard to some utility function. And many possible policies do not involve omnicide.
On its own, this point is weak; reading part of his 80K talk, I do not think it is a key part of his argument. Nonetheless, here’s why I think it’s weak:
I agree that your paper strengthens the IC (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it’s too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that “most” minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.
I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts I have are:
It may be useful to distinguish between “power-seeking behavior” and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior—it’s akin to avoiding “State 2” in your diagram—but it is something that we’d want any good house-cleaning robot to do. It’s only a particular subset of power-seeking behavior that we badly want to avoid (e.g. killing people so they can’t shut you off).
This being said, I imagine that, if we represented the physical universe as an MDP, and defined a reward function over states, and used a sufficiently low discount rate, then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and “power-seeking behavior” (in the context of some particular MDP) does reduce the ominousness of the result to some degree.
Ultimately, for most real-world tasks, I think it’s unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DM “scalable agent alignment” paper, Rohin’s “narrow value learning” sequence, and elsewhere: that, over time, the RL development process will necessarily look less-and-less like “pick a reward function and then let an RL algorithm run until you get a policy that optimizes the reward function sufficiently well.” There’s seemingly just not that much that you can do using hand-written reward functions. I think that these more sophisticated training processes will probably be pretty strongly attracted toward non-omnicidal policies. At a higher level, engineers will also be attracted toward using training processes that produce benign/useful policies. They should have at least some ability to notice or foresee issues with classes of training processes, before any of them are used to produce systems that are willing and able to commit omnicide. Ultimately, in other words, I think it’s reasonable to be optimistic that we’ll do much better than random when producing the policies of advanced AI systems.
I do still think that the ICT is true, though, and I do still think that it matters: it’s (basically) necessary for establishing a high level of misalignment risk. I just don’t think it’s sufficient to establish a high level of risk (and am skeptical of certain other premises that would be sufficient to establish this).
But the theorems are evidence that RL leads to catastrophe at optimum, at least.
RL with a randomly chosen reward leads to catastrophe at optimum.
I proved that optimal policies are generally power-seeking in MDPs.
The proof is for randomly distributed rewards.
Ben’s main critique is that the goals evolve in tandem with capabilities, and goals will be determined by what humans care about. These are specific reasons to deny the conclusion of analysis of random rewards.
(A random Python program will error with near-certainty, yet somehow I still manage to write Python programs that don’t error.)
I do agree that this isn’t enough reason to say “there is no risk”, but it surely is important for determining absolute levels of risk. (See also this comment by Ben.)
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, I was referring to the point you made when I said
“the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence”
Every specific attempt so far has been seemingly unsuccessful
Idk, I could say that every specific attempt made by the safety community to demonstrate risk has been seemingly unsuccessful, therefore systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such.
But I don’t really think I disagree with you here. I think this paper is good, provides support for the point “we should have good reason to believe an AI system is safe, and not assume it by default”, and responds to an in-fact incorrect argument of “but why would any AI want to kill us all, that’s just anthropomorphizing”.
But when someone says “These arguments depend on some concept of a ‘random mind’, but in reality it won’t be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply”, it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
Aside:
I don’t know that it makes me feel much better about future objectives being outer aligned.
I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil’s advocate here. (I am also not convinced of the negation.)
it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
I find myself agreeing with the idea that an agent unaware of its task will seek power, but also conclude that an agent aware of its task will give up power.
I think this is a slight misunderstanding of the theory in the paper. I’d translate the theory of the paper to English as:
If we do not know an agent’s goal, but we know that the agent knows its goal and is optimal w.r.t it, then from our perspective the agent is more likely to go to higher-power states. (From the agent’s perspective, there is no probability, it always executes the deterministic perfect policy for its reward function.)
Any time the paper talks about “distributions” over reward functions, it’s talking from our perspective. The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy. All of the theoretical analysis in the paper is done “before” the reward function is drawn, but there is no step where the agent is doing optimization but doesn’t know its reward.
In your paper, theorem 19 suggests that given a choice between two sets of 1-cycles C1 and C2 the agent is more likely to select the larger set.
I’d rewrite this as:
Theorem 19 suggests that, if an agent that knows its reward is about to choose between C1 and C2, but we don’t know the reward and our prior is that it is uniformly distributed, then we will assign higher probability to the agent going to the larger set.
I do not see how the agent ‘seeks’ out powerful states because, as you say, the agent is fixed.
I do think this is mostly a matter of translation of math to English being hard. Like, when Alex says “optimal agents seek power”, I think you should translate it as “when we don’t know what goal an optimal agent has, we should assign higher probability that it will go to states that have higher power”, even though the agent itself is not thinking “ah, this state is powerful, I’ll go there”.
Great observation. Similarly, a hypothesis called “Maximum Causal Entropy” once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don’t perpetually maximize their potential partners—they actually pick a partner, eventually.
My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.
If there’s a collection of ‘turned-off’ terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state.
To clarify, I don’t assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions.
But, maybe you’re saying “for realistic distributions, the agent won’t get any reward for being shut off and therefore π∗ won’t ever let itself be shut off”. I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient “obedience” rewards, what then? For some level of farsightedness (γ close enough to 1), the agent will still disobey, and simultaneously disobedience gives it more control over the future.
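To spell out the “γ close enough to 1” step with a minimal worked comparison (my own toy numbers, not from the paper): say obeying the shutdown command pays a one-time obedience bonus b and then zero reward forever, while disobeying forgoes the bonus but reaches a state paying r > 0 per step from the next step onward. Then:

```latex
% Hypothetical obedience-vs-shutdown comparison (my own toy setup, not the paper's):
% obeying pays a one-time bonus b then nothing; disobeying reaches a loop paying r > 0 per step.
\[
  V^{\text{obey}} = b,
  \qquad
  V^{\text{disobey}} = \frac{\gamma r}{1-\gamma},
  \qquad\text{so disobeying is optimal whenever}\quad
  \gamma > \frac{b}{b + r}.
\]
```

So for any finite transient bonus there is a discount threshold past which the farsighted optimal policy still refuses shutdown.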
The paper doesn’t draw the causal diagram “Power → instrumental convergence”, it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.
In general, I’d suspect that there are goals we could give the agent that significantly reduce our gain. However, I’d also suspect the opposite.
Yes, right. The point isn’t that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail.
It seems reasonable to argue that we would if we could guarantee r=h.
Yes, if r=h, introduce the agent. You can formalize a kind of “alignment capability” by introducing a joint distribution over the human’s goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we’d implement an agent with goal X’, and so on. You then take our expected optimal value under this distribution and find whether you’re good at alignment, or whether you’re bad and you’ll build agents whose optimal policies tend to obstruct you.
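In symbols, one minimal way to write down that “alignment capability” score (my notation here, which may differ from the preliminary notes):

```latex
% Hypothetical notation (mine, not necessarily the notes'):
% D is a joint distribution over the human goal h and the induced agent goal r.
\[
  \mathrm{AlignmentCapability}(D)
    = \mathbb{E}_{(h,\,r) \sim D}\!\left[ V^{\pi^{*}_{r}}_{h}(s_0) \right].
\]
```

High values mean the goals you induce tend to yield policies that score well under the goals you actually had; low values mean the induced optimal policies tend to obstruct you.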
There might be a way to argue over randomness and say this would double our gain.
The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn’t.
More speculatively, what if |r−h|<ϵ?
If the rewards are ϵ-close in sup-norm, then you can get nice regret bounds, sure.
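For concreteness, the kind of bound I mean is the standard textbook-style one (a sketch, not a result from this paper): if the two reward functions are within ϵ of each other everywhere, then every policy’s value shifts by at most ϵ/(1−γ), so the policy optimized for h loses at most 2ϵ/(1−γ) under r.

```latex
% Standard regret sketch, assuming the rewards satisfy \lVert r - h \rVert_\infty \le \epsilon:
\[
  \big| V^{\pi}_{r}(s) - V^{\pi}_{h}(s) \big| \le \frac{\epsilon}{1-\gamma}
  \;\;\text{for every } \pi, s
  \qquad\Longrightarrow\qquad
  V^{*}_{r}(s) - V^{\pi^{*}_{h}}_{r}(s) \le \frac{2\epsilon}{1-\gamma}.
\]
```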
The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
Great question. One thing you could say is that an action is power-seeking compared to another, if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other.
Power is kinda weird when defined for optimal agents, as you say—when γ=1, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.
My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.
Shortly after Theorem 19, the paper says: “In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7”. In particular, see Figure 19.
The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.
If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn’t in the main part of the paper, but basically you toss out transitions which aren’t part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.
The main idea, though, is that you’re reasoning about what the agent’s end goals tend to be, and then say “it’s going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (ie shutdown)”. Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.
I think I still haven’t clearly communicated all my mental models here, but I figured I’d write a reply now while I update the paper.
Thank you for these comments, by the way. You’re pointing out important underspecifications. :)
My philosophy is that aligned/general is OK based on a shared (?) premise that,
I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
I mostly agree with what you say here—which is why I said the criticisms were exaggerated, not totally wrong—but I do think the classic arguments are still better than you portray them. In particular, I don’t remember coming away from Superintelligence (I read it when it first came out) thinking that we’d have an AI system capable of optimizing any goal and we’d need to figure out what goal to put into it. Instead I thought that we’d be building AI through some sort of iterative process where we look at existing systems, come up with tweaks, build a new and better system, etc. and that if we kept with the default strategy (which is to select for and aim for systems with the most impressive capabilities/intelligence, and not care about their alignment—just look at literally every AI system made in the lab so far! Is AlphaGo trained to be benevolent? Is AlphaStar? Is GPT? Etc.) then probably doom.
It’s true that when people are building systems not for purposes of research, but for purposes of economic application—e.g. Alexa, Google Search, Facebook’s recommendation algorithm—then they seem to put at least some effort into making the systems aligned as well as intelligent. However, history also tells us that not very much effort is put in, by default, and that these systems would totally kill us all if they were smarter. Moreover, usually systems appear in research-land first before they appear in economic-application-land. This is what I remember myself thinking in 2014, and I still think it now. I think the burden of proof has totally not been met; we still don’t have good reason to think the outcome will probably be non-doom in the absence of more AI safety effort.
It’s possible my memory is wrong though. I should reread the relevant passages.
When I wrote that I was mostly taking what Ben Garfinkel said about the ‘classic arguments’ at face value, but I do recall that there used to be a lot of loose talk about putting values into an AGI after building it.
I think we can interpret it as a burden-shifting argument; “Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you’d better have some pretty solid arguments that everything’s going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important).” As far as I know no one has come up with any such arguments, and in fact it’s now the consensus in the field that no one has found such an argument.
I suppose I disagree that at least the orthogonality thesis and instrumental convergence, on their own, shift the burden. The OT basically says: “It is physically possible to build an AI system that would try to kill everyone.” The ICT basically says: “Most possible AI systems within some particular set would try to kill everyone.” If we stop here, then we haven’t gotten very far.
To repurpose an analogy: Suppose that you lived very far back in the past and suspected that people would eventually try to send rockets with astronauts to the moon. It’s true that it’s physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave earth’s atmosphere would also send astronauts aimlessly out into the depths of space. But I don’t think it’d be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don’t want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. If what you’re worried about is not just one single rocket trajectory failure, but systematically more severe trajectory failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes lower still.
Even sillier example: It’s possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people against coronavirus, but we don’t need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This is true even if it turned out that we don’t currently know how to make an effective coronavirus vaccine.
I think the OT and ICT on their own almost definitely aren’t enough to justify an above 1% credence in extinction from AI. To get the rational credence up into (e.g.) the 10%-50% range, I think that stuff like mesa-optimization concerns, discontinuity premises, explanations of how plausible development techniques/processes could go badly wrong, and explanations of how deceptive tendencies in AI systems could go unnoticed still need to do almost all of the work.
(Although a lot depends on how high a credence we’re trying to justify. A 1% credence in human extinction from misaligned AI is more than enough, IMO, to justify a ton of research effort, although it also probably has pretty different prioritization implications than a 50% credence.)
I think the purpose of the OT and ICT is to establish that lots of AI safety needs to be done. I think they are successful in this. Then you come along and give your analogy to other cases (rockets, vaccines) and argue that lots of AI safety will in fact be done, enough that we don’t need to worry about it. I interpret that as an attempt to meet the burden, rather than as an argument that the burden doesn’t need to be met.
But maybe this is a merely verbal dispute now. I do agree that OT and ICT by themselves, without any further premises like “AI safety is hard” and “The people building AI don’t seem to take safety seriously, as evidenced by their public statements and their research allocation” and “we won’t actually get many chances to fail and learn from our mistakes”, do not establish more than, say, 1% credence in “AI will kill us all,” if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.
I do agree that OT and ICT by themselves, without any further premises like “AI safety is hard” and “The people building AI don’t seem to take safety seriously, as evidenced by their public statements and their research allocation” and “we won’t actually get many chances to fail and learn from our mistakes”, do not establish more than, say, 1% credence in “AI will kill us all,” if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.
I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over “and various other premises”). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.
I think that “discontinuity + OT + ICT,” rather than “OT + ICT” alone, has typically been presented as the core of the argument. For example, the extended summary passage from Superintelligence:
An existential risk is one that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development. Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.
First, we discussed how the initial superintelligence might obtain a decisive strategic advantage. This superintelligence would then be in a position to form a singleton and to shape the future of Earth-originating intelligent life. What happens from that point onward would depend on the superintelligence’s motivations.
Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.
Third, the instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources.
Taken together, these three points thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.
There are some loose ends in this reasoning, and we shall be in a better position to evaluate it after we have cleared up several more surrounding issues. In particular, we need to examine more closely whether and how a project developing a superintelligence might either prevent it from obtaining a decisive strategic advantage or shape its final values in such a way that their realization would also involve the realization of a satisfactory range of human values. (Bostrom, p. 115-116)
If we drop the ‘likely discontinuity’ premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. related to mesa-optimization).
I’d also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions (“maximize paperclips,” “make me happy,” etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that superintelligent system wants to kill you, you may be screwed even if you’re quite careful in various regards (e.g. even if you implement “boxing” strategies). At least if we drop the discontinuity premise, though, I don’t think they’re compelling enough to bump us up to a high credence in doom.
Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.
Perhaps what is going on here is that the arguments as stated in brief summaries like ‘orthogonality thesis + instrumental convergence’ just aren’t what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments.
This reminds me of Lakatos’ theory of research programs—where the core assumptions, usually logical or a priori in nature, are used to ‘spin off’ secondary hypotheses that are more empirical or easily falsifiable.
Lakatos’ model fits AI safety rather well—OT and IC are some of these non-empirical ‘hard core’ assumptions that are foundational to the research program. Then in ~2010 there were some secondary assumptions: discontinuous progress, AI maximises a simple utility function, etc. But in ~2020 we have some different secondary assumptions: mesa-optimisers, you get what you measure, direct evidence of current misalignment.
What would you say is wrong with the ‘exaggerated’ criticism?
I don’t think you can call the arguments wrong if you also think the Orthogonality Thesis and Instrumental Convergence are real and relevant to AI safety, and as far as I can tell the criticism doesn’t claim that—just that there are other assumptions needed for disaster to be highly likely.
I don’t have an elevator pitch summary of my views yet, and it’s possible that my interpretation of the classic arguments is wrong, I haven’t reread them recently. But here’s an attempt:
--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what’s right, what we intended, etc. and 2. that smarter AI wouldn’t lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn’t. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.
--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true. Later the book goes on to talk about how making it good is more difficult than it sounds. Moreover, Bostrom doesn’t wave around his arguments about they are proofs; he includes lots of hedge words and maybes. I think we can interpret it as a burden-shifting argument; “Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you’d better have some pretty solid arguments that everything’s going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important).” As far as I know no one has come up with any such arguments, and in fact it’s now the consensus in the field that no one has found such an argument.
Speaking from personal experience, those views both felt obvious to me before I came across Orthogonality Thesis or Instrumental convergence.
It depends on what you mean by ‘special effort’ and ‘default’. The Orthogonality thesis, instrumental convergence, and eventual fast growth together establish that if we increased intelligence while not increasing alignment, a disaster would result. That is what is correct about them. What they don’t establish is how natural it is that we will increase intelligence without increasing alignment to the degree necessary to stave off disaster.
It may be the case that the particular technique for building very powerful AI that is easiest to use is a technique that makes alignment and capability increase together, so you usually get the alignment you need just in the course of trying to make your system more capable.
Depending on how you look at that possibility, you could say that’s an example of the ‘special effort’ being not as difficult as it appeared / likely to be made by default, or that the claim is just wrong and the default outcome is not doom. I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way—there might be a further fact that says why OT and IC don’t apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.
For the reasons you give, the Orthogonality thesis and instrumental convergence do shift the burden of proof to explaining why you wouldn’t get misalignment, especially if progress is fast. But such reasons have been given, see e.g. this from Stuart Russell:
And there are outside-view analogies with other technologies that suggests that by default alignment and capability do tend to covary to quite a large extent. This is a large part of Ben Garfinkel’s argument.
But I do think that some people (maybe not Bostrom, based on the caveats he gave), didn’t realise that they did also need to complete the argument to have a strong expectation of doom—to show that there isn’t an easy, and required alignment technique that we’ll have a strong incentive to use.
From my earlier post:
I also think that, especially when you bring Mesa-optimisers or recent evidence into the picture, the evidence we have so far suggests that even though alignment and capability are likely to covary to some degree (a degree higher than e.g. Bostrom expected back before modern ML), the default outcome is still misalignment.
I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL.
I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like “optimal farsighted agents tend to preserve their access to terminal states” (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and “optimal agents generally choose paths through the future that afford strictly more options” (Generalizing the Power-Seeking Theorems, Theorem 2).
The theorems aren’t conclusive evidence:
maybe we don’t get AGI through RL
learned policies are not going to be optimal
the results don’t prove how hard it is tweak the reward function distribution, to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
perhaps the agents inherit different mesa objectives during training
The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers, which are generalizing over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.
But the theorems are evidence that RL leads to catastrophe at optimum, at least. We’re not just talking about “the space of all possible minds and desires” anymore.
Also
In the linked slides, the following point is made in slide 43:
On its own, this point is weak; reading part of his 80K talk, I do not think it is a key part of his argument. Nonetheless, here’s why I think it’s weak:
I agree that your paper strengthens the IC (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it’s too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that “most” minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.
I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts I have are:
It may be useful to distinguish between “power-seeking behavior” and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior—it’s akin to avoiding “State 2″ in your diagram—but it is something that we’d want any good house-cleaning robot to do. It’s only a particular subset of power-seeking behavior that we badly want to avoid (e.g. killing people so they can’t shut you off.)
This being said, I imagine that, if we represented the physical universe as an MDP, and defined a reward function over states, and used a sufficiently low discount rate, then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and “power-seeking behavior” (in the context of some particular MDP) does reduce the ominousness of the result to some degree.
Ultimately, for most real-world tasks, I think it’s unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DM “scalable agent alignment” paper, Rohin’s “narrow value learning” sequence, and elsewhere: that, over time, the RL development process will necessarily look less-and-less like “pick a reward function and then let an RL algorithm run until you get a policy that optimizes the reward function sufficiently well.” There’s seemingly just not that much that you can do using hand-written reward functions. I think that these more sophisticated training processes will probably be pretty strongly attracted toward non-omnicidal policies. At a higher level, engineers will also be attracted toward using training processes that produce benign/useful policies. They should have at least some ability to notice or foresee issues with classes of training processes, before any of them are used to produce systems that are willing and able to commit omnicide. Ultimately, in other words, I think it’s reasonable to be optimistic that we’ll do much better than random when producing the policies of advanced AI systems.
I do still think that the ICT is true, though, and I do still think that it matters: it’s (basically) necessary for establishing a high level of misalignment risk. I just don’t think it’s sufficient to establish a high level of risk (and am skeptical of certain other premises that would be sufficient to establish this).
RL with a randomly chosen reward leads to catastrophe at optimum.
The proof is for randomly distributed rewards.
Ben’s main critique is that the goals evolve in tandem with capabilities, and goals will be determined by what humans care about. These are specific reasons to deny the conclusion of analysis of random rewards.
(A random Python program will error with near-certainty, yet somehow I still manage to write Python programs that don’t error.)
I do agree that this isn’t enough reason to say “there is no risk”, but it surely is important for determining absolute levels of risk. (See also this comment by Ben.)
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, i was referring to the point you made when i said
“the results don’t prove how hard it is tweak the reward function distribution, to avoid instrumental convergence”
Idk, I could say that every specific attempt made by the safety community to demonstrate risk has been seemingly unsuccessful, therefore systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such.
But I don’t really think I disagree with you here. I think this paper is good, provides support for the point “we should have good reason to believe an AI system is safe, and not assume it by default”, and responds to an in-fact incorrect argument of “but why would any AI want to kill us all, that’s just anthropomorphizing”.
But when someone says “These arguments depend on some concept of a ‘random mind’, but in reality it won’t be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply”, it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
Aside:
I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil’s advocate here. (I am also not convinced of the negation.)
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
I think I agree with most of what you’re arguing.
[Deleted]
I think this is a slight misunderstanding of the theory in the paper. I’d translate the theory of the paper to English as:
Any time the paper talks about “distributions” over reward functions, it’s talking from our perspective. The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy. All of the theoretical analysis in the paper is done “before” the reward function is drawn, but there is no step where the agent is doing optimization but doesn’t know its reward.
I’d rewrite this as:
[Deleted]
I do think this is mostly a matter of translation of math to English being hard. Like, when Alex says “optimal agents seek power”, I think you should translate it as “when we don’t know what goal an optimal agent has, we should assign higher probability that it will go to states that have higher power”, even though the agent itself is not thinking “ah, this state is powerful, I’ll go there”.
Great observation. Similarly, a hypothesis called “Maximum Causal Entropy” once claimed that physical systems involving intelligent actors tended tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don’t perpetually maximize their potential partners—they actually pick a partner, eventually.
My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.
[Deleted]
To clarify, I don’t assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions.
But, maybe you’re saying “for realistic distributions, the agent won’t get any reward for being shut off and therefore π∗ won’t ever let itself be shut off”. I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient “obedience” rewards, what then? For some level of farsightedness (γ close enough to 1), the agent will still disobey, and simultaneously disobedience gives it more control over the future.
The paper doesn’t draw the causal diagram “Power → instrumental convergence”, it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.
Yes, right. The point isn’t that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail.
Yes, if r=h, introduce the agent. You can formalize a kind of “alignment capability” by introducing a joint distribution over the human’s goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we’d implement an agent with goal X’, and so on. You then take our expected optimal value under this distribution and find whether you’re good at alignment, or whether you’re bad and you’ll build agents whose optimal policies tend to obstruct you.
The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn’t.
If the rewards are ϵ-close in sup-norm, then you can get nice regret bounds, sure.
[Deleted]
The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
Great question. One thing you could say is that an action is power-seeking compared to another, if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other.
Power is kinda weird when defined for optimal agents, as you say—when γ=1, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.
Shortly after Theorem 19, the paper says: “In appendix C.6.2, we extend this reasoning to k-cycles (k >1) via theorem 53 and explain how theorem19 correctly handles fig. 7”. In particular, see Figure 19.
The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.
If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn’t in the main part of the paper, but basically you toss out transitions which aren’t part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.
The main idea, though, is that you reason about what the agent’s end goals tend to be, and then say: “it’s going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (i.e. shutdown).” Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.
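The simplest special case of that counting intuition, under strong simplifying assumptions that go well beyond what the theorem itself needs (rewards IID from a continuous distribution, γ → 1, and two disjoint reachable sets of absorbing states which together exhaust the reachable absorbing states: T_large for “stay on” and T_small for “shutdown”):

```latex
\frac{\Pr\bigl[\pi^{*} \text{ ends in } T_{\text{large}}\bigr]}
     {\Pr\bigl[\pi^{*} \text{ ends in } T_{\text{small}}\bigr]}
= \frac{|T_{\text{large}}|}{|T_{\text{small}}|},
```

so the small shutdown set gets correspondingly little of the probability mass.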
I think I still haven’t clearly communicated all my mental models here, but I figured I’d write a reply now while I update the paper.
Thank you for these comments, by the way. You’re pointing out important underspecifications. :)
I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
I mostly agree with what you say here (which is why I said the criticisms were exaggerated, not totally wrong), but I do think the classic arguments are still better than you portray them. In particular, I don’t remember coming away from Superintelligence (I read it when it first came out) thinking that we’d have an AI system capable of optimizing any goal and we’d just need to figure out what goal to put into it. Instead I thought that we’d be building AI through some sort of iterative process: look at existing systems, come up with tweaks, build a new and better system, and so on. And if we kept to the default strategy, which is to select for and aim at the systems with the most impressive capabilities/intelligence without caring about their alignment (just look at literally every AI system made in the lab so far: is AlphaGo trained to be benevolent? Is AlphaStar? Is GPT?), then the probable outcome would be doom.
It’s true that when people are building systems not for purposes of research but for purposes of economic application (e.g. Alexa, Google Search, Facebook’s recommendation algorithm), they seem to put at least some effort into making the systems aligned as well as intelligent. However, history also tells us that not very much effort is put in by default, and that these systems would totally kill us all if they were smarter. Moreover, systems usually appear in research-land before they appear in economic-application-land. This is what I remember myself thinking in 2014, and I still think it now. I think the burden of proof has totally not been met; we still don’t have good reason to think the outcome will probably be non-doom in the absence of more AI safety effort.
It’s possible my memory is wrong though. I should reread the relevant passages.
When I wrote that I was mostly taking what Ben Garfinkel said about the ‘classic arguments’ at face value, but I do recall that there used to be a lot of loose talk about putting values into an AGI after building it.
I suppose I disagree that at least the orthogonality thesis and instrumental convergence, on their own, shift the burden. The OT basically says: “It is physically possible to build an AI system that would try to kill everyone.” The ICT basically says: “Most possible AI systems within some particular set would try to kill everyone.” If we stop here, then we haven’t gotten very far.
To repurpose an analogy: suppose you lived very far back in the past and suspected that people would eventually try to send rockets with astronauts to the moon. It’s true that it’s physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave Earth’s atmosphere would also send astronauts aimlessly out into the depths of space. But I don’t think it’d be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don’t want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. And if you’re worried not about a single trajectory failure but about systematically more severe failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes even lower.
Even sillier example: it’s possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people against coronavirus, but we don’t need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This is true even if it turned out that we don’t currently know how to make an effective coronavirus vaccine.
I think the OT and ICT on their own almost definitely aren’t enough to justify an above-1% credence in extinction from AI. To get the rational credence up into the (e.g.) 10%-50% range, I think that stuff like mesa-optimization concerns, discontinuity premises, explanations of how plausible development techniques/processes could go badly wrong, and explanations of how deceptive tendencies in AI systems could go unnoticed still need to do almost all of the work.
(Although a lot depends on how high a credence we’re trying to justify. A 1% credence in human extinction from misaligned AI is more than enough, IMO, to justify a ton of research effort, although it also probably has pretty different prioritization implications than a 50% credence.)
I think the purpose of the OT and ICT is to establish that lots of AI safety needs to be done. I think they are successful in this. Then you come along and give your analogy to other cases (rockets, vaccines) and argue that lots of AI safety will in fact be done, enough that we don’t need to worry about it. I interpret that as an attempt to meet the burden, rather than as an argument that the burden doesn’t need to be met.
But maybe this is a merely verbal dispute now. I do agree that OT and ICT by themselves, without further premises like “AI safety is hard,” “the people building AI don’t seem to take safety seriously, as evidenced by their public statements and their research allocation,” and “we won’t actually get many chances to fail and learn from our mistakes,” do not establish more than, say, 1% credence in “AI will kill us all,” if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom, right before he published the book, whether he agrees with you about the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.
I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over “and various other premises”). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.
I think that “discontinuity + OT + ICT,” rather than “OT + ICT” alone, has typically been presented as the core of the argument. For example, the extended summary passage from Superintelligence:
If we drop the ‘likely discontinuity’ premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. related to mesa-optimization).
I’d also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions (“maximize paperclips,” “make me happy,” etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that system wants to kill you, you may be screwed even if you’re quite careful in various regards (e.g. even if you implement “boxing” strategies). At least if we drop the discontinuity premise, though, I don’t think they’re compelling enough to bump us up to a high credence in doom.
Perhaps what is going on here is that the arguments as stated in brief summaries like ‘orthogonality thesis + instrumental convergence’ just aren’t what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments.
This reminds me of Lakatos’ theory of research programs—where the core assumptions, usually logical or a priori in nature, are used to ‘spin off’ secondary hypotheses that are more empirical or easily falsifiable.
Lakatos’ model fits AI safety rather well: OT and IC are the non-empirical ‘hard core’ assumptions that are foundational to the research program. Around ~2010 the secondary assumptions included discontinuous progress and “AI maximises a simple utility function”; by ~2020 we have some different secondary assumptions: mesa-optimisers, “you get what you measure,” and direct evidence of current misalignment.