Suppose an AI service realises that it is able to seize many more resources with which to fulfil its bounded utility function. Would it do so? If no, then it’s not rational with respect to that utility function. If yes, then it seems rather unsafe, and I’m not sure how it fits Eric’s criterion of using “bounded resources”.
Yes, it would. The hope is that there do not exist ways to seize and productively use tons of resources within the bound. (To be clear, I’m imagining a bound on time, i.e. finite horizon, as opposed to a bound on the maximum value of the utility function.)
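To make that distinction concrete, a minimal sketch in purely illustrative notation (per-step reward $r_t$, policy $\pi$, and cap $c$ are just labels for this sketch): a time bound means the system maximizes $\mathbb{E}\big[\sum_{t=0}^{T} r_t\big]$ for some fixed, finite horizon $T$, while a value bound means it maximizes $\mathbb{E}\big[\min(U(\text{outcome}), c)\big]$ with no limit on how long it acts. The hope above relies on the first kind of bound: with a short enough horizon $T$, there is hopefully no way to seize and productively use large amounts of resources before the episode ends.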
I agree with Eric’s claim that R&D automation will speed up AI progress. The point of disagreement is more like: when we have AI technology that’s able to do basically all human cognitive tasks (which for want of a better term I’ll call AGI, as an umbrella term to include both CAIS and agent AGI), what will it look like? It’s true that no past technologies have looked like unified agent AGIs—but no past technologies have looked like distributed systems capable of accomplishing all human tasks either. So it seems like the evolution prior is still the most relevant one.
I don’t really know what to say to this beyond “I disagree”; it seems like a case of reference class tennis. I’m not sure how much we disagree—I do agree that we should put weight on the evolution prior.
I think the whole paradigm of RL is an example of a bias towards thinking about agents with goals, and that as those agents become more powerful, it becomes easier to anthropomorphise them (OpenAI Five being one example that it’s hard not to think of as a group of agents with goals).
But there were so many other paradigms that did not look like that.
I would withdraw my objection if, for example, most AI researchers took the prospect of AGI from supervised learning as seriously as AGI from RL.
There are lots of good reasons not to expect AGI from supervised learning, most notably that with supervised learning you are limited to human performance.
I claim that this sense of “in the loop” is irrelevant, because it’s equivalent to the AI doing its own thing while the human holds a finger over the stop button. I.e. the AI will be equivalent to current CEOs, and the humans will be equivalent to current boards of directors.
I’ve lost sight of what original claim we were disagreeing about here. But I’ll note that I do think that we have significant control over current CEOs, relative to what we imagine with “superintelligent AGI optimizing a long-term goal”.
I think of CEOs as basically the most maximiser-like humans.
I agree with this (and the rest of that paragraph) but I’m not sure what point you’re trying to make there. If you’re saying that a CAIS-CEO would be risky, I agree. This seems markedly different from worries that a CAIS-anything would behave like a long-term goal-directed literally-actually-maximizer.
I then mentioned that to build systems which implement arbitrary tasks, you may need to be operating over arbitrarily long time horizons. But probably this also comes down to how decomposable such things are.
Agreed that decomposability is the crux.
People are arguing for a focus on CAIS without (to my mind) compelling arguments for why we won’t have AGI agents eventually, so I don’t think this is a strawman.
Eventually is the key word here. Conditional on AGI agents existing before CAIS, I certainly agree that we should focus on AGI agent safety, which is the claim I thought you were making. Conditional on CAIS existing before AGI agents, I think it’s a reasonable position to say “let’s focus on CAIS, and then coordinate to either prevent AGI agents from existing or to control them from the outside if they will exist”. In particular, approaches like boxing or supervision by a strong overseer become much more likely to work in a world where CAIS already exists.
Also, there is one person working on CAIS and tens to hundreds working on AGI agents (depending on how you count), so arguing for more of a focus on CAIS doesn’t mean that you think that CAIS is the most important scenario.
This depends on having pretty powerful CAIS and very good global coordination, both of which I think of as unlikely (especially given that in a world where CAIS occurs and isn’t very dangerous, people will probably think that AI safety advocates were wrong about there being existential risk). I’m curious how likely you think this is though?
I don’t find it extremely unlikely that we’ll get something along these lines. I don’t know, maybe something like 5%? (Completely made up number; it’s especially meaningless because I don’t have a concrete enough sense of what counts as CAIS and what counts as good global coordination to make a prediction about it.) But I also think that the actions we need to take look very different in different worlds, so most of this is uncertainty over which world we’re in, as opposed to confidence that we’re screwed except in this 5% probability world.
If agent AGIs are 10x as dangerous, and the probability that we eventually build them is more than 10%, then agent AGIs are the bigger threat.
While this is literally true, I have a bunch of problems with the intended implications:
Saying “10x as dangerous” is misleading. If CAIS leads to >10% x-risk, it is impossible for agent AGI to be 10x as dangerous, since that would require a more-than-100% chance of catastrophe (ignoring differences in outcomes like s-risks); the arithmetic is spelled out below. So by saying “10x as dangerous” you’re making an implicit claim of safety for CAIS. If you phrase it in terms of probabilities, “10x as dangerous” seems much less plausible.
The research you do and actions you take in the world where agent AGI comes first are different from those in the world where CAIS comes first. I expect most research to significantly affect one of those two worlds but not both. So the relevant question is the probability of a particular one of those worlds.
I expect our understanding of low-probability / edge-case worlds to be very bad, in which case most research aimed at improving these worlds is much more likely to be misguided and useless. This cuts against arguments of the form “We should focus on X even though it is unlikely or hard to understand because if it happens then it would be really bad/dangerous.” Yes, you can apply this to AI safety in general, and yes, I do think that a majority of AI safety research will turn out to be useless, primarily because of this argument.
This is an argument only about importance. As I mentioned above, CAIS is much more neglected, and plausibly is more tractable.
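To spell out the arithmetic behind the quoted claim and the first objection (treating “dangerous” as a probability of existential catastrophe, which is an illustrative simplification): the claim is that if $d_{\text{agent}} = 10\, d_{\text{CAIS}}$ and $P(\text{agent AGI is built}) > 0.1$, then $P(\text{agent AGI is built}) \cdot d_{\text{agent}} > 0.1 \times 10\, d_{\text{CAIS}} = d_{\text{CAIS}}$, so agent AGI is the larger expected threat. The objection is that if $d_{\text{CAIS}} > 0.1$, then $d_{\text{agent}} = 10\, d_{\text{CAIS}} > 1$, which is not a possible probability.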
Because they have long-term convergent instrumental goals, and CAIS doesn’t. CAIS only “cares” about self-improvement to the extent that humans are instructing it to do so, but humans are cautious and slow.
Agreed, though I don’t think this is a huge effect. We aren’t cautious and slow about our current AI development, because we’re confident it isn’t dangerous; the same could happen in CAIS with basic AI building blocks. But good point, I agree this pushes me towards thinking that AGI agents will self-improve faster.
Also because even if building AGI out of task-specific strongly-constrained modules is faster at first, it seems unlikely that it’s anywhere near the optimal architecture for self-improvement.
Idk, that seems plausible to me. I don’t see strong arguments in either direction.
It’s something like “the first half of CAIS comes true, but the services never get good enough to actually be comprehensive/general. Meanwhile fundamental research on agent AGI occurs roughly in parallel, and eventually overtakes CAIS.” As a vague picture, imagine a world in which we’ve applied powerful supervised learning to all industries, and applied RL to all tasks which are either as constrained and well-defined as games, or as cognitively easy as most physical labour, but still don’t have AI which can independently do the most complex cognitive tasks (Turing tests, fundamental research, etc.).
I agree that seems like a good model. It doesn’t seem clearly superior to CAIS though.