My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditional on high capability. It’s a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But conditioning on high capability wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
Distinguish two notions of “goal-directedness”:
(1) The system has a fixed goal that it capably works towards across all contexts.
(2) The system is able to capably work towards goals, but which goals it works towards, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
My sense is that a high level of capability implies (2) but not (1).
Sure, kinda. But (2) is an unstable state. There’s at least some pressure toward (1) both during training and during online activity. This makes (1) very likely eventually, although it’s less clear exactly when.
A human who gets distracted and pursues ice cream whenever they see ice cream is less competent at other things, and will notice this and attempt to correct it within themselves if possible. A person who doesn’t pick up free money on Tuesdays because Tuesday is I-don’t-care-about-money-day will be annoyed about this on Wednesday, and will attempt to correct it in future.
Competent research requires at least some long-term goals. These will provide an incentive for any context-dependent goals to combine or be removed. (Although the strength of this incentive of course differs between cases of inconsistency, and the difficulty of removing an inconsistency is unclear to me; it seems to depend a lot on the specifics.)
And that (1) is way more obviously dangerous
This seems true to me overall, but only because (1) is more capable of competently pursuing long-term plans. Since we’re conditioning on that capability anyway, I would expect everything on the spectrum between (1) and (2) to be potentially dangerous.
If we all die because an AI put super-human amounts of optimization pressure into some goal incompatible with human survival (i.e., almost any goal, if the optimization pressure is high enough), it does not matter whether the AI would have had some other goal in some other context.
But having superhuman capabilities doesn’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.
The whole approach is pretty hopeless IMHO: I mean the approach of “well, the AI will be wicked smart, but we’ll just make it so that it doesn’t want anything particularly badly or so that what it wants tomorrow will be different from what it wants today”.
It seems fairly certain to me that having a superhuman ability to do things that humans want to be done entails applying strong optimization pressure onto reality—pressure that persists as long as the AI is able to make it persist—forever, ideally, from the point of view of the AI. The two are not separate things like you hope they are. Either the AI is wicked good at steering reality towards a goal or not. If it is wicked good, then either its goal is compatible with continued human survival or not, and if not, we are all dead. If it is not wicked good at steering reality, then no one is going to be able to figure out how to use it to align an AI such that it stays aligned once it is much smarter than us.
I subscribe to MIRI’s current position that most of the hope for continued human survival comes from the (slim) hope that no one builds super-humanly smart AI until there are AI researchers that are significantly smarter and wiser than the current generation of AI designers (which will probably take centuries unless it proves much easier to employ technology to improve human cognition than most people think it is).
But what hope I have for alignment research done by currently-living people comes mostly from the hope that someone will figure out how to make an ASI that genuinely wants the same things that we want—like Eliezer has been saying since 2006 or so.
An entity could have the ability to apply such strong optimization pressures onto reality, yet decide not to.
Such an entity would be useless to us IMHO.
Surely there exists a non-useless and non-world-destroying amount of optimization pressure?
By “non-world-destroying”, I assume you mean, “non-humanity ending”.
Well, yeah, if there were a way to keep AI models to roughly human capabilities, that would be great, because they would be unlikely to end humanity and because we could use them to do useful work at less expense (particularly, less energy expense and lower CO2 emissions) than employing people.
But do you know of a safe way of making sure that, e.g., OpenAI’s next major training run will result in a model that is at most roughly human-level in every capability that can be used to end humanity, or to put and keep humanity in a situation that humanity would not want? I sure don’t—even if OpenAI were completely honest and cooperative with us.
The qualifier “safe” is present in the previous paragraph because giving the model access to the internet (or to gullible people, or to a compute farm where it can run any program it wants) and then seeing what happens is only safe if we assume the thing to be proved, namely, that the model is not capable enough to impose its will on humanity.
But yeah, it is a source of hope (which I didn’t mention when I wrote, “what hope I have . . . comes mostly from the hope that someone will figure out how to make an ASI that genuinely wants the same things that we want”) that someone will develop a method to keep AI capabilities to roughly human level, that all labs actually use the method, and that they focus on making the human-level AIs more efficient in resource consumption even during a great-powers war or an arms race between great powers.
I’d be more hopeful if I had ever seen a paper or a blog post by a researcher trying to devise such a method.
For completeness’s sake, let’s also point out that we could ban large training runs worldwide now; the labs could then concentrate on running the models they already have more efficiently. That would be safe (not completely safe, but much, much safer than any future timeline we can realistically hope for) and would allow us to derive some of the benefits of the technology.
I do not know of such a way. I find it unlikely that OpenAI’s next training run will result in a model that could end humanity, but I can provide no guarantees about that.
You seem to be assuming that all models above a certain threshold of capabilities will either exercise strong optimization pressure on the world in pursuit of goals, or will be useless. Put another way, you seem to be conflating capabilities with actually exerted world-optimization pressures.
While I agree that, given a wide enough deployment, it is likely that a given model will end up exercising its capabilities pretty much to their fullest extent, I hold that it is in principle possible to construct a mind that desires to help and is able to do so, yet also deliberately refrains from applying too much pressure.