If we’re assuming these are “transformatively-powerful” (as assumed in Buck’s comment), then by definition these models could be transformatively useful for accelerating alignment/control.
(E.g. they are capable of 30xing the rate of quality-weighted research output given 1⁄4 of the compute. Of course, by assumption they are not fully aligned, so even if these AIs are “capable” of this, this may not go well...)
I agree it seems likely that such models won’t be able to literally take over the world, but effective usage of these models could be extremely important; these models could be indirectly dangerous via the importance of the work we need them to do; and they might end up with substantial de facto power within AI labs (which could be a huge amount of de facto power overall). E.g., if these models sabotaged all of the work they do or ended up in a position of power within the AI lab (able to heavily influence the next training run), that would be catastrophically bad.
When I say reducing the risk associated with deploying early transformative models, I include things like:
An AI escapes, partners with a concerning regime (e.g. Saudi Arabia), advances capabilities for this regime (while other AIs are sabotaging capabilities), and then ends up in a position of considerable power.
The AI carefully sabotages a bunch of safety-critical work.
The AI heavily advances capabilities while slowing safety work via deliberate sandbagging.
The AI ends up in a position of substantial power and influence within the AI lab via a combination of persuasion and hacking.
Perhaps I should use language that is clearer about including both sandbagging in some domains and directly acquiring power.
I don’t expect it to involve a lot of alignment or further control progress
Because these models are misaligned? Or do you reject the premise that a bunch of smart AIs which actually wanted to advance progress could?
Hmm, I feel like we are ending up pushing up against the edges of what we mean by “transformatively powerful” models here. Like, under the classical definition of “transformative AI” we are talking about AI that is as big of a deal as something like agriculture or electricity, which, IDK, seems plausibly true about present systems (my guess is even if we kept present levels of AI, over the course of 100+ years, humanity would leverage those systems in a quite transformative way).
I am not sure what you mean by “30xing the rate of quality-weighted research output given 1⁄4 of the compute”. Is this compared to present systems? My sense is present systems contribute very close to 0 to quality-weighted research output in alignment, so 30 * 0 is still pretty much 0, but my guess is I am misunderstanding you here and you are referencing a definition of transformative that’s different from what I am thinking about.
I agree that at higher capability levels these kinds of escape are a lot scarier, though as I said, it’s not super clear to me how attempts to prevent dumber models from escaping generalize to attempts to prevent smarter models from escaping, and so I see these things as relatively different types of problems.
I don’t expect it to involve a lot of alignment or further control progress
Because these models are misaligned? Or do you reject the premise that a bunch of smart AIs which actually wanted to advance progress could?
I mostly don’t have any great ideas how to use these systems for alignment or control progress, so it’s a bit more of an inside-view guess. I expect them to not be very motivated to help, am skeptical about capability elicitation working well against scheming models (and working well in general in data-sparse domains), and expect it to be much easier to convince relevant stakeholders that some piece of fake progress is real, which provides a much easier alternative task than actually making progress.
I am not sure what you mean by “30xing the rate of quality-weighted research output given 1⁄4 of the compute”. Is this compared to present systems?
I mean 30xing the rate of all current quality-weighted safety research output (including the output of humans, which is basically all of such output at the moment).
I usually define transformative AI against this sort of benchmark.
I mostly don’t have any great ideas how to use these systems for alignment or control progress
FWIW, I feel like I do have decent ideas for how to use these systems for alignment progress in ways that are plausibly scalable to much more powerful systems.
And also ideas for using these systems in a variety of other ways which help a bit, e.g. advancing the current control measures applied to these systems.
I’m also maybe somewhat more optimistic than you about pausing the development of AI more advanced than these already very powerful systems (for, e.g., 10 years). Especially if there is clear evidence of serious misalignment in such systems.
I’m also maybe somewhat more optimistic than you about pausing the development of AI more advanced than these already very powerful systems. Especially if there is clear evidence of serious misalignment in such systems.
Ah, to be clear, in as much as I do have hope, it does route through this kind of pause. I am generally pessimistic about that happening, but it is where a lot of my effort goes these days.
And then in those worlds, I do agree that a lot of progress will probably be made with substantial assistance from these early systems. I do expect it to take a good while until we figure out how to do that, and so I don’t see much hope for that kind of work happening in worlds where humanity doesn’t substantially pause or slow down cutting-edge system development.