Suppose, hypothetically, I had a way to make neural networks recognize OOD inputs. (Like I get back 40% dog, 10% cat, 20% OOD, 5% teapot...) Should I run a big, flashy ImageNet demo (so I personally know whether the idea scales up) and then tell no one?
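For concreteness, a minimal sketch of the output format being described; the question leaves the actual detection method unspecified, so the extra "OOD" slot and the logit values below are purely illustrative:

```python
import numpy as np

# Hypothetical output head: a softmax over the known classes plus one reserved
# "OOD" slot. How the network would actually learn to route probability mass
# into that slot is exactly the part the question leaves open.
CLASSES = ["dog", "cat", "teapot", "OOD"]

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.0, 0.6, -0.1, 1.3])   # made-up logits for one input
for name, p in zip(CLASSES, softmax(logits)):
    print(f"{name}: {p:.0%}")              # one probability per class, summing to ~100%
```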
There was reasoning that went: any research that has a better alignment/capabilities ratio than the average of all research currently happening is good. A lot of research is pure capabilities, like hardware research, so almost anything with any alignment in it is good. I'm not quite sure whether this is a good rule of thumb.
I think I basically don’t buy the “just increase the alignment/capabilities ratio” model, at least on its own. It just isn’t a sufficient condition to not die.
It does feel like there’s a better version of that model waiting to be found, but I’m not sure what it is.
the “just increase the alignment/capabilities ratio” model
If Donald is talking about the reasoning from my post, the primary argument there went a bit differently. It was about expanding the AI Safety field by converting extant capabilities researchers/projects, and arguing that even if we can't make them stop capability research, any intervention that 1) doesn't speed it up and 2) makes them output alignment results alongside capabilities results is net positive.
I think I’d also argued that the AI Safety field is tiny at the moment, so we won’t contribute much to capability research even if we deliberately tried; but in retrospect, that argument is obviously invalid in hypotheticals where we’re actually effective at solving alignment.
Model 1. Your new paper produces c units of capabilities and a units of alignment. When C units of capabilities are reached, an AI is produced, and it is aligned iff A units of alignment have been produced. The rest of the world produces, and will continue to produce, alignment and capabilities research in ratio R. You are highly uncertain about A and/or C, but have a good guess at a, c, R.
In this model, if A/C << R we are screwed whatever you do, and if A/C >> R we win whatever you do. Your paper makes a difference in those worlds where A/C ≈ R, and in those worlds it helps iff a/c > R.
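To check that this is what Model 1 actually implies, here is a quick simulation of my own (not from the thread). The distributions over the unknown thresholds A and C are arbitrary, picked only so that marginal worlds are common:

```python
import random

random.seed(0)

def aligned(A, C, R, a=0.0, c=0.0):
    # Model 1: the AI appears once total capabilities reach C. By then the rest
    # of the world has supplied C - c capabilities at alignment ratio R, plus
    # your paper's a; the AI is aligned iff that total meets the threshold A.
    return R * (C - c) + a >= A

def effect_of_publishing(R, a, c, n=100_000):
    # Compare "publish" vs "withhold" in the same sampled worlds.
    helped = hurt = 0
    for _ in range(n):
        C = random.uniform(50, 150)        # unknown capabilities threshold
        A = random.uniform(0, 2 * R * C)   # unknown alignment threshold
        before, after = aligned(A, C, R), aligned(A, C, R, a, c)
        helped += (not before) and after
        hurt += before and (not after)
    return helped, hurt

R = 0.10  # rest-of-world alignment/capabilities ratio
for a, c in [(5.0, 10.0), (0.2, 10.0)]:    # a/c = 0.5 > R, then a/c = 0.02 < R
    helped, hurt = effect_of_publishing(R, a, c)
    print(f"a/c = {a/c:.2f} vs R = {R:.2f}: helped in {helped} worlds, hurt in {hurt}")
```

With a/c above R the paper only ever flips worlds from unaligned to aligned; with a/c below R it only ever flips them the other way, matching the claim that it helps iff a/c > R.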
This model treats alignment and capabilities as continuous, fungible quantities that slowly accumulate, which is a dubious assumption. It also assumes that, conditional on us being in the marginal world (the world where good and bad outcomes are both very close), your mainline probability involves research continuing at its current ratio.
If, for example, you were extremely pessimistic and thought that the only way we have any chance is if a portal to Dath ilan opens up, then the goal is largely to hold off all research for as long as possible, to maximize the window in which a deus ex machina can happen. Other goals might include publishing the sort of research most likely to encourage a massive global “take AI seriously” movement.
So, the main takeaway is that we need some notion of fungibility/additivity of research progress (for both alignment and capabilities) in order for the “ratio” model to make sense.
Some places fungibility/additivity could come from:
research reducing time-until-threshold-is-reached additively and approximately independently of other research (see the sketch just after this list)
probabilistic independence in general
a set of rate-limiting constraints on capabilities/alignment strategies, such that each one must be solved independently of the others (i.e. solving each one does not help solve the others very much)
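As a rough formalization of the first bullet (my own sketch, not something stated in the thread): if each paper contributes a fixed, approximately independent increment toward a threshold, then contributions simply add and only the sums matter, which is exactly the fungibility the ratio model needs.

```latex
% Additivity assumption (illustrative): paper i shaves an increment delta_i off the
% time until a given threshold (capabilities or alignment) is reached, independent
% of which other papers exist:
\[
  T_{\text{threshold}} \;=\; T_0 \;-\; \sum_i \delta_i .
\]
% Only the sum matters, so contributions are fungible and a paper is summarized
% by its pair of increments (one per threshold) and their ratio.
```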
Fungibility is necessary, but not sufficient, for the “if your work has a better ratio than average research, publish” rule. You also need your uncertainty to be in the right place.
If you were certain of R, and uncertain what A/C future research might have, you get a different rule: publish if a/c > R.
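To unpack “uncertainty in the right place”, here is my own working of Model 1’s algebra under two placements of the uncertainty (my elaboration, not from the thread; it assumes all quantities are positive and c < C):

```latex
% Case 1: R known, thresholds A and C uncertain (Model 1 as stated).
% Publishing flips a loss into a win exactly when
\[
  RC \;<\; A \;\le\; R(C - c) + a ,
\]
% which is a non-empty condition only if a - Rc > 0, i.e. a/c > R
% (and it can only flip a win into a loss if a/c < R).

% Case 2: A and C known, future ratio R uncertain.
% Without the paper the world wins iff R >= A/C; with it, iff R >= (A - a)/(C - c).
% The paper enlarges the set of winning values of R exactly when
\[
  \frac{A - a}{\,C - c\,} \;<\; \frac{A}{C}
  \quad\Longleftrightarrow\quad
  \frac{a}{c} \;>\; \frac{A}{C} .
\]
```

So which comparison point the rule lands on, R or A/C, depends on which quantities you are uncertain about.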
I think “alignment/capabilities > 1” is a closer heuristic than “alignment/capabilities > average”, in the sense of ‘[fraction of remaining alignment this solves] / [fraction of remaining capabilities this solves]’. That’s a sufficient condition if all research satisfies it, though not in real life, e.g. given that pure capabilities research also exists; but I think it’s still a necessary condition for something to be net helpful.
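In Model 1’s notation (my own translation, reading A and C here as the remaining alignment and capabilities work), that heuristic comes out as:

```latex
\[
  \frac{a/A}{\,c/C\,} \;>\; 1
  \quad\Longleftrightarrow\quad
  \frac{a}{c} \;>\; \frac{A}{C} ,
\]
% i.e. your paper's ratio has to beat the ratio of remaining alignment work to
% remaining capabilities work, rather than merely beating the ratio R of what
% everyone else happens to be producing.
```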
It feels like what’s missing is more like… gears of how to compare “alignment” to “capabilities” applications for a particular piece of research. Like, what’s the thing I should actually be imagining when thinking about that “ratio”?