A big part of how I’m thinking about this is a closely related version of the question:
If you keep scaling up networks with pretraining, light RLHF, and various differentially transparent scaffolding, which comes first: AI that can obsolete AI safety researchers, or scheming?
One can also focus on various types of scheming and when / in which order they’d happen. E.g. I’d be much more worried about scheming in one forward pass than about scheming in CoT (which seems more manageable using e.g. control methods), but I also expect scheming in one forward pass to happen later, where ‘later’ can be operationalized in terms of requiring more effective compute.
I think The Direct Approach could provide a (rough) upper bound on the effective compute required to obsolete AI safety researchers, though I’d prefer more substantial empirical evidence based on e.g. something like scaling laws on automated AI safety R&D evals.
Similarly, one can imagine evals for the capabilities which would be prerequisites for various types of scheming, and doing scaling laws on those; e.g. for out-of-context reasoning, where multi-hop out-of-context reasoning seems necessary for instrumental deceptive reasoning in one forward pass (a prerequisite for scheming in one forward pass).
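As a rough illustration of what ‘scaling laws on those evals’ could look like, here’s a minimal sketch: fit eval score vs. log effective compute for an automated-AI-safety-R&D eval and for a scheming-prerequisite eval (e.g. multi-hop out-of-context reasoning), then extrapolate to see which crosses its threshold at lower effective compute. All numbers, eval names, and thresholds below are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_compute, midpoint, slope):
    """Logistic curve in log10(effective compute)."""
    return 1.0 / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Hypothetical measurements: log10(effective compute) -> eval score in [0, 1].
log_c = np.array([22.0, 23.0, 24.0, 25.0, 26.0])
safety_rnd_scores = np.array([0.02, 0.05, 0.12, 0.30, 0.55])      # made up
ooc_reasoning_scores = np.array([0.01, 0.02, 0.04, 0.10, 0.20])   # made up

def compute_at_threshold(scores, threshold):
    """Fit the logistic and invert it to find where `threshold` is crossed."""
    (midpoint, slope), _ = curve_fit(sigmoid, log_c, scores, p0=[27.0, 1.0])
    return midpoint - np.log(1.0 / threshold - 1.0) / slope

print("log10 effective compute to 'obsoletes AI safety researchers' threshold:",
      compute_at_threshold(safety_rnd_scores, threshold=0.9))
print("log10 effective compute to scheming-prerequisite threshold:",
      compute_at_threshold(ooc_reasoning_scores, threshold=0.9))
```

Whichever curve is projected to cross its threshold first is (very roughly) the answer to the ‘which comes first’ question above, with all the usual caveats about extrapolating scaling fits.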
It seems like the question you’re asking is close to (2) in my above decomposition. Aren’t you worried that, long before human-obsoleting AI (or AI-safety-researcher-obsoleting AI), these architectures will be very uncompetitive and thus won’t be viable given realistic delay budgets?
Or at least it seems like it might initially be non-viable. I have some hope for a plan like:
1. Control early transformative AI.
2. Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run; see the sketch below.)
3. Use those next AIs to do something.
(This plan has the downside that it probably requires doing a bunch of general-purpose capabilities work, which might make the situation much more unstable and volatile due to a huge compute overhang if fully scaled up in the most performant (but unsafe) way.)
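A very rough sketch of what the ‘crazy bureaucracy’ of weak AIs in step 2 could look like (purely illustrative; the decomposition scheme and the `weak_model` helper are assumptions, not anything specified in the plan): each unit only ever handles a small, bounded task and either answers directly or delegates subtasks to other units, so no single forward pass holds the whole problem.

```python
from typing import Callable, List

def bureaucracy_solve(task: str,
                      weak_model: Callable[[str], str],
                      depth: int = 0,
                      max_depth: int = 3) -> str:
    """Recursively decompose `task` across many cheap, weak model calls."""
    if depth >= max_depth:
        # At the maximum delegation depth, just answer as best as possible.
        return weak_model(f"Answer concisely: {task}")

    plan = weak_model("Split this task into 2-4 smaller subtasks, one per line, "
                      f"or reply DIRECT if it is simple: {task}")
    if plan.strip() == "DIRECT":
        return weak_model(f"Answer concisely: {task}")

    subtasks: List[str] = [line.strip() for line in plan.splitlines() if line.strip()]
    sub_answers = [bureaucracy_solve(sub, weak_model, depth + 1, max_depth)
                   for sub in subtasks]

    # A final weak call aggregates the subtask results; no single call ever sees
    # the whole problem plus all of the intermediate reasoning.
    summary = "\n".join(f"- {sub}: {ans}" for sub, ans in zip(subtasks, sub_answers))
    return weak_model(f"Combine these subtask results into one answer for '{task}':\n{summary}")
```

The point of the sketch is just that the capability lives in the cheap, inspectable call structure rather than in any one forward pass.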
> It seems like the question you’re asking is close to (2) in my above decomposition.
Yup.
> Aren’t you worried that, long before human-obsoleting AI (or AI-safety-researcher-obsoleting AI), these architectures will be very uncompetitive and thus won’t be viable given realistic delay budgets?
Quite uncertain about all this, but I have short timelines and expect that likely not many more OOMs of effective compute will be needed to get to e.g. something which can 30x AI safety research (as long as we really try). I expect shorter timelines / OOM ‘gaps’ to come along with e.g. fewer architectural changes, all else equal. There are also broader reasons why I think it’s quite plausible the high-level considerations might not change much even given some architectural changes, discussed in the weak-forward-pass comment (e.g. ‘the parallelism tradeoff’).
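For a sense of what an OOM ‘gap’ means for timelines, a toy calculation (the growth rates are illustrative assumptions, not the numbers behind the short-timelines claim): if effective compute grows at roughly 0.5-1 OOM/year from hardware plus algorithmic progress, an N-OOM gap translates to roughly N to 2N years.

```python
# Toy OOM-to-years arithmetic; growth rates are illustrative assumptions.
for ooms_needed in (1, 2, 3):
    for ooms_per_year in (0.5, 1.0):
        years = ooms_needed / ooms_per_year
        print(f"{ooms_needed} OOM gap at {ooms_per_year} OOM/yr -> ~{years:.0f} years")
```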
> I have some hope for a plan like:
>
> 1. Control early transformative AI.
> 2. Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
> 3. Use those next AIs to do something.
>
> (This plan has the downside that it probably requires doing a bunch of general-purpose capabilities work, which might make the situation much more unstable and volatile due to a huge compute overhang if fully scaled up in the most performant (but unsafe) way.)
Sounds pretty good to me. I guess the crux (as hinted at during some personal conversations too) might be that I’m just much more optimistic about this being feasible without huge capabilities pushes (again, some arguments are in the weak-forward-pass comment, e.g. about CoT distillation seeming to work decently, which helps with having more of a fixed level of capabilities come from scaffolding and less from one forward pass; or about CoT length / inference complexity tradeoffs).
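For concreteness on the CoT distillation point, here is a minimal sketch of one common variant (the helper names are hypothetical, and this illustrates the general technique rather than the specific setup in the weak-forward-pass comment): a stronger or more expensive configuration, i.e. a model plus scaffolding producing a long chain of thought, acts as a teacher, and its outputs become fine-tuning targets for a cheaper student, so a fixed level of capability can be reached with shorter, cheaper reasoning traces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DistillExample:
    prompt: str
    target: str  # compressed rationale + answer, or answer only

def compress_rationale(cot: str, max_chars: int = 200) -> str:
    # Placeholder "compression" of the long chain of thought; in practice this
    # might be another model call or a learned summarizer.
    return cot[:max_chars]

def build_cot_distillation_set(
    prompts: List[str],
    teacher_generate: Callable[[str], Tuple[str, str]],  # hypothetical: prompt -> (long CoT, answer)
    keep_rationale: bool = True,
) -> List[DistillExample]:
    """Turn a teacher's (long CoT, answer) pairs into targets for a cheaper student."""
    examples = []
    for prompt in prompts:
        cot, answer = teacher_generate(prompt)
        target = f"{compress_rationale(cot)}\n{answer}" if keep_rationale else answer
        examples.append(DistillExample(prompt=prompt, target=target))
    return examples

# finetune_student(student_model, build_cot_distillation_set(...))  # hypothetical training step
```

How much this favors the weak-forward-pass direction depends on what exactly gets distilled (shorter legible CoT vs. opaque forward-pass competence), which is part of the tradeoff being discussed.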