My general answer would be:
AI Control[1][2]
There are some training-time interventions which might reduce the chance of alignment faking by a moderate factor (maybe 1.5-4x) at some cost: training AIs to seem very corrigible and as myopic as possible, reducing situational awareness (at least until the end of training), improving oversight (and reducing reward hacking), better understanding AI character/personality in general and trying to shape it, and adjusting the architecture to allow less opaque reasoning (and avoiding opaque recurrence).
There are some more speculative and ambitious hopes:
Train huge numbers of diverse model organisms, develop white-box techniques that work for these model organisms, and hope those techniques transfer.
Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g. CoT).
Better approaches to white-box transparency, such as top-down tools for understanding how LLMs do goal-directed reasoning internally.
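To make the "white-box techniques on model organisms" item slightly more concrete, here is a minimal, purely illustrative sketch of about the simplest such tool: a linear probe trained on cached activations. The data, shapes, layer choice, and labels are all placeholder assumptions, not anything specified above.

```python
# Illustrative sketch: a linear probe as one simple "white-box" tool.
# Assumes you have cached hidden-state activations from model organisms,
# labeled by whether the organism was alignment-faking in that episode.
# Shapes and data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for real cached activations: (n_episodes, hidden_dim),
# e.g. the residual stream at a fixed layer and token position.
activations = rng.normal(size=(2000, 1024))
labels = rng.integers(0, 2, size=2000)  # 1 = alignment-faking episode (by construction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# A linear probe: a direction in activation space that predicts faking.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```

If a probe trained on one family of model organisms kept working on held-out families (and eventually on natural cases), that would be weak evidence for the hoped-for transfer.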
Buck beat me to the punch here.
To be clear, I don't think control is likely to suffice for eliminating risk, particularly for models capable of obsoleting top human research scientists. However, with substantial effort (e.g. 20-100% additional runtime cost and 50-500 effective person-years of implementation effort), reducing risk by something like 5-10x seems pretty plausible. (This was edited in.)
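For intuition on where the additional runtime cost in a control setup comes from, here is a purely illustrative sketch of a trusted-monitoring loop; the class names, methods, and threshold are hypothetical placeholders rather than anything from the comment above.

```python
# Illustrative sketch of a trusted-monitoring control loop.
# The models and audit step are stubs; names and numbers are assumptions.
import random

SUSPICION_THRESHOLD = 0.9  # escalate only the most suspicious actions


class UntrustedModel:
    """Stand-in for the powerful, possibly alignment-faking model."""
    def generate_action(self, task: str) -> str:
        return f"proposed action for: {task}"


class TrustedMonitor:
    """Stand-in for a weaker, trusted model scoring suspicion in [0, 1]."""
    def score(self, task: str, action: str) -> float:
        return random.random()  # placeholder for a real monitor call


def run_with_control(task: str, policy: UntrustedModel, monitor: TrustedMonitor) -> str:
    action = policy.generate_action(task)
    # Running the monitor on every action is where much of the extra
    # runtime cost comes from.
    if monitor.score(task, action) > SUSPICION_THRESHOLD:
        return f"[escalated to human audit] {action}"
    return f"[executed] {action}"


if __name__ == "__main__":
    print(run_with_control("refactor deployment script", UntrustedModel(), TrustedMonitor()))
```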
Thank you! That pretty much completely answers my question.