I want descriptive theories of intelligent systems to answer questions of the following form.
Consider:
Goal misgeneralisation, specification gaming/reward hacking, deceptive alignment, gradient hacking/gradient filtering, and other alignment failure modes
“Goals”/values[2], world models, interfaces, and other aspects of a system’s type signature
Coherence/goal-directedness, value type[3], reflectivity/embeddedness, and other general alignment properties
Corrigibility/deference, myopia/time horizon, impact regularisation, mild optimisation, and other general safety properties
And for each of the above clusters, I want to ask the following questions:
How likely are they to emerge by default?
That is, without training processes that actively incentivise or otherwise select for them
Which properties/features are “natural”?
Which properties/features are “anti-natural”?
If they do emerge, in what form will they manifest?
To what degree is the property/feature exhibited in particular systems?
Are they selected for by conventional ML training processes?
What kind of training processes select for them?
What kind of training processes select against them?
How does selection for/against these properties trade off against performance, “capabilities”, cost, and <other metrics we care about>?
I think that answers to these questions would go a long way towards deconfusing us and refining our thinking around:
The magnitude of risk we face with particular paradigms/approaches
The most probable failure modes
And how to mitigate them
The likelihood of alignment by default
Alignment taxes for particular safety properties (and safety in general)