Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:
1. Ability to be deceptively aligned
2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
3. Incentives to break containment exist in a way that is accessible / understandable to the model
4. Ability to break containment
5. Ability to robustly understand human intent
6. Situational awareness
7. Coherence / robustly pursuing its goal in a diverse set of circumstances
8. Interpretability methods break (or other oversight methods break)
   - doesn’t have to be because of deceptiveness; maybe thoughts are just too complicated at some point, or in a different place than you’d expect
9. Capable enough to help us exit the acute risk period
Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.
Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of ‘full alignment’ (CEV-style), any alignment method (e.g. corrigibility) only works within a specific range of capabilities:
Too much capability breaks alignment, e.g. because a model self-reflects and sees all the ways in which its objectives conflict with human goals.
Too little capability (or too little ‘coherence’) and any alignment method will be non-robust to OOD inputs, or even to small improvements in capability or self-reflectiveness.
Some other possible thresholds:
10. Ability to perform gradient hacking
11. Ability to engage in acausal trade
12. Ability to become economically self-sustaining outside containment
13. Ability to self-replicate