Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:
1. Ability to be deceptively aligned
2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
3. Incentives to break containment exist in a way that is accessible / understandable to the model
4. Ability to break containment
5. Ability to robustly understand human intent
6. Situational awareness
7. Coherence / robustly pursuing its goal in a diverse set of circumstances
8. Interpretability methods break (or other oversight methods break)
   - doesn’t have to be because of deceptiveness; maybe thoughts are just too complicated at some point, or in a different place than you’d expect
9. Capable enough to help us exit the acute risk period
Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.
Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of ‘full alignment’ (CEV-style), any alignment method (e.g. corrigibility) only works within a specific range of capabilities:
Too much capability breaks alignment, e.g. because a model self-reflects and sees all the ways in which its objectives conflict with human goals.
Too little capability (or too little ‘coherence’) and any alignment method will be non-robust to OOD inputs, or even to small improvements in capability or self-reflectiveness.
Some other possible thresholds:
10. Ability to perform gradient hacking
11. Ability to engage in acausal trade
12. Ability to become economically self-sustaining outside containment
13. Ability to self-replicate