I want descriptive theories of intelligent systems to answer questions of the following form.
Consider:
Goal misgeneralisation, specification gaming/reward hacking, deceptive alignment, gradient hacking/gradient filtering, and other alignment failure modes
“Goals”/values[2], world models, interfaces, and other aspects of a system’s type signature
Coherence/goal-directedness, value type[3], reflectivity/embeddedness, and other general alignment properties
Corrigibility/deference, myopia/time horizon, impact regularisation, mild optimisation, and other general safety properties
And for each of the above clusters, I want to ask the following questions:
How likely are they to emerge by default?
That is, without training processes that actively incentivise or otherwise select for them
Which properties/features are “natural”?
Which properties/features are “anti-natural”?
If they do emerge, in what form will they manifest?
To what degree is the property/feature exhibited in particular systems?
Are they selected for by conventional ML training processes?
What kind of training processes select for them?
What kind of training processes select against them?
How does selection for/against these properties trade off against performance, “capabilities”, cost, and <other metrics we care about>?
I think that answers to these questions would go a long way towards deconfusing us and refining our thinking around:
The magnitude of risk we face with particular paradigms/approaches
The most probable failure modes
And how to mitigate them
The likelihood of alignment by default
Alignment taxes for particular safety properties (and safety in general)