Re whether messy goal-seekers can be schemers, you may address this in a different place (and if so forgive me, and I’d appreciate you pointing me to where), but I keep wondering what notion of scheming (or deception, etc.) we should be adopting. In particular:
an “internalist” notion, where ‘scheming’ is defined via the system’s internals, i.e. roughly: the system has goal A but acts as if it has goal B, until the moment is suitable to reveal its true goal A
an “externalist” notion, where ‘scheming’ is defined either from the perspective of an observer (e.g. I thought the system had goal B, maybe I even ran a bunch of more or less careful behavioral tests to raise my confidence in this assumption, but in some new situation it gets revealed that the system pursues A instead)
or an externalist notion defined via the effects that manifest in the world (e.g. from a more ‘bird’s-eye’ perspective, we can observe that the system had a number of concrete (harmful) effects on one or several agents via the mechanism that those agents misjudged what goal the system was pursuing, and therefore e.g. mispredicted its future behaviour and based their own actions on this wrong assumption)
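To make the contrast concrete, here is a toy sketch (entirely hypothetical, not a proposed detection method) of how the three notions would each operationalise “scheming” for a trivially simple system; all names and predicates are illustrative assumptions:

```python
# Toy illustration of three candidate operationalisations of "scheming".
# The system and all predicates below are made up for illustration only.

class ToySystem:
    """A system whose true goal is A, but which acts as if its goal
    were B until it judges the moment suitable to reveal A."""

    def __init__(self):
        self.true_goal = "A"            # internal state
        self.moment_is_suitable = False

    def act(self) -> str:
        # Behaviour: pursue B while concealing, pursue A once revealed.
        return self.true_goal if self.moment_is_suitable else "B"


def internalist_scheming(system: ToySystem) -> bool:
    # Internalist: compare the system's internal goal with its behaviour.
    return system.true_goal != system.act()

def observer_scheming(believed_goal: str, revealed_goal: str) -> bool:
    # Observer-relative: scheming = the observer's belief about the
    # system's goal turned out to be wrong.
    return believed_goal != revealed_goal

def effects_scheming(harms_via_misjudged_goal: int) -> bool:
    # Effects-based: scheming = harmful effects actually manifested via
    # other agents misjudging the system's goal.
    return harms_via_misjudged_goal > 0


system = ToySystem()
print(internalist_scheming(system))   # internals and behaviour diverge
system.moment_is_suitable = True
print(internalist_scheming(system))   # goal now revealed, divergence gone
print(observer_scheming(believed_goal="B", revealed_goal=system.act()))
print(effects_scheming(harms_via_misjudged_goal=0))  # no harm manifested yet
```

Note how the three predicates can disagree on the same episode: before any harm manifests, the internalist notion already fires while the effects-based one does not, which is one way of seeing the upsides and downsides listed next.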
It seems to me like all of these notions have different upsides and downsides. For example:
the internalist notion seems (?) to assume/bake into its definition of scheming a high degree of non-sphexishness/consequentialist cognition
the observer-dependent notion comes down to being a measure of the observer’s knowledge about the system, rather than a property of the system itself
the effects-on-the-world-based notion seems plausibly too weak/non-mechanistic to be helpful in the context of crafting concrete alignment proposals/safety tooling