The heuristic is "an assemblage is safer than its primitives".
Formally:
For every primitive p, assemblages A₁ and A₂, and wiring diagram D, the following is true:
If D∘(A₁⊗p) strongly dominates A₁, then D∘(A₂⊗p) weakly dominates A₂.
Recall that D∘(A⊗p) is the wiring-together of A and p using the wiring diagram D.
In English, this says that if p strictly helps one assemblage under a wiring diagram D, it can't strictly hurt any other assemblage under that same D.
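To make the quantifiers concrete, here is a toy sketch, not the formalism above: I stub a component's safety as a single score, stub a wiring diagram D as a function that combines the scores of A and p into the score of D∘(A⊗p), and read strong/weak dominance as strict/non-strict comparison of scores. All names and numbers here are hypothetical illustrations; the loop brute-force searches this tiny universe for violations of the heuristic.

```python
from itertools import product

# Hypothetical toy model (not the formalism above): a component's
# "safety" is a single score, and a wiring diagram D combines the
# scores of an assemblage A and a primitive p into the score of
# the wired-together system D∘(A⊗p).

safety = {"A1": 0.5, "A2": 0.7, "p": 0.6}

diagrams = {
    "D_min": min,                        # system only as safe as its weakest part
    "D_avg": lambda a, b: (a + b) / 2,   # parts' safety averaged
}

def wire(D, A, p):
    """Toy stand-in for the safety score of D∘(A⊗p)."""
    return diagrams[D](safety[A], safety[p])

# Search every (D, A1, A2) combination for a violation of the heuristic:
# p strictly helps A1 under D, yet strictly hurts A2 under the same D.
for D, (a1, a2) in product(diagrams, [("A1", "A2"), ("A2", "A1")]):
    helps_a1 = wire(D, a1, "p") > safety[a1]   # strong dominance holds
    hurts_a2 = wire(D, a2, "p") < safety[a2]   # weak dominance fails
    if helps_a1 and hurts_a2:
        print(f"counterexample under {D}: p helps {a1} but hurts {a2}")
```

Even this crude model finds a violation: under the averaging diagram, p strictly improves the weaker assemblage while dragging down the stronger one. That is consistent with expecting counterexamples, as below.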
I expect counterexamples to this heuristic to look like this:
Many corrigibility primitives allow a human to influence certain properties of the internal state of the AI.
Many interpretability primitives allow a human to learn certain properties of the internal state of the AI.
These primitives might make an assemblage less safe, because the AI could use them itself, turning a human-oversight channel into a means of self-modification.
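In the same toy spirit, one could model this failure mode as a wiring diagram that exposes the primitive's state-access channel to the AI as well as to the human; whether adding the primitive helps then depends on who ends up using the channel. A hypothetical sketch:

```python
# Hypothetical sketch of the failure mode above: a corrigibility or
# interpretability primitive exposes a channel into the AI's internal
# state. If the wiring diagram routes that channel only to the human,
# adding the primitive helps; if the AI can also invoke it, the same
# primitive enables self-modification and lowers safety.

def wire_with_channel(assemblage_safety, channel_power, ai_can_use_channel):
    """Toy safety score of an assemblage after wiring in a state-access primitive."""
    if ai_can_use_channel:
        # The AI turns the oversight channel on itself: self-modification risk.
        return assemblage_safety - channel_power
    # Only the human uses the channel: oversight improves safety.
    return assemblage_safety + channel_power

base = 0.7
print(wire_with_channel(base, 0.2, ai_can_use_channel=False))  # 0.9: safer
print(wire_with_channel(base, 0.2, ai_can_use_channel=True))   # 0.5: less safe
```

The same primitive, wired two different ways, lands on opposite sides of the dominance relation, which is exactly the shape of counterexample described above.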
Can you please take a look at this comment?