The heuristic is "an assemblage is safer than its primitives".
Formally:
For every primitive p and assemblages A1 and A2 and wiring diagram D, the following is true:
If D∘(A1⊗p) strongly dominates A1 then D∘(A2⊗p) weakly dominates A2.
Recall that D∘(A⊗p) is the wiring-together of A and p using the wiring diagram D.
In English, this says that p can't strictly help in one assemblage while harming another: if adding p strictly improves some assemblage, adding it never makes any other assemblage worse.
I expect counterexamples to this heuristic to look like this:
Many corrigibility primitives allow a human to influence certain properties of the internal state of the AI.
Many interpretability primitives allow a human to learn certain properties of the internal state of the AI.
These primitives might make an assemblage less safe because the AI could use these primitives itself, leading to self-modification.
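The counterexample shape above can be made concrete with a toy model. Everything here is illustrative and not part of the original formalism: assemblages are modeled as sets of primitive names, the wiring D∘(A⊗p) is collapsed to set union, and `safety` is a made-up score in which a hypothetical interpretability primitive (`"interp"`) helps on its own but hurts when combined with a hypothetical `"self-access"` capability that lets the AI use the primitive itself.

```python
# Toy model (illustrative assumptions only): an assemblage is a frozenset of
# primitive names, and wiring a primitive into an assemblage is set union.

def wire(assemblage, primitive):
    """Stand-in for D∘(A⊗p): add primitive p to assemblage A."""
    return assemblage | {primitive}

def safety(assemblage):
    """Hypothetical safety score. 'interp' helps a human understand the AI,
    but hurts when 'self-access' lets the AI use the channel itself."""
    score = 0
    if "interp" in assemblage:
        score += 1
    if "interp" in assemblage and "self-access" in assemblage:
        score -= 2  # the AI exploits the interpretability primitive
    return score

def strongly_dominates(a, b):
    return safety(a) > safety(b)

def weakly_dominates(a, b):
    return safety(a) >= safety(b)

def heuristic_holds(primitive, assemblages):
    """Check the heuristic over a collection of assemblages: if wiring in p
    strictly helps some A1, then wiring in p never hurts any A2."""
    helped_somewhere = any(
        strongly_dominates(wire(a, primitive), a) for a in assemblages)
    if not helped_somewhere:
        return True  # antecedent never fires, implication holds vacuously
    return all(weakly_dominates(wire(a, primitive), a) for a in assemblages)

print(heuristic_holds("interp", [frozenset()]))
# True: with no self-access around, interp only helps
print(heuristic_holds("interp", [frozenset(), frozenset({"self-access"})]))
# False: interp helps the empty assemblage but hurts the self-access one
```

Under these toy scores, `"interp"` strictly improves the empty assemblage (0 → 1) yet degrades the one containing `"self-access"` (0 → −1), which is exactly the counterexample pattern: a primitive that is helpful in one assemblage and harmful in another.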
Can you please take a look at this comment?