I basically agree with most of the post, but there are a few points where I have some value to add:
#29 (consequences of actions): relevant post. I think this problem is possibly reducible to ELK.
#32 (words only trace real thoughts): My understanding of the original point: the reason we would want to train an AI that imitates a human’s thoughts is ideally to create an AI that, internally, uses the same algorithm to come to its answer as the human did. The safety properties come from the fact that the learned algorithm generalizes the same way the human algorithm generalizes (related to #10). One can debate whether humans are powerful/aligned enough even if this were the case in theory, but that’s orthogonal. The problem pointed at here is that systems that are powerful at imitating human thought would not necessarily be using the same algorithm humans use. This other algorithm could generalize in weird ways, and the fact that human explanations don’t reveal all or even most of our actual reasoning makes it harder to learn the human algorithm, because it’s less well specified by the data. In particular, one very concerning kind of not-the-same-algorithm is a mesaoptimizer that is merely trying to understand how humans think; if this happens, it kind of defeats a lot of the point of doing this imitation thing in the first place.
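To make the “matches the data, but isn’t the same algorithm” worry concrete, here is a minimal toy sketch (my own illustration with made-up functions, not anything from the post): two imitators that reproduce the “human” behaviour perfectly on the training distribution but come apart on novel inputs.

```python
# Toy illustration (hypothetical): behavioural imitation on a narrow
# distribution does not pin down the underlying algorithm.
import numpy as np

rng = np.random.default_rng(0)

# "Human" behaviour on a narrow training distribution: y = x on [0, 1].
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = x_train

# Imitator A: literally the same algorithm as the human.
def imitator_a(x):
    return x

# Imitator B: a different algorithm that happens to agree on [0, 1]
# but diverges badly outside it.
def imitator_b(x):
    return x + 10.0 * np.maximum(0.0, x - 1.0) ** 2

# Both look like perfect imitators on the training distribution...
print(np.max(np.abs(imitator_a(x_train) - y_train)))  # 0.0
print(np.max(np.abs(imitator_b(x_train) - y_train)))  # 0.0

# ...but disagree wildly on novel inputs.
x_new = np.array([2.0, 5.0])
print(imitator_a(x_new))  # [2. 5.]
print(imitator_b(x_new))  # [ 12. 165.]
```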
#33 (interpretability stuff):
I think the existence of features that we can understand is some evidence, but not perfect evidence, for natural abstractions. One very ELK-flavored problem is that you can’t tell whether a neuron is the “actually deceive humans” neuron or the “humans will realize that this is deceptive” neuron, and it’s not clear which of these concepts is more natural for a NN. Also, even if you can understand most of the network, as long as there is a chunk of it that you can’t interpret, there could be dangerous stuff hiding in there. As an example, I think it’s plausible that we end up with models that contain a mostly-natural-abstractions model of a human, which the model uses to figure out how humans would react. If we poke around in such a model, we will notice that the “deception” neuron makes the model less likely to execute a plan, but we have no way of differentiating whether this is because the model is trying to avoid doing deceptive things, or because it’s modelling whether humans would be capable of catching it.
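As a toy illustration of why poking at that neuron doesn’t settle the question (my own sketch with made-up data, not from the post): if “is actually deceptive” and “a human would catch it” are perfectly correlated on the data we can inspect, any statistic we compute about the neuron looks identical for both concepts.

```python
# Toy illustration (hypothetical): two concepts that coincide on-distribution
# are indistinguishable by probing an internal feature on that distribution.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# On the data we can inspect, every deceptive plan is also one a human
# would recognize as deceptive, so the two labels coincide exactly.
is_deceptive = rng.integers(0, 2, size=n)
human_would_catch = is_deceptive.copy()

# A hypothetical internal feature ("the neuron we found") that tracks one
# of these concepts, plus a bit of noise.
feature = is_deceptive + 0.1 * rng.normal(size=n)

# The feature correlates identically with both concepts, so inspecting it
# on this data can't distinguish "avoids deception" from "avoids getting
# caught".
print(np.corrcoef(feature, is_deceptive)[0, 1])
print(np.corrcoef(feature, human_would_catch)[0, 1])

# The concepts only come apart off-distribution, e.g. on plans that are
# deceptive but too subtle for a human to catch -- exactly the cases we
# don't have labels for.
```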
I have some other thoughts that I’ll write up as shortforms and edit links into this comment later.