We can also ask about the prior probability P(A|S). Since there is such a huge space of possible utility functions that a superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than 10^-6.
I think this might be too low given a more realistic training process. Specifically, this is one way the future might go: We train models with gradient descent. Those models develop proxy objectives that are correlated with the base objective used in training. They become deceptively aligned, etc. Importantly, the proxy objective they developed is correlated with the base objective, which is hopefully correlated with human values. I don't think this gets you above a 1/10 chance of the model's objective being good-by-human-lights, but it seems like it could be higher than 10^-6, with the right training setup. Realistically speaking, we're (hopefully) not just instantiating superintelligences with random utility functions.
I think a crux of sorts is what it means for the universe if a superintelligent AI has a utility function that is closely correlated but not identical to humans’. I suspect this is a pretty bad universe.
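For intuition only, here is a minimal Monte Carlo sketch of the correlation framing above. It is my own toy model, not anything from either commenter: outcomes are random vectors, the human utility function and the proxy objective are linear scores with a chosen cosine similarity, and we ask how much of the attainable human value the proxy-optimal outcome captures. In this deliberately simple linear setup an "almost aligned" proxy still captures most of the value; with heavy-tailed or fragile values the picture can be far worse, which is roughly the crux raised in the reply.

```python
# Toy illustration (hypothetical model, not from the original thread):
# utility functions as correlated linear scores over random outcome vectors.
import numpy as np

rng = np.random.default_rng(0)

def human_score_of_proxy_optimum(correlation, n_outcomes, dim=50):
    # Human utility u and proxy utility v are unit vectors with the given cosine similarity.
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)
    noise = rng.normal(size=dim)
    noise -= (noise @ u) * u          # keep only the component orthogonal to u
    noise /= np.linalg.norm(noise)
    v = correlation * u + np.sqrt(1 - correlation**2) * noise

    # "Optimization pressure" = picking the best of many candidate outcomes.
    outcomes = rng.normal(size=(n_outcomes, dim))
    best_for_proxy = outcomes[np.argmax(outcomes @ v)]
    best_for_human = outcomes[np.argmax(outcomes @ u)]
    # Fraction of the attainable human value that the proxy-optimal outcome captures.
    return (best_for_proxy @ u) / (best_for_human @ u)

for rho in (0.99, 0.9, 0.5):
    scores = [human_score_of_proxy_optimum(rho, n_outcomes=100_000) for _ in range(20)]
    print(f"correlation {rho}: proxy optimum captures ~{np.mean(scores):.2f} of attainable human value")
```

The design choice to make everything linear and Gaussian is exactly what makes the result look benign (the captured fraction tracks the correlation); whether real-world value is that forgiving under extreme optimization is the disputed question.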