We don’t need the model to use that much optimization power, to the point where it breaks the operator. We just need it to perform roughly at human-level, and then we can just deploy many instances of the trained model and accomplish very useful things (e.g. via factored cognition).
So I think it’s important to also note that getting a neural network to “perform roughly at human-level in an aligned manner” may be a much harder task than getting a neural network to achieve maximal rating by breaking the operator. The former may be a much narrower target. This point is closely related to what you wrote here in the context of amplification:
Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like “being smart” and “being a good person” and “still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy”, is a pretty huge ask.
It seems to me obvious, though this is the sort of point where I’ve been surprised about what other people don’t consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku’s Go play so well that a scholar couldn’t tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator’s abilities in addition to your own.
Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. [...]
One might argue: