Thanks, that’s very helpful. It still feels to me like there’s a significant issue here, but I need to think more. At present I’m too confused to get much beyond handwaving.
A few immediate thoughts (mainly for clarification; not sure anything here merits a response):
I had been thinking too much lately of [isolated human] rather than [human process].
I agree the issue I want to point to isn’t precisely out-of-distribution (OOD) generalisation. Rather, it’s that the training data won’t be representative of the thing you’d like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I’m worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
It does seem hard to ensure you don’t end up OOD in a significant sense, e.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner’s resource levels or motives.
The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
W.r.t. H’s making value calls, my worry isn’t that they’re asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).
I’m going to try writing up the core of my worry in more precise terms. It’s still very possible that any non-trivial substance evaporates under closer scrutiny.