I’d be interested in your thoughts on human motivation in HCH and amplification schemes. Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?
Specifically, it concerns me that every H will have preferences it values more highly than [completing whatever task we assign], so H should be expected to optimise its output for its own values rather than for the assigned task wherever these objectives diverge. In general, the output needn’t relate to the question/task at all. [I don’t think you’ve addressed this at all recently—the closest I’ve come across is on specifying enlightened judgement precisely]
I’d appreciate it if you could say whether/where you disagree with the following kind of argument. I’d like to know what I’m missing:
Motivation seems like an eventual issue for imitative amplification. Even for an H who always attempted to give good direct answers to questions in training, the models that best predict H’s output would account for differing levels of enthusiasm, focus, effort, frustration… based in part on H’s attitude to the question and the opportunity cost of answering it directly.
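As a toy sketch of this point (everything here is a hypothetical illustration, not a claim about any real training setup): if H sometimes goes off-task on low-stakes questions, the loss-minimising imitator reproduces exactly that conditional behaviour, so it is forced to model H’s motivational state rather than just the task:

```python
import random

random.seed(0)

# Hypothetical H: always capable of a direct answer, but on low-stakes
# questions sometimes spends the effort on something else instead.
OFF_TASK_RATE = 0.3

def h_output(low_stakes):
    if low_stakes and random.random() < OFF_TASK_RATE:
        return "off_task"  # e.g. unrelated-but-useful information
    return "direct_answer"

# Imitation dataset of (context, H's output) pairs.
data = [(ls, h_output(ls))
        for ls in (random.random() < 0.5 for _ in range(20000))]

def learned_off_task_prob(data, low_stakes):
    """The imitator that minimises log-loss matches H's conditional
    output frequencies; this returns that imitator's prediction."""
    outs = [o for ls, o in data if ls == low_stakes]
    return outs.count("off_task") / len(outs)
```

The fitted imitator predicts off-task behaviour at roughly the base rate on low-stakes questions and never on high-stakes ones: perfectly imitating H means inheriting H’s motivation, not the idealised always-direct-answer policy.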
The ‘correct’ (w.r.t. alignment preservation) generalisation must presumably be to give, in all circumstances, the output that H would give. In scenarios where H wouldn’t directly answer the question (e.g. because H believed the value of answering the question was trivial relative to the opportunity cost), this might include deception, power-seeking etc. Usually I’d expect high-value true-and-useful information unrelated to the question; deception-for-our-own-good just can’t be ruled out.
If a system doesn’t always adapt to give the output H would, on what basis do we trust it to adapt in ways we would endorse? It’s unclear to me how we avoid throwing the baby out with the bathwater here.
Or would you expect to find Hs for whom such scenarios wouldn’t occur? This seems unlikely to me: opportunity cost would scale with capability, and I’d predict every H would have their price (generally I’m more confident of this for precisely the kinds of H I’d want amplified: rational, altruistic...).
If we can’t find such Hs, doesn’t this at least present a problem for detecting training issues? If HCH may avoid direct answers or deceive you (for worthy-according-to-H reasons), then an IDA of that H eventually would too. At that point you’d need to distinguish [benign non-question-related information] and [benevolent deception] from [malign obfuscation/deception], which seems hard (though perhaps no harder than achieving existing oversight desiderata?).
Even assuming that succeeds, you wouldn’t end up with a general-purpose question-answerer or task-solver: you’d get an agent that does whatever an amplified [model predicting H-diligently-answering-training-questions] thinks is best. This doesn’t seem competitive across enough contexts.
...but hopefully I’m missing something.
I mostly don’t think this thing is a major issue. I’m not exactly sure where I disagree, but some possibilities:
H isn’t some human isolated from the world; it’s an actual process we are implementing (analogous to the current workflow: external contractors, lots of discussion about the labeling process and what values it might reflect, discussions between contractors and the people structuring the model, discussions about cases where people disagree).
I don’t think H is really generalizing OOD, you are actually collecting human data on the kinds of questions that matter (I don’t think any of my proposals rely on that). So the scenario you are talking about is something like the actual people who are implementing H—real people who actually exist and we are actually working with—are being offered payments or extorted or whatever by the datapoints that the actual ML is giving them. That would be considered a bad outcome on many levels (e.g. man that sounds like it’s going to make the job stressful), and you’d be flagging models that systematically produce such outputs (if all is going well they shouldn’t be upweighted), and coaching contractors and discussing the interesting/tricky cases and so on.
H is just not making that many value calls; those are mostly implemented by the process within which H answers. Similarly, we’re just not offloading that much of the substantive work to H (e.g. they don’t need to be super creative or wise; we are just asking them to help construct a process that responds appropriately to evidence).
I don’t really know what kind of opportunity cost you have in mind. Yes, if we hire contractors and can’t monitor their work, they will sometimes do a sloppy job. And indeed, if someone from an ML team is helping run an oversight process, there might be some kinds of inputs where they don’t care and slack off? But there seems to be a big mismatch between the way this scenario is being described and a realistic process for producing training data.
Most of the errors that H might make don’t seem like they contribute to large-scale consequentialist behavior within HCH, and this mostly just doesn’t seem like a big deal or a serious problem. We think a lot about the kinds of errors H might make that aren’t noise, e.g. systematic divergences between what contractors do and what we want them to do; it seems easy for those to be worse than random (and that’s something we can monitor), but there’s a lot of room between that and “undermines benignness.”
Overall it seems like the salient issue is whether sufficiently ML-optimized outputs can lead to malign behavior by H (in which case it is likely also leading to crazy stuff in the outside world), but I don’t think that motivational issues for H are a large part of the story (those cases would be hard for any humans, and this is a smaller source of variance than other kinds of variation in H’s competence or our other tools for handling scary dynamics in HCH).
Thanks, that’s very helpful. It still feels to me like there’s a significant issue here, but I need to think more. At present I’m too confused to get much beyond handwaving.
A few immediate thoughts (mainly for clarification; not sure anything here merits response):
I had been thinking too much lately of [isolated human] rather than [human process].
I agree the issue I want to point to isn’t precisely OOD generalisation. Rather it’s that the training data won’t be representative of the thing you’d like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I’m worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
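One way to make this concrete (a minimal sketch; the labeller model and all numbers are invented for illustration): suppose the process generating the training data skimps effort above some difficulty threshold. The threshold never appears in any individual label, yet it can be recovered from the pattern of errors alone, i.e. a property of the generating process leaks through the bias:

```python
import random
from math import log

random.seed(1)

# Hypothetical labeller: careful (always correct) below this difficulty,
# a coin-flip guess above it. The threshold is a hidden property of the
# generating process, never directly observed.
TRUE_THRESHOLD = 0.6

def label(difficulty):
    if difficulty <= TRUE_THRESHOLD:
        return 1  # correct label
    return 1 if random.random() < 0.5 else 0  # effort not worth it: guess

data = [(d, label(d)) for d in (random.random() for _ in range(20000))]

def log_likelihood(threshold):
    """Likelihood of the observed labels under a candidate threshold."""
    ll = 0.0
    for d, y in data:
        p_correct = 1.0 if d <= threshold else 0.5
        p = p_correct if y == 1 else 1.0 - p_correct
        ll += log(max(p, 1e-9))
    return ll

# Maximum-likelihood recovery of the effort threshold from the bias pattern.
recovered = max((i / 100 for i in range(1, 100)), key=log_likelihood)
```

The recovered threshold closely matches the true one: a model fit to [output of process aiming to convey X] learns not just a noisy X but the structure of the process itself.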
It does seem hard to ensure you don’t end up OOD in a significant sense. E.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner’s resource levels or motives.
The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
W.r.t. H’s making value calls, my worry isn’t that they’re asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).
I’m going to try writing up the core of my worry in more precise terms. It’s still very possible that any non-trivial substance evaporates under closer scrutiny.