Responding first to the general approach to good-enough alignment:
I think I would agree with this if you said “optimization that’s at or below human level” rather than “not ridiculously far above”.
Humans can be terrifying. The prospect of a system slightly smarter than any human who has ever lived, with values that are just somewhat wrong, seems not great.
Less important response: If by “not great” you mean “existentially risky”, then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.
My real objection: Your claim is about what happens after you’ve already failed, in some sense—you’re starting from the assumption that you’ve deployed a misaligned agent. From my perspective, you need to start from a story in which we’re designing an AI system that will eventually have, let’s say, “5x the intelligence of a human”, whatever that means, but we get to train that system however we want. We can inspect its thought patterns, spend lots of time evaluating its decisions, test what it would do in hypothetical situations, use earlier iterations of the tool to help understand later iterations, etc. My claim is that whatever bad optimization “sneaks through” this design process is probably not going to have much impact on the agent’s performance, or we would have already caught it.
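To make the kind of story I have in mind a bit more concrete, here is a deliberately toy sketch of that design process. Every name and number below is made up purely for illustration (none of it corresponds to a real tool or training setup); the only point is the structure: scale up in small steps, and at each step use earlier, already-vetted iterations to help check the new one before trusting it.

```python
def train_next_iteration(capability_level):
    """Stand-in for 'train a somewhat more capable iteration'."""
    return capability_level + 0.5

def passes_oversight(candidate_level, trusted_levels):
    """Stand-in for inspecting thought patterns, evaluating decisions, and
    testing hypothetical situations, aided by earlier vetted iterations."""
    # Toy rule: oversight only works while the capability jump over our best
    # already-trusted helper stays small.
    best_helper = max(trusted_levels)
    return candidate_level - best_helper <= 1.0

def design_loop(target_level=5.0):
    trusted = [1.0]   # start with roughly human-level, already-vetted tools
    current = 1.0
    while current < target_level:
        candidate = train_next_iteration(current)
        if not passes_oversight(candidate, trusted):
            # This is where bad optimization that actually affects behavior
            # should get caught; if it barely affects behavior, it also
            # matters much less.
            raise RuntimeError("oversight check failed; revise training")
        trusted.append(candidate)
        current = candidate
    return current

print(design_loop())   # reaches 5.0 only if every intermediate check passes
```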
Possibly related: I don’t like thinking of this in terms of how “wrong” the values are, because that doesn’t allow you to make distinctions about whether behaviors have already been seen at training or not.
But really, mainly, I was making the normative claim. A culture of safety is not one in which “it’s probably fine” is allowed as part of any real argument. Any time someone is tempted to say “it’s probably fine”, it should be replaced with an actual estimate of the probability, or a hopeful statement that combined with other research it could provide high enough confidence (with some specific sketch of what that other research would be), or something along those lines. You cannot build reliable knowledge out of many many “it’s probably fine” arguments; so at best you should carefully count how many you allow yourself.
A relevant empirical claim sitting behind this normative intuition is something like: “without such a culture of safety, humans have a tendency to slide into whatever they can get away with, rather than upholding safety standards”.
If your claim is just that “we’re probably fine” is not enough evidence for an argument, I certainly agree with that. That was an offhand remark in an opinion in a newsletter where words are at a premium; I obviously hope to do better than that in reality.
This all seems pretty closely related to Eliezer’s writing on security mindset.
Some thoughts here:
I am unconvinced that we need a solution that satisfies a security-mindset perspective, rather than one that satisfies an ordinary-paranoia perspective. (A crucial point here is that the goal is not to build adversarial optimizers in the first place, rather than defending against adversarial optimization.) As far as I can tell the argument for this claim is… a few fictional parables? (Readers: Before I get flooded with examples of failures where security mindset could have helped, let me note that I will probably not be convinced by this unless you can also account for the selection bias in those examples.)
I don’t really see why the ML-based approaches don’t satisfy the requirement of being based on security mindset. (I agree “we’re probably fine” does not satisfy that requirement.) Note that there isn’t a solution that is maximally security-mindset-y, the way I understand the phrase (while still building superintelligent systems). A simple argument: we always have to specify something (code if nothing else); that something could be misspecified. So here I’m just claiming that ML-based approaches seem like they can be “sufficiently” security-mindset-y.
I might be completely misunderstanding the point Eliezer is trying to make, because it’s stated as a metaphor / parable instead of just stating the thing directly (and a clear and obvious disanalogy is that we are dealing with the construction of optimizers, rather than the construction of artifacts that must function in the presence of optimization).
This seems like a pretty big disagreement, which I don’t expect to properly address with this comment. However, it seems a shame not to try to make any progress on it, so here are some remarks.
Less important response: If by “not great” you mean “existentially risky”, then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.
My answer to this would be, mainly because they weren’t living in times as risky as ours; for example, they were not born and raised in a literal AGI lab (which the hypothetical system would be).
My real objection: Your claim is about what happens after you’ve already failed, in some sense—you’re starting from the assumption that you’ve deployed a misaligned agent. From my perspective, you need to start from a story in which we’re designing an AI system that will eventually have, let’s say, “5x the intelligence of a human”, whatever that means, but we get to train that system however we want.
The scenario we were discussing was one where robustness to scale is ignored as a criterion, so my concern is that the system turns out more intelligent than expected, and hence tools like, e.g., asking earlier iterations of the same system to help examine the cognition may fail. If you’re pretty confident that your alignment strategy is sufficient for 5x human, then you have to be pretty confident that the system is indeed 5x human. This can be difficult due to the difference between task performance and the intelligence of inner optimisers. For example, GPT-3 can mimic humans moderately well (very impressive by today’s standards, obviously, but moderately well in the grand scope of things). However, it can mimic a variety of humans, in a way that’s in some sense much better than any one human. This makes it obvious that GPT-3 is smarter than it lets on; it’s “playing dumb”. Presumably this is what led Ajeya to predict that GPT-3 can offer better medical advice than any doctor (if only we could get it to stop playing dumb).
So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.
If I was convinced of that, my next question would be how we can be that confident that we have 5x-appropriate tools. Part of why I tend toward “robustness to scale” is that it seems difficult to make strong scale-dependent arguments, except at the scales we can empirically investigate (so not very useful for scaling up to 5x human, until the point at which we can safely experiment at that level, at which point we must have solved the safety problem at that level in other ways). But OTOH you’re right that it’s hard to make strong scale-independent arguments, too. So this isn’t as important to the crux.
Possibly related: I don’t like thinking of this in terms of how “wrong” the values are, because that doesn’t allow you to make distinctions about whether behaviors have already been seen at training or not.
Right, I agree that it’s a potentially misleading framing, particularly in a context where we’re already discussing stuff like process-level feedback.
So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.
This makes sense, though I probably shouldn’t have used “5x” as my number—it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like “we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn’t depend significantly on the current compute / capacity / data”.