This seems like a pretty big disagreement, which I don’t expect to properly address with this comment. However, it seems a shame not to try to make any progress on it, so here are some remarks.
Less important response: If by “not great” you mean “existentially risky”, then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.
My answer to this would be, mainly because they weren’t living in times as risky as ours; for example, they were not born and raised in a literal AGI lab (which the hypothetical system would be).
My real objection: Your claim is about what happens after you’ve already failed, in some sense—you’re starting from the assumption that you’ve deployed a misaligned agent. From my perspective, you need to start from a story in which we’re designing an AI system, that will eventually have let’s say “5x the intelligence of a human”, whatever that means, but we get to train that system however we want.
The scenario we were discussing was one where robustness to scale is ignored as a criterion, so my concern is that the system turns out more intelligent than expected, and hence tools such as asking earlier iterations of the same system to help examine the cognition may fail. If you’re pretty confident that your alignment strategy is sufficient for 5x human, then you have to be pretty confident that the system is indeed 5x human (and not more). This can be difficult due to the difference between task performance and the intelligence of inner optimisers. For example, GPT-3 can mimic humans moderately well (very impressive by today’s standards, obviously, but moderately well in the grand scope of things). However, it can mimic a variety of humans, in a way that’s in some sense much better than any one human. This makes it obvious that GPT-3 is smarter than it lets on; it’s “playing dumb”. Presumably this is what led Ajeya to predict that GPT-3 can offer better medical advice than any doctor (if only we could get it to stop playing dumb).
So my main crux here is whether you can be sufficiently confident of the 5x estimate to know that your 5x-appropriate tools apply.
If I were convinced of that, my next question would be how we can be that confident that we have 5x-appropriate tools. Part of why I tend toward “robustness to scale” is that it seems difficult to make strong scale-dependent arguments except at the scales we can empirically investigate (so such arguments aren’t very useful for scaling up to 5x human until we can safely experiment at that level, by which point we must already have solved the safety problem at that level in other ways). But OTOH you’re right that it’s hard to make strong scale-independent arguments, too. So this isn’t as central to the crux.
Possibly related: I don’t like thinking of this in terms of how “wrong” the values are, because that doesn’t allow you to make distinctions about whether behaviors have already been seen during training or not.
Right, I agree that it’s a potentially misleading framing, particularly in a context where we’re already discussing stuff like process-level feedback.
So my main crux here is whether you can be sufficiently confident of the 5x estimate to know that your 5x-appropriate tools apply.
This makes sense, though I probably shouldn’t have used “5x” as my number—it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like “we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn’t depend significantly on the current compute / capacity / data”.