Comments on parts of this other than the ITT thing (response to the ITT part is here)...
> (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)
I don’t usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, just as it was on inner alignment; I wasn’t relying on that particular abstraction at all.
> Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that’s fine, this might be necessary, and so it’s good to have some people pushing in this direction, but it seems like a bunch of people around here don’t just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think your model here completely fails to predict Descartes, Laplace, Von Neumann & Morgenstern, Shannon, Jaynes, Pearl, and probably many others: basically all of the people who’ve successfully made exactly the sort of conceptual advances we aim for in agent foundations.
But it is a model under which one could try to make a case for RLHF. I still do not think that the team doing RLHF work at OpenAI actually thought about whether this model makes RLHF decrease the chance of human extinction, and deliberated on that in a way which could plausibly have resulted in the project not happening. But I have made that claim maximally easy to falsify if I’m wrong.
> I’d be more charitable if people in the LW cluster had actually tried to write up the arguments for things like “why inner misalignment is so inevitable”.
Speaking for myself, I don’t think inner misalignment is clearly inevitable. I do think outer misalignment is much more clearly inevitable, and I do not think inner misalignment is plausibly so unlikely that we can afford to ignore the possibility. Similar to this comment: I’m pretty sympathetic to the view that powerful deceptive inner agents are unlikely, but charging ahead assuming that they will not happen is idiotic given the stakes.
A piece which I think is missing from this thread thus far: in order for RLHF to decrease the chance of human extinction, there has to first be some world in which humans go extinct from AI. By and large, it seems like people who think RLHF is useful are mostly also people who think we’re unlikely to die of AI, and that’s not a coincidence: worlds in which the iterative-incremental-empiricism approach suffices for alignment are worlds where we’re unlikely to die in the first place. Humans are good at iterative incremental empiricism. The worlds in which we die are worlds in which that approach is fundamentally flawed for some reason (usually because we are unable to see the problems).
Thus the wording of this claim I made upthread:
> If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service.
In order for work on RLHF to reduce the chance of humanity going extinct from AI, it has to help in one of the worlds where we otherwise go extinct, not in one of the worlds where alignment by default kicks in and we would probably have been fine anyway.
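To put the same point in rough arithmetic (a toy two-bucket decomposition, not anything more precise than the prose above): split worlds into “hard” ones, where iterative incremental empiricism fails, and “easy” ones, where it suffices, and let $\Delta P$ be the change in extinction probability attributable to RLHF work, holding fixed which kind of world we are in. Then

$$\Delta P(\text{extinction}) \;=\; P(\text{hard})\,\Delta P(\text{extinction} \mid \text{hard}) \;+\; P(\text{easy})\,\Delta P(\text{extinction} \mid \text{easy})$$

Since $P(\text{extinction} \mid \text{easy})$ is small to begin with, the second term cannot contribute much; essentially all of any real risk reduction has to come from the first term, i.e. from RLHF helping in worlds where the default approach fails.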
(In case it was not obvious: I am definitely not saying that one must assign high P(doom) to do actual alignment work. I am saying that one must have some idea of worlds in which we’re actually likely to die.)