On 2, I would probably make a stronger additional claim, that even the parts of alignment research that are “just capabilities” don’t seem to happen by default, and the vast majority of work done in this space (at least with large models) seems to have been driven by people motivated to work on alignment. Yes, in some ideal world, those working towards AI-based products would have done things like RL from human feedback, but empirically they don’t seem to be doing that.
(I also agree with the response you’ve given in the post.)
Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there’s room for reasonable disagreement on this question, although I favour the former.
Yeah, that’s also a good point, though I don’t want to read too much into it, since it might be a historical accident.