My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art. I don’t know what form this would take, and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then? Given that this is hypothetical and the next large improvement in LMs could come from elsewhere, I’m not suggesting we stop sharing now. But I think we should be prepared for a point at which we have to acknowledge that such sharing leads to significantly stronger models, and therefore re-evaluate sharing this kind of eval work.
As one specific example: has RLHF, which the post below suggests was initially intended for safety, been a net negative for AI safety?
https://www.alignmentforum.org/posts/LqRD7sNcpkA9cmXLv/open-problems-and-fundamental-limitations-of-rlhf