Likewise, and I’m sure there are bunches of people who expected this sort of use. But I hadn’t thought through all of the ways this could add to capabilities, and I didn’t expect it to be quite so easy.
What I don’t think has been recognized very much are the immense upsides for initial alignment, corrigibility, and interpretability. The dialogue over at Alignment Forum appears to treat these problems as much more difficult than natural language-based wrapper approaches would make them (to be clear, I think there are still real difficulties in all of these, let alone for outer alignment, coordination, and alignment and coordination stability). I could be wrong; maybe everyone has been talking around the implications of this approach to avoid catalyzing it, as you and I do. But avoiding it so much as to change which problems you’re focusing on seems unlikely.