I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al. went in with the sole goal of separating out theories of value formation, and then did nothing else.
I assume here you mean something like: given how most MI projects seem to be done, the most likely output of all these projects will be concrete interventions to make it easier for a model to become more capable, and these concrete interventions will have little to no effect on making it easier for us to direct a model towards having the ‘values’ we want it to have.
I agree with this claim: capabilities generalize very easily, while it seems extremely unlikely that, by default, alignment will generalize in the way we intend. So the most likely outcome of more MI research does seem to be interventions that remove obstacles standing in the way of AGI, without actually making progress on alignment generalization.
Indeed, this is what I mean.