IMO, the alignment MVP claim Jan is making is approximately "we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)", and this requires:
- We can build models that are:
  - not dangerous themselves
  - capable of alignment research
- We can use RRM to make them aligned enough that we can get useful research out of them.
- We can build these models before anyone builds models that would be dangerous without more progress on alignment than is required to align the models above.
- We have these models for long enough before danger, and/or they speed up alignment progress by enough, that the alignment progress made in that window is comparable to or larger than all the progress made up to that point.
I’d imagine some cruxes to include:
- whether it’s possible to build models capable of somewhat superhuman alignment research that do not have inner agents
- whether people will build systems that can't be made safe without conceptual progress in alignment, before we can build the alignment MVP and get significant work out of it