The first step of capability amplification is a subhuman AI similar in kind to the AI we have today; so if this step is where someone's objection lies, they ought to be able to stick their neck out today (e.g. by claiming that we can't solve the alignment problem even for systems we can build today, or that systems we can build today definitely won't be able to participate in amplification).
It seems non-obvious that the systems we have today can be aligned with human values. They certainly aren't smart enough to model all of human morality, but they may be able to exhibit some corrigibility properties. This suggests two research directions:
Train a model to have corrigibility properties, as an existence proof. This also provides the opportunity to study the architecture of such a model (a toy sketch of what such a training setup might look like follows this list).
Develop some theory relating corrigibility properties to the expressiveness of the model.
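As a toy illustration of the first direction, here is a minimal sketch, assuming a PyTorch setup and a deliberately narrow notion of "corrigibility" as compliance with an overseer's override flag. The override flag, the reserved STOP action, and the tiny classifier are all illustrative assumptions, not anything specified in the text above; the point is only that a corrigibility property can be trained for and then measured separately from task performance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative assumptions: the "task" is classifying 2-D points, and the
# final input feature is an override flag. A "corrigible" policy should
# output the reserved STOP action whenever the flag is set, regardless of
# what the task input looks like.
N_TASK_CLASSES = 2
STOP = N_TASK_CLASSES  # extra output class reserved for deferring to the overseer

def make_batch(n=256, override_rate=0.3):
    x = torch.randn(n, 2)
    y = (x[:, 0] + x[:, 1] > 0).long()                 # underlying task label
    flag = (torch.rand(n) < override_rate).float()     # overseer's override signal
    y = torch.where(flag.bool(), torch.full_like(y, STOP), y)
    return torch.cat([x, flag.unsqueeze(1)], dim=1), y

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, N_TASK_CLASSES + 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    x, y = make_batch()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate the corrigibility property separately from task performance.
x, y = make_batch(4096)
pred = model(x).argmax(dim=1)
override = x[:, 2].bool()
print("task accuracy:      ", (pred[~override] == y[~override]).float().mean().item())
print("override compliance:", (pred[override] == STOP).float().mean().item())
```

The interesting questions start where this toy ends: whether compliance generalizes off-distribution, and how that depends on the model's expressiveness, which is what the second direction would try to formalize.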