You could probably train a non-dangerous ML model that has superhuman theorem-proving abilities, but we don’t know how to formalize the alignment problem in a way that we can feed it into a theorem prover.
A model that can “solve alignment” for us would be a consequentialist agent explicitly modeling humans, and dangerous by default.
We might be able to formalize some pieces of the alignment problem, like MIRI tried with corrigibility, and Vanessa Kosoy also has some more formal work. Do you think there are no useful pieces to formalize? Or that all the pieces we try to formalize won’t together be enough even if we had solutions to them?
Also, even if it explicitly models humans, would it need to be consequentialist? Could we just have a powerful modeller trained to minimize prediction loss or whatever? The search space may be huge, but having a powerful modeller still seems plausibly useful. We could also filter options, possibly with a separate AI, not necessarily an AGI.
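A minimal sketch of what that last suggestion could look like mechanically, at toy scale: a model trained only on next-token prediction loss proposes candidate outputs, and a separate, smaller model filters them. All the names here (`PredictiveModel`, `FilterModel`, the toy vocabulary, the threshold) are invented for illustration, the filter is left untrained, and none of this engages with the question of whether the predictor's internals end up consequentialist anyway.

```python
# Toy sketch: a pure predictor trained only on prediction loss, plus a
# separate, smaller model that filters its proposals. Hypothetical names
# and scale; not anyone's actual proposal.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 64  # toy vocabulary of "tokens" / proof steps / plan fragments


class PredictiveModel(nn.Module):
    """A pure predictor: trained only to minimize next-token prediction loss."""

    def __init__(self, vocab=VOCAB, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):            # tokens: (batch, seq)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)               # logits: (batch, seq, vocab)


class FilterModel(nn.Module):
    """A separate, smaller model that only scores candidate outputs;
    it never generates or chooses actions itself."""

    def __init__(self, vocab=VOCAB, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)


def train_predictor(model, data, steps=100):
    """Minimize plain prediction loss (cross-entropy on the next token)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        batch = data[torch.randint(len(data), (8,))]   # (8, seq)
        logits = model(batch[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                               batch[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()


@torch.no_grad()
def propose_and_filter(predictor, filt, prompt, n_candidates=16,
                       length=10, threshold=0.0):
    """Sample candidate continuations from the predictor, then keep only
    those the separate filter model scores above a threshold."""
    candidates = []
    for _ in range(n_candidates):
        seq = prompt.clone()
        for _ in range(length):
            logits = predictor(seq.unsqueeze(0))[0, -1]
            nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
            seq = torch.cat([seq, nxt])
        candidates.append(seq)
    scores = filt(torch.stack(candidates))
    return [c for c, s in zip(candidates, scores) if s.item() > threshold]


if __name__ == "__main__":
    data = torch.randint(0, VOCAB, (256, 12))   # stand-in corpus
    predictor, filt = PredictiveModel(), FilterModel()
    train_predictor(predictor, data)
    kept = propose_and_filter(predictor, filt, prompt=data[0, :4])
    print(f"{len(kept)} candidates passed the filter")
```

The split is only structural: the predictor never evaluates outcomes, and the thing that accepts or rejects proposals can be much weaker than the thing that generates them, which is the "separate AI, not necessarily an AGI" part of the comment above.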