Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it’s imitating. In particular, if it’s imitating humans working on alignment, then it’s at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)
For full strength, this argument requires that:
1. It emulate the kind of alignment research which the actual humans would do, rather than some other kind of work.
2. It correctly imitate the humans.
Once we relax either of those assumptions, the argument gets riskier. A relaxation of the first assumption would be, e.g., using HCH in place of humans working normally on the problem for a while (I expect this would not work nearly as well as the actual humans doing normal research, in terms of both safety and capability). The second assumption is where inner alignment problems and Evan's work enter the picture.