Give the world’s thousand most respected AI researchers $1M each to spend 3 months working on AI alignment, with an extra $100M if by the end they can propose a solution alignment researchers can’t shoot down.
An unfortunate fact is that the existing AI alignment community is bad at reaching consensus with the proposers of alignment solutions about whether a given proposal has been “shot down”. For instance, consider the proposed solution to the “off-switch problem” in which an AI maintains uncertainty about human preferences, acts so as to do well according to those (unknown) preferences, and updates its beliefs about human preferences based on human behaviour (for example, inferring “I should shut down” from the human attempting to shut the AI down), as described in this paper. My sense is that Eliezer Yudkowsky and others think they have shot this proposal down (see the Arbital page on the problem of fully updated deference), but that Stuart Russell, a co-author of the original paper, does not (based on personal communication with him). This suggests that we would need significant advances in order to convince people who are not already AI safety researchers that their proposed solutions don’t work.
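To make the mechanism at issue concrete, here is a minimal numeric sketch of the kind of off-switch game the paper describes (this appears to be Hadfield-Menell et al.’s “The Off-Switch Game” model, with a robot uncertain about the human’s utility for its proposed action). The belief distribution and all numbers below are hypothetical, chosen purely for illustration rather than taken from the paper:

```python
# Minimal sketch of an off-switch-game-style decision, with hypothetical numbers.
# The robot is uncertain about the human's utility U(a) for its proposed action a.
# It can: act now, switch itself off, or defer (wait) and let the human decide.
# A rational human lets the action proceed iff U(a) >= 0.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical belief: the robot thinks U(a) ~ Normal(mean=0.5, std=2.0).
samples = rng.normal(loc=0.5, scale=2.0, size=100_000)

value_act_now  = samples.mean()                   # E[U]            -- just do it
value_shut_off = 0.0                              # 0 by convention -- do nothing
value_defer    = np.maximum(samples, 0.0).mean()  # E[max(U, 0)]    -- rational human vetoes bad actions

print(f"act now : {value_act_now:.3f}")
print(f"shut off: {value_shut_off:.3f}")
print(f"defer   : {value_defer:.3f}")

# Under uncertainty, E[max(U, 0)] >= max(E[U], 0), so deferring (keeping the
# off switch meaningful) is weakly best. The "fully updated deference" worry
# is, roughly, that once the robot's belief has collapsed to a point estimate,
# this incentive to defer disappears:
point_estimate = np.full_like(samples, samples.mean())
print(f"defer after full update: {np.maximum(point_estimate, 0.0).mean():.3f}  "
      f"(= act now; deference no longer adds value)")
```

The dispute described above is not about whether this arithmetic is right, but about whether the “fully updated deference” objection undermines the approach; the sketch is only meant to show where the disagreement sits.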
Strongly encourage you to make this comment its own post; the work of laying out “here’s SOTA and here’s why it doesn’t work” in precise and accessible language is invaluable to incentivizing outsiders to work on alignment.