Humans often really really want something in the world to happen
This sentence is adjacent to my core concern about AI alignment, and why I’m not particularly reassured by the difficulty-of-superhuman-performance or return-on-compute arguments regarding AGI: we don’t need superhuman AI to deal superhuman-seeming amounts of damage. Even today’s “perfectly sandboxed” models (in the sense that, according to the most reliable publicly-available information, none of the most cutting-edge models have direct read/write access to the systems that would let them plot and attain world domination, or the destruction of humanity or of specific nations’ interests) have the next-best thing: whenever a new technological lever emerges in the world, humans with malicious intentions are empowered to a much greater degree than those who want strictly the best[1] for everybody. There are also bit-flip attacks on aligned AI, which are much harder to implement on humans.
[1] Using “best” is fraught, but we’ll pretend that “the world best-aligned with a Pareto-optimal combination of each person’s expressed reflective preferences and revealed preferences, to the extent that those revealed preferences do not represent akrasia, or views the person isn’t comfortable expressing directly and publicly but does in fact hold” is an adequate proxy for this line of argument; the alternative is developing a provably-correct theory of morality and politics, which would take 2–4 orders of magnitude more time than this comment.