Regarding the political feasibility of PCEV:
PCEV gives a lot of extra power to some people, specifically because those people intrinsically value hurting other humans. This presumably makes PCEV politically impossible in a wide range of political contexts (including negotiations between a few governments). More generally: now that it has been pointed out that PCEV has this feature, the risks from scenarios where PCEV gets successfully implemented have presumably been mostly removed, because PCEV is probably off the table as a potential alignment target, pretty much regardless of who ends up deciding what alignment target to aim an AI Sovereign at (the CEO of a tech company, a design team, a few governments, the UN, a global electorate, etc.).
PCEV is, however, just one example of a bad alignment target. Let’s take the perspective of Steve, an ordinary human individual with no special influence over an AI project. The reason that PCEV is dangerous for Steve is that PCEV (i) adopts preferences that refer to Steve, (ii) in a way that gives Steve no meaningful influence over the decision of which Steve-referring preferences PCEV will adopt. PCEV is just one possible AI that would adopt preferences about Steve in a way that Steve would have no meaningful influence over. So, even fully removing all the risks associated with PCEV in particular does not remove all risks from this more general class of dangerous alignment targets. From Steve’s perspective, the PCEV thought experiment illustrates a more general danger: risks from scenarios where an AI will adopt preferences that refer to Steve, in a way that Steve will have no meaningful influence over.
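To make this abstract danger a bit more concrete, here is a deliberately minimal toy sketch in Python. It is not PCEV’s actual aggregation mechanism; the naive sum-of-stated-utilities rule, the outcome labels, and all the numbers below are illustrative assumptions. The only point it is meant to show is the structural one from the paragraph above: an aggregator that adopts Steve-referring preferences, while treating Steve as one voice among millions, can end up adopting preferences that Steve strongly objects to.

```python
# Toy illustration only -- NOT PCEV's actual aggregation mechanism.
# A naive aggregator adopts whichever Steve-referring preference
# maximizes the sum of stated utilities across all participants.
# All outcome labels and numbers are made-up assumptions.

N = 1_000_000                  # participants other than Steve
OUTCOMES = ["respect_steve", "hurt_steve"]

def aggregate(utilities):
    """Return the outcome with the highest total stated utility."""
    totals = {o: sum(u[o] for u in utilities) for o in OUTCOMES}
    return max(totals, key=totals.get)

# Assume a small minority intrinsically values hurting Steve (+10 each),
# everyone else is close to indifferent (+0.001 for respecting him),
# and Steve himself objects very strongly (-1000).
sadists = 2_000
others = N - sadists

utilities = (
    [{"respect_steve": 0.0, "hurt_steve": 10.0}] * sadists
    + [{"respect_steve": 0.001, "hurt_steve": 0.0}] * others
    + [{"respect_steve": 0.0, "hurt_steve": -1000.0}]  # Steve's own entry
)

print(aggregate(utilities))    # -> "hurt_steve" under these made-up numbers
```

Under these (made-up) numbers the minority that wants Steve hurt gets its way, and nothing Steve can do with his single entry changes that: his influence over which Steve-referring preference gets adopted is negligible by construction. PCEV’s actual aggregation is different, but the worry described above has the same shape.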
Even more generally: scenarios where someone successfully implements some type of bad alignment target still pose a very real risk. Alignment Target Analysis (ATA) is still at a very early stage of development, and these risks are not well understood. ATA is also a very neglected field of research. In other words: there are serious risks that could be mitigated, but those risks are not currently being mitigated. (As a tangent, I think that the best way of looking at ATA is: risk mitigation through the identification of necessary features. As discussed here, identifying features that are necessary can be a useful risk mitigation tool, even if those features are far from sufficient, and even if one is not close to any form of solution.)