The central focus is on solving a version of the alignment problem abstracted from almost all information about the system the AI is trying to align with, and on trying to solve this version of the problem for arbitrary levels of optimisation strength.
[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it.
Strong disagree with that particular conception of “minimality” being desirable. A desirable conception of “minimal” AGI from my perspective would be one which can be meaningfully aligned with humanity while being minimally dangerous, full stop. Getting that is still useful because it at least gets you knowledge you could use to make a stronger one later.
If you add “preventing immediately following AGIs from destroying the world” to the desiderata and remove “meaningfully aligned”, your attempted clever scheme to execute a pivotal act and then shut down will:
a) fail to shut down soon enough, and destroy the world
b) get everyone really angry, then we repeat the situation but with a worse mindset
c) incentivize the AGI’s creators to re-deploy it to prevent (b); if they succeed and also avoid (a), they end up ruling the world and being forced into tyrannical rule due to their lack of legitimacy
and in addition to the above:
If you plan to do that, everyone who doesn’t agree with the plan is incentivized to accelerate their own plans, and to make them more focused on being capable of enacting changes to the world, to beat you to the punch. If you want to avoid race dynamics, you need to focus on not destroying the world with your own project, not on other people’s projects.
P.S. Unlike avturchin, I don’t actually object to openly expecting an AI “taking over the world”, if you can make a strong enough case that your AI is aligned properly. My objection is primarily to illegitimate actions, and I think a strong and believed-to-be-aligned AI can be expected to de facto take over in ways that are reliably perceived as (and thus are) legitimate. Taking actions that the planners of those actions refuse to specify exactly because they are admittedly “outside the Overton window” is an entirely different matter!
Very soon (months?) after the first real AGI is made, all AGIs will be aligned with each other, and all newly made AGIs will also be aligned with those already existing. One way or another.
The question is how much of humanity will still exist by that time, and whether those AGIs will also be aligned with humanity.
But yes, I think it’s possible to get to that state in a relatively non-violent and lawful way.
While this view may be correct, its optics are bad: “alignment” becomes synonymous with “taking over the world”, and people will start seeing this before it is actually implemented.
They will see something like: “When they say ‘alignment’, they mean that the AI should ignore anything I say and start taking over the world, so it is not ‘AI alignment’ but ‘world alignment’.”
They will see that AI alignment is the opposite of AI safety, as an aligned AI must start taking very risky and ambitious actions to perform a pivotal act.
See the Minimality principle quoted above.