I’ve been going through the FAR AI videos from the alignment workshop in December 2023. I’d like people to discuss their thoughts on Shane Legg’s ‘necessary properties’ that every AGI safety plan needs to satisfy. The talk is only 5 minutes; give it a listen:
Otherwise, here are some of the details:
All AGI Safety plans must solve these problems (necessary properties to meet at the human level or beyond):
1. Good world model
2. Good reasoning
3. Specification of the values and ethics to follow
All of these require good capabilities, meaning capabilities and alignment are intertwined.
Shane thinks future foundation models will solve conditions 1 and 2 at the human level. That leaves condition 3, which he sees as solvable if you want fairly normal human values and ethics.
Shane basically thinks that if the above necessary properties are satisfied at a competent human level, then we can construct an agent that will consistently choose the most value-aligned actions, using a cognitive loop that scaffolds the agent into doing so.
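To make the cognitive loop above a bit more concrete, here is a minimal Python sketch of how I read it. This is my own illustration under stated assumptions, not Shane's actual design: the `Agent` class and the names `predict_outcome`, `score_outcome`, and `choose` are all hypothetical, and the toy lambdas stand in for what would really be calls to a capable foundation model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Agent:
    # All three components are hypothetical stand-ins, not anything from the talk.
    predict_outcome: Callable[[str, str], str]   # world model: (situation, action) -> predicted outcome
    score_outcome: Callable[[str, str], float]   # reasoning: (outcome, values) -> alignment score
    values: str                                  # specification of the values and ethics to follow

    def choose(self, situation: str, candidate_actions: Iterable[str]) -> str:
        """Return the candidate action whose predicted outcome scores highest."""
        best_action, best_score = None, float("-inf")
        for action in candidate_actions:
            # Step 1: use the world model to predict what the action leads to.
            outcome = self.predict_outcome(situation, action)
            # Step 2: reason about how well that outcome fits the specified values.
            score = self.score_outcome(outcome, self.values)
            # Step 3: keep the best-scoring action seen so far.
            if score > best_score:
                best_action, best_score = action, score
        if best_action is None:
            raise ValueError("no candidate actions supplied")
        return best_action


# Toy usage with trivial stand-ins for the world model and the reasoning step.
toy = Agent(
    predict_outcome=lambda situation, action: f"{situation} -> {action}",
    score_outcome=lambda outcome, values: 1.0 if "tell the truth" in outcome else 0.0,
    values="be honest",
)
print(toy.choose("asked a hard question", ["lie", "tell the truth"]))
```

The point of the sketch is only that the loop's output is as good as its three inputs: a weak world model, weak reasoning, or a poorly specified set of values each degrades the choice, which is why all three properties are framed as necessary.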
Shane says at the end of this talk:
If you think this is a terrible idea, I want to hear from you. Come talk to me afterwards and tell me what’s wrong with this idea.
Since many of us weren’t at the workshop, I figured I’d share the talk here to discuss it on LW.
The biggest problem here is that it fails to account for other actors using such systems to cause chaos, and for the possibility that the offense-defense balance strongly favours the attacker, particularly if you’ve placed limitations on your systems that make them safer. Aligned human-ish level AIs don’t provide a victory condition.