Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state?
You want to align an AGI. You've just started your first programming course at the Computer Science faculty, and you have a bright idea: let's decompose the problem into simpler subproblems!
What you want is to reach a world state that is not a dystopia. Define a predicate D over world states: D(S) is true if and only if S is dystopian.
You would like to implement the following algorithm:
You start at a random state S.
While D(S):
    Change S a little in a non-dystopian direction.
When the loop exits, D(S) is false: you aligned AI!
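Here is a minimal sketch of that loop in Python. All names are hypothetical placeholders, not an actual proposal: is_dystopian stands in for D, and nudge_toward_non_dystopia stands in for the "change S a little" step; whether either can be filled in is exactly what the questions below ask.

```python
def is_dystopian(state) -> bool:
    """Placeholder for the predicate D. Whether this is any easier to
    specify or evaluate than building a good world directly is exactly
    what the questions below ask."""
    raise NotImplementedError

def nudge_toward_non_dystopia(state):
    """Placeholder for 'change S a little in a non-dystopian direction'."""
    raise NotImplementedError

def reach_non_dystopia(state, max_steps=1_000_000):
    """Repeatedly nudge the world state until D(state) is false."""
    for _ in range(max_steps):
        if not is_dystopian(state):   # D(S) is false: done
            return state
        state = nudge_toward_non_dystopia(state)  # move S a little
    raise RuntimeError("ran out of steps while still in dystopia")
```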
Of course, everything here is vaguely defined. But I still wonder whether it makes sense to pose these questions before doing the defining work. I guess you might want to interpret the questions under the most reasonable (?) formalizations of the concepts above, if such a thing is at all possible and not itself already alignment-complete.
So, we have multiple questions:
1. Is specifying any of the above (e.g., D) alignment-complete?
2. Is evaluating D(S) alignment-complete?
3. Is changing S a little in a non-dystopian direction alignment-complete?
4. How would you actually use that pseudo-algorithm though?
Now, there’s a concept slightly different from alignment-completeness (I’d guess), which is “constructing a state U such that D(U) = false”. Such a thing hasn’t been done yet as far as I know, and every proposal tends to fail horribly.
So the questions above are worth repeating in this form:
1. Is specifying any of the above (e.g., D) easier than constructing U?
2. Is evaluating D(S) easier than constructing U?
3. Is changing S a little in a non-dystopian direction easier than constructing U?
4. How would you actually use that pseudo-algorithm though?
The “change S a little in a non-dystopian direction” step is interesting. Most actions change something random and unimportant, and S is changing all the time through all sorts of processes outside the AI’s control.
It’s not really a binary D or U: it’s a multidimensional space, and humans classify a state by how much D and U they consider it to have. The problem is that this landscape may be neither monotonic nor smooth, so gradient descent (“this is slightly closer to U, let’s do that and repeat”) doesn’t work (see the toy sketch below).
3. Yes (modulo local optima).
4. Use it to focus on verification only.
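To make the local-optimum worry above concrete, here is a toy illustration with an entirely made-up one-dimensional “badness” landscape: a greedy “take whichever small step looks slightly less dystopian” rule stalls at a point that is still bad, because the landscape is not monotonic.

```python
def badness(x):
    # Made-up, non-monotonic "how dystopian is state x" score:
    # a tilted double well with a shallow local minimum near x = 1
    # and a deeper (better) minimum near x = -1.
    return (x ** 2 - 1) ** 2 + 0.3 * x + 1

def greedy_nudge(x, step=0.05, max_iters=10_000):
    """Repeatedly take whichever small step locally reduces badness."""
    for _ in range(max_iters):
        left, right = badness(x - step), badness(x + step)
        if min(left, right) >= badness(x):   # no local improvement: stuck
            return x
        x = x - step if left < right else x + step
    return x

x = greedy_nudge(2.0)
print(f"stopped at x = {x:.2f} with badness {badness(x):.2f}")
# Stops near x ≈ 1 (badness ≈ 1.3) even though x ≈ -1 (badness ≈ 0.7) exists:
# "slightly less dystopian at every step" need not end anywhere good.
```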