There is also the ontology identification problem. It has two main parts. First, we don’t know how to specify exactly what a diamond is, because we don’t know the true base-level ontology of the universe. Second, we don’t know how diamonds will be represented in the AI’s model of the world.
I personally don’t expect coding a diamond-maximizing AGI to be hard, because I think that “diamond” is a sufficiently natural concept that normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more basic physics, e.g. quarks that exist below the molecular level, “diamond” will probably still be a pretty natural concept, just like how “apple” didn’t stop being a useful concept after the shift from Newtonian mechanics to QM.
Of course, concepts such as human values/corrigibility/whatever are a lot more fragile than diamonds, so this doesn’t seem helpful for alignment.
(Unsure whether to mark “agree” for the first two paragraphs, or “disagree” for the last line. Leaving this comment instead.)
Marked as “disagree” conditional on you marking “agree”, so you can mark “agree” to accurately express degree of controversy.
OK, I marked “agree.”