This is good news because this is more in line with my original understanding of your post. It’s a difficult topic because there are multiple closely related problems of varying degrees of lethality and we had updates on many of them between 2007 and 2023. I’m going to try to put the specific update you are pointing at into my own words.
From the perspective of 2007, we don’t know if we can lossily extract human values into a convenient format using human intelligence and safe tools. We know that a superintelligence can do it (assuming that “human values” is a meaningful concept), but we also know that if we try to do this with an unaligned superintelligence then we all die.
If this problem is unsolvable, then we potentially have to create a seed AI using some more accessible value, such as corrigibility, and try to maintain that corrigibility as we ramp up intelligence. That leads us to the problem of specifying corrigibility, and we see “Corrigibility is anti-natural to consequentialist reasoning” on the List of Lethalities.
If this problem is solvable, then we can use human values sooner, which gives us other options. Maybe we can find a basin of attraction around human values, for example.
The update between 2007 and 2023 is that the problem appears solvable. GPT-4 is a safe tool (it exists and we aren’t extinct yet) and it does a decent job of the extraction. A more focused AI could do the task better without being riskier.
This does not mean that we are not going to die. Yudkowsky has 43 items on the List of Lethalities, and this post addresses part of item 24. The remaining items are sufficient to kill us ~42.5 times. It’s important to be able to discuss one lethality at a time if we want to die with dignity.