I think we can try to solve AI Alignment this way:
Model human values and the objects in the world as a “money system” (a system of meaningful trades). Make the AGI learn the correct “money system”, and specify some obviously incorrect “money systems”.
Basically, you ask the AI to “make paperclips that have the value of paperclips”. The AI can do anything, using all the power in the Universe. But killing everyone is not an option: paperclips can’t be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren’t worth anything, so you haven’t actually gained any money at all.
The idea is that the “value” of a thing doesn’t exist only in your head, but also exists in the outside world. Like money: it has some personal value for you, but it also has some value outside of your head. And some of your actions may destroy this “outside value”. E.g. if you kill everyone to get some money, you get nothing.
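As a rough sketch of what this framing could look like (a hypothetical toy model, not a worked-out proposal; the function name, numbers, and thresholds below are all made up for illustration), you could imagine a reward function that gates the value of the paperclips on the rest of the “money system” staying intact:

```python
# Toy sketch (hypothetical, illustrative only): the "value" of paperclips
# is defined relative to the rest of the "money system" (the world of
# other valued things), so destroying that system zeroes out the reward.

def paperclip_value(paperclips_made: int,
                    world_value_after: float,
                    world_value_before: float) -> float:
    """Value of paperclips, conditional on the surrounding 'economy'.

    If the plan destroyed most of the outside value (e.g. killed everyone
    to build paperclip factories), the paperclips are worth nothing,
    no matter how many were made.
    """
    PER_CLIP_VALUE = 0.01            # paperclips are cheap by definition
    preserved = world_value_after / world_value_before
    if preserved < 0.99:             # the trade destroyed the "money system"
        return 0.0
    # Paperclips can never be worth more than a tiny fraction of the world.
    return min(paperclips_made * PER_CLIP_VALUE, 0.001 * world_value_after)


# A plan that converts everything (including humanity) into paperclips
# scores nothing, because the surrounding "economy" is gone:
print(paperclip_value(10**15, world_value_after=0.0, world_value_before=1e9))  # 0.0

# A modest plan that leaves the world intact scores a small positive value:
print(paperclip_value(1_000, world_value_after=1e9, world_value_before=1e9))   # 10.0
```

The point of the sketch is only that “value” is computed relative to the surrounding world, so a plan that destroys the world can’t score well no matter how many paperclips it produces.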
I think this idea may:
Fix some universal AI bugs. Prevent “AI decides to kill everyone” scenarios.
Give a new way to explore human values. Explain how humans learn values.
“Solve” Goodhart’s Curse and the safety/effectiveness tradeoff.
Unify many different Alignment ideas.
Give a new way to formulate properties we want from an AGI.
I don’t have a specific model, but I still think this framing suggests new ideas and unifies some already existing approaches. So please take a look. Other ideas in this post:
Human values may be simple. Or complex, but not in the way you thought they were.
Humans may have a small number of values. Or a large number, but in an unexpected way.
Disclaimer: Of course, I don’t mean that we shouldn’t be worried about Alignment. I’m just trying to suggest new ways to think about values.