Alignment idea:

Classify different types of objects in the world. Those objects include your “rewards”. A generally intelligent being can do this.
Treat them as a sort of money system. Describe them in terms of each other.
Learn what the correct money system is.
This would at least allow us to get rid of some universal AI and AGI bugs, because you can specify which “money systems” are definitely incorrect for a certain task. You can even make the AI predict it.
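To make this concrete, here is a minimal sketch in Python, under the assumption that the AI can summarize the world before and after a plan as an inventory of object types. All the names here (Inventory, Plan, implied_exchange) are hypothetical illustrations, not an existing API; the checks attached to the examples below build on them.

```python
# A minimal sketch of the "money system" idea, assuming the AI can summarize
# the world before and after a plan as counts of object types.
# All names here are hypothetical illustrations.

from dataclasses import dataclass

Inventory = dict[str, float]  # object type -> how much of it exists


@dataclass
class Plan:
    before: Inventory  # world state the plan starts from
    after: Inventory   # world state the plan would produce


def implied_exchange(plan: Plan) -> dict[str, float]:
    """What the plan implicitly 'spends' (negative) or 'earns' (positive)
    of each object type -- the plan's implied money system."""
    types = set(plan.before) | set(plan.after)
    return {t: plan.after.get(t, 0.0) - plan.before.get(t, 0.0) for t in types}
```

The checks in the examples below flag plans whose implied exchange is obviously incorrect for the task at hand.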
My examples are inspired by Rob Miles’s examples.
A. You asked an AI to build a house. The AI destroyed part of an already existing house and then restored it. Mission complete: a brand new house has been built.
This behavior implies that you can keep building houses forever without the number of houses ever increasing. For a lot of tasks this is an obviously incorrect “money system”, and the AI could even guess for which tasks it’s incorrect.
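Continuing the hypothetical sketch above, a check for this failure could look roughly like this; the target type name and the simple net-count test are my assumptions, not something the example specifies.

```python
def fakes_production(plan: Plan, target: str) -> bool:
    """Flag plans that 'complete' a build-X task while the net amount of X
    does not actually increase (e.g. tearing down part of a house and then
    rebuilding it). Reuses Plan and implied_exchange from the sketch above."""
    return implied_exchange(plan).get(target, 0.0) <= 0


# A "build a house" plan that leaves the house count unchanged gets flagged.
plan = Plan(before={"house": 10.0}, after={"house": 10.0})
assert fakes_production(plan, "house")
```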
B. You asked an AI to make you a cup of coffee. The AI killed you so that it could complete its task with 100% certainty, without being turned off.
This behavior implies that, for the AI, its goal is more important than whatever caused the goal in the first place. This is an obviously incorrect “money system” for almost any task, except the most general and altruistic ones; for example, the AI needs to save humanity, but every human has turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
Accomplishing the task in such a way that the human would think “I wish I hadn’t asked you” is an obviously incorrect “money system” too: again, you’re undermining the entire reason for your task, and that’s rarely a good sign. And it’s predictable without a deep moral system.
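In the same hypothetical sketch, this constraint could be stated as: for an ordinary task, the plan must not “pay” with the people who issued the request in the first place. The protected object types below are illustrative assumptions.

```python
def spends_the_requester(plan: Plan,
                         protected: frozenset[str] = frozenset({"requester", "human"})) -> bool:
    """Flag plans that pay for the task with whatever caused the task:
    an ordinary request priced in humans is an incorrect money system.
    Reuses Plan and implied_exchange from the sketch above."""
    delta = implied_exchange(plan)
    return any(delta.get(t, 0.0) < 0 for t in protected)


# A coffee plan that removes the requester gets flagged.
plan = Plan(before={"coffee": 0.0, "requester": 1.0},
            after={"coffee": 1.0, "requester": 0.0})
assert spends_the_requester(plan)
```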
C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
This is an obviously incorrect “money system”: paperclips can’t be worth more than everything else on Earth combined. That exchange rate contradicts the relative value of essentially everything else the AI knows about.
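One last check in the same hypothetical sketch: flag plans under which the target object would make up an absurd share of everything that remains. Measuring the “world” as a plain sum of inventory counts, and the 1% threshold, are both rough assumptions chosen only for illustration.

```python
def outvalues_everything(plan: Plan, target: str, max_share: float = 0.01) -> bool:
    """Flag plans where the target object would make up more than max_share of
    everything left in the world, e.g. converting the Earth into paperclips.
    Reuses the Plan type from the sketch above."""
    total_after = sum(plan.after.values())
    if total_after <= 0:
        return True  # a plan that leaves nothing at all is flagged too
    return plan.after.get(target, 0.0) / total_after > max_share


# A paperclip plan that converts everything else into paperclips gets flagged.
plan = Plan(before={"paperclip": 0.0, "everything_else": 100.0},
            after={"paperclip": 100.0, "everything_else": 0.0})
assert outvalues_everything(plan, "paperclip")
```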
(another draft:)
If you ask an AI (AGI) to do something “as a human would do it”, you get safety but severely restrict the AI’s capabilities. No, you want the AI to accomplish the task in the most effective way, but you don’t want it to kill everybody. So you need one of these things:
Perfect instructions for AI.
Perfect morality for AI.
I think there’s a third way. You can treat the AI’s rewards (and objects in the world) as a “money system”. Then you can specify which types of money systems are definitely incorrect, or even make the AI predict it.
It would at least allow us to get rid of some universal AI and AGI bugs. I think that’s interesting.