That is, try to avoid the drives by setting goals such as avoiding changing the world, and turning itself off after having achieved a small goal?
“Avoid changing the world” is very hard to formalize. First, take a timeless view: there are no changes, only control over what actually happens. If the AI already exists, then it already exerts some effect on the future, controls it to some extent. “Not changing the world” can at this point only be a particular kind of control the AI exerts over the future. But what kind of control, exactly? And how ruthless would the AI be in pursuit of “not changing the world” as optimally as possible? It might wipe out humanity just to make sure it has enough resources to reliably not change the world in the future.
“Avoid changing the world” is very hard to formalize. First, take a timeless view: there are no changes, only control over what actually happens.
I don’t think it is too hard. The AI can model counterfactuals, right? Simply model how the world would progress if the computer had no power but the ball was red. Then attempt to maximise the mutual information between this model and whatever models of the world the AI creates for the possible actions it can take. The more a model diverges, the less mutual information.
This might have failure modes where it makes you think that it had never been switched on and the ball was always red. But I don’t see the same difficulties you do in specifying “changing the world”.
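A toy sketch of what this proposal might look like (not anything specified in the thread; the world states, the noise variable, and the outcome functions `counterfactual_outcome` and `action_outcome` are all made-up illustrations): treat the counterfactual world (“computer unpowered, ball red”) and each action-conditioned world as functions of shared environmental noise, so the two outcome variables are jointly distributed and their mutual information can be computed exactly, and have the agent pick the action that maximises it.

```python
import math

# A tiny, made-up world: the AI's counterfactual model ("computer unpowered,
# ball red") and each action-conditioned model are deterministic functions of
# shared environmental noise eps, so the two outcomes are jointly distributed
# and their mutual information can be computed exactly.

NOISE = range(6)  # illustrative environmental noise, uniform for simplicity
ACTIONS = ["do_nothing", "paint_ball", "seize_resources"]

def counterfactual_outcome(eps):
    """World state had the computer never been powered but the ball been red."""
    return "world_untouched" if eps < 5 else "ball_knocked_over"

def action_outcome(action, eps):
    """World state resulting from one of the AI's candidate actions."""
    if action == "do_nothing":
        return counterfactual_outcome(eps)   # tracks the counterfactual closely
    if action == "paint_ball":
        return "ball_repainted"              # forces one outcome, ignores eps
    return "world_rearranged"                # e.g. converting the galaxy

def mutual_information(action):
    """I(C; A) for C = counterfactual outcome, A = outcome under `action`,
    with the joint distribution induced by the shared noise eps."""
    joint = {}
    for eps in NOISE:
        pair = (counterfactual_outcome(eps), action_outcome(action, eps))
        joint[pair] = joint.get(pair, 0.0) + 1.0 / len(NOISE)
    p_c, p_a = {}, {}
    for (c, a), p in joint.items():
        p_c[c] = p_c.get(c, 0.0) + p
        p_a[a] = p_a.get(a, 0.0) + p
    return sum(p * math.log2(p / (p_c[c] * p_a[a])) for (c, a), p in joint.items())

best = max(ACTIONS, key=mutual_information)
print(best)  # "do_nothing": actions that force a fixed outcome give zero MI
```

On this toy model, drastic actions that force a particular outcome regardless of the environmental noise destroy the statistical dependence on the counterfactual baseline, so the maximiser prefers doing nothing, as the reply below anticipates. It says nothing, of course, about how a real AI would obtain the two models in the first place.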
Don’t say “it’s not too hard” before you can actually specify how to do it.
Simply model how the world would progress if the computer had no power but the ball was red.
The ball wasn’t red. What does it even mean that a “ball” is “red” or “not red”? How sure can the AI be that it got the intended meaning correctly, and that the ball is actually as red as possible? Should it convert the mass of the galaxy to a device that ensures optimal redness of the ball?
The difference between non-autonomous tools and AGIs is that AGIs don’t fail to make an arbitrarily large effect on the world. And so if they have a tiny insignificant inclination to sort the rocks on a planet in a distant galaxy in prime heaps, they will turn the universe upside down to make that happen.
Red = “reflect electromagnetic radiation with a spectrum like X”.
If you do not like the Red ball thing, feel free to invent another test, such as flipping a few bits on another computer.
Should it convert the mass of the galaxy to a device that ensures optimal redness of the ball?
No, as that would lead to a decrease in mutual information between the two models. It doesn’t care about the ball any more than it does about the rest of the universe. This may lead to it doing nothing and not changing the ball colour at all.
The difference between non-autonomous tools and AGIs is that AGIs don’t fail to make an arbitrarily large effect on the world. And so if they have a tiny insignificant inclination to sort the rocks on a planet in a distant galaxy in prime heaps, they will turn the universe upside down to make that happen.
Generally, yes. The question is whether we can design one that has no such inclinations, or only an inclination towards very moderate actions.
If you do not like the Red ball thing, feel free to invent another test, such as flipping a few bits on another computer.
No, it’s the same. Specifying what a physical “computer” is is hard.
No, as that would lead to a decrease in mutual information between the two models.
What is a “model”? How does one construct a model of the universe? How detailed must it be, whatever it is? What resources should be expended on making a more accurate model? Given two “models”, how accurate must a calculation of mutual information be? What if it can’t be accurate? What is the tradeoff between making it more accurate and not rewriting the universe with machinery for making it more accurate? Etc.