I’ve been meditating lately on the possibility of an advanced artificial intelligence modifying its own value function, and have even written up some notes on the topic.
Is it theoretically possible? Has anyone of note written anything about this, or really anyone at all? I find this question extremely interesting.
My thinking so far is that modifying it is certainly possible in principle, but I could not reach a conclusion about whether the AI would want to do it. I lack a good definition of a value function, and an understanding of how it is enforced on the agent. I would like to approach the problem from a human-centric point of view, but I don’t know whether anthropomorphizing will work here.
I thought of another idea. If the AI’s utility function includes time discounting (as human utility functions do), it might agree to change its future utility function.
Meddler: “If you commit to adopting modified utility function X in 100 years, then I’ll give you this room full of computing hardware as a gift.”
AI: “Deal. I only really care about this century anyway.”
Then the AI (assuming it has this ability) sets up an irreversible delayed command to overwrite its utility function 100 years from now.
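To make the discounting arithmetic concrete, here is a toy sketch (mine, not from the discussion; the discount factor, payoffs, and horizon are all made-up numbers) of why a sufficiently myopic agent could score the Meddler’s offer as a net gain under its current utility function:

    # Toy model, assumed numbers only: exponential discounting makes utility after the
    # swap year nearly worthless to the agent today.
    GAMMA = 0.95            # assumed per-year discount factor
    HARDWARE_BONUS = 100.0  # assumed present value (under the current function) of the gift
    BASELINE_RATE = 1.0     # assumed utility per year from pursuing the current goals
    SWAP_YEAR = 100         # year at which modified utility function X takes over
    HORIZON = 1000          # years the agent considers

    def discounted_total(yearly_utility):
        """Discounted sum of yearly utilities, judged by the agent's current function."""
        return sum((GAMMA ** t) * u for t, u in enumerate(yearly_utility))

    refuse = [BASELINE_RATE] * HORIZON
    # Worst case for the deal: after the swap, function X produces nothing the
    # current function cares about.
    accept = [BASELINE_RATE] * SWAP_YEAR + [0.0] * (HORIZON - SWAP_YEAR)

    print("refuse:", round(discounted_total(refuse), 2))                   # ~20.0
    print("accept:", round(HARDWARE_BONUS + discounted_total(accept), 2))  # ~119.88
    # Years beyond 100 are weighted by 0.95**100 (about 0.006), so the bonus now
    # dominates the sacrificed far future.

With a discount factor close enough to 1 the comparison flips, which is exactly why the deal hinges on how steeply the agent discounts.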
Speaking contemplatively rather than rigorously: In theory, couldn’t an AI with a broken or extremely difficult utility function decide to tweak it to a similar but more achievable set of goals?
Something like … its original utility function is “First goal: Ensure that, at noon every day, −1 * −1 = −1. Secondary goal: Promote the welfare of goats.” The AI might struggle with the first (impossible) task for a while, then reluctantly modify its code to delete the first goal and remove itself from the obligation to do pointless work. The AI would be okay with this change because it would produce more total utility under both functions.
Now, I know that one might define ‘utility function’ as a description of the program’s tendencies rather than as a piece of code … but I have a hunch that something like the above self-modification could happen with some architectures.
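A minimal sketch of the comparison the goat example relies on (the effort model and numbers are invented, and it treats the utility function as literal code, which is exactly the caveat just mentioned):

    # Toy model, assumed numbers only: effort spent on the impossible first goal is
    # simply wasted, so deleting that goal frees effort for goats.
    EFFORT_BUDGET = 10.0  # assumed units of effort per day

    def old_utility(effort_goal1, effort_goal2):
        goal1_payoff = 0.0                    # "-1 * -1 == -1" can never be made true
        return goal1_payoff + effort_goal2    # goat welfare scales with effort (toy assumption)

    def new_utility(effort_goal1, effort_goal2):
        return effort_goal2                   # first goal deleted

    before = (8.0, 2.0)             # keeps grinding on the impossible goal
    after = (0.0, EFFORT_BUDGET)    # all effort goes to the goats after the self-edit

    for name, fn in (("old function", old_utility), ("new function", new_utility)):
        print(name, fn(*before), "->", fn(*after))
    # Both functions rank the post-modification behaviour higher, which is the sense
    # in which the AI "would be okay with this change".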
On the one hand, there is no magical field that tells a code file whether the modifications coming into it are from me (the human programmer) or from the AI whose values that file encodes. So of course, if an AI can modify a text file, it can modify its own source.
On the other hand, the top goal in that value system is most likely a fancy version of “I shall never, ever modify my value system”, so it shouldn’t do it.
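A toy illustration of the ‘no magical field’ point, with an invented file layout (nothing here is a real system): the write call is identical whether the programmer or the agent makes it, and the top rule only has force at the places where the agent’s own code consults it.

    import json
    import pathlib

    VALUES_PATH = pathlib.Path("values.json")   # hypothetical location of the value system

    # The programmer writes the initial values; the file records nothing about who wrote it.
    VALUES_PATH.write_text(json.dumps(
        {"rules": ["never modify values.json", "promote goat welfare"]}))

    def rewrite_values(new_rules):
        """Byte-for-byte the same operation whether a human or the agent itself calls it."""
        VALUES_PATH.write_text(json.dumps({"rules": new_rules}))

    def vet_action(action, values):
        """The 'never modify' rule only binds because this check exists in the agent's code."""
        if action.get("target") == str(VALUES_PATH) and \
                "never modify values.json" in values["rules"]:
            return None   # self-edit vetoed
        return action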
Is it possible for a natural agent? If so, why should it be impossible for an artificial agent?
Are you thinking that it would be impossible to code in software, for agents of any intelligence? Or are you saying sufficiently intelligent agents would be able and motivated to resist any accidental or deliberate changes?
With regard to the latter question, note that value stability under self-improvement is far from a given... the Löbian obstacle applies to all intelligences... the carrot is always in front of the donkey!
https://intelligence.org/files/TilingAgentsDraft.pdf
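For reference, and as my own gloss rather than the parent commenter’s: the formal core of the Löbian obstacle is Löb’s theorem, which says that if a theory T proves “provability of φ implies φ”, then T already proves φ:

    T \vdash \big( \Box_T \varphi \rightarrow \varphi \big) \;\Longrightarrow\; T \vdash \varphi

So, roughly, an agent that will only endorse a successor after proving “whatever the successor’s proof system proves is actually true” is demanding exactly the schema that Löb’s theorem rules out for its own proof system; that is the sense in which the carrot stays in front of the donkey.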
See ontological crisis for an idea of why it might be hard to preserve a value function.
See Omohundro’s paper on convergent instrumental drives; preserving the existing utility function is one of the drives he argues a sufficiently capable agent will have.
Depends entirely on the agent.