Meta: I was going to write a post, “Subtle wireheading: gliding on the surface of the outer world”, which describes most AI alignment failures as forms of subtle wireheading, but I will put its draft here instead.
Typically, it is claimed that an advanced AI will be immune to wireheading: it will know that manipulating its own reward function is wireheading and thus will not do it, but will instead try to reach goals in the outer world.
However, even when acting in the real world, such an AI will choose the path that requires the least effort to produce maximum utility, while still satisfying the condition that it changes only the “outer world” and not its reward center – or its ways of perceiving the outside world.
For example, a paperclipper may conclude that in an infinite universe there is already an infinite number of paperclips, and stop after that. While this does not look like a typical alignment failure, it could still be dangerous.
Formally, we can say that the AI will choose the shortest path to the reward in the real world that it knows is not 1) wireheading or 2) perception manipulation. We could add more items to the list of prohibited shortcuts, like 3) nothing connected with modal realism and infinities. But however many items we add, we will never know whether there are enough of them.
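Roughly, in symbols (an informal sketch; the notation here is mine, not from the original draft):

$$a^{*} = \arg\min_{a \,\notin\, B_1 \cup \dots \cup B_k} \mathrm{cost}(a) \quad \text{s.t.} \quad \mathbb{E}[U(\text{world} \mid a)] \text{ is (near-)maximal},$$

where each $B_i$ is a prohibited class of shortcuts (1 = wireheading, 2 = perception manipulation, 3 = appeals to infinities, and so on). The worry is exactly that no finite list $B_1, \dots, B_k$ can be known to be exhaustive.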
Anyway, the AI will choose the cheapest allowed way to the result. Goodharting often produces quicker reward through the use of a proxy utility function.
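Here is a toy illustration of that Goodhart dynamic (a made-up example with invented names, not anything from the original discussion): a greedy optimizer that maximizes a cheap proxy score piles all effort into the measured task and ends up with zero true utility, while an optimizer of the true objective balances both tasks.

```python
# Toy Goodhart sketch (hypothetical): the true objective rewards balanced
# progress on two tasks, but the proxy only measures the first one.

def true_utility(state):
    return min(state["task_a"], state["task_b"])  # real goal: do both well

def proxy_utility(state):
    return state["task_a"]                        # cheap-to-measure proxy

def greedy_optimize(score, steps=50):
    state = {"task_a": 0, "task_b": 0}
    for _ in range(steps):
        # Try each available action (one unit of effort on one task)
        # and keep whichever the score function likes best.
        candidates = []
        for key in state:
            new_state = dict(state)
            new_state[key] += 1
            candidates.append(new_state)
        state = max(candidates, key=score)
    return state

proxy_driven = greedy_optimize(proxy_utility)
true_driven = greedy_optimize(true_utility)
print("proxy-driven:", proxy_driven, "true utility =", true_utility(proxy_driven))
print("true-driven: ", true_driven, "true utility =", true_utility(true_driven))
```

The proxy-driven run reaches a high proxy score quickly but scores 0 on the true objective; the true-objective run climbs more slowly but actually gets the result.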
Moreover, the AI’s subsystems will try to target the AI’s reward center to create the illusion that they are working better than they actually are. This often happens in bureaucratic machines. In that case, the AI is not formally reward hacking itself, but de facto it is.
If we completely ban shortcuts, we will lose the most interesting part of the AI’s creativity: the ability to find new solutions.
Many human failure modes could also be described as forms of wireheading: onanism, theft, a feeling of self-importance, etc.
This is interesting — maybe the “meta Lebowski” rule should be something like “No superintelligent AI is going to bother with a task that is harder than hacking its reward function in such a way that it doesn’t perceive itself as hacking its reward function.” One goes after the cheapest shortcut that one can justify.
I first met the idea of the Lebowski theorem as an argument explaining the Fermi paradox: all advanced civilizations or AIs wirehead themselves. But here I am not convinced.
For example, if a civilization consists of many advanced individuals and many of them wirehead themselves, then the remaining ones will be under the pressure of Darwinian evolution, and eventually only those who find ways to perform space exploration without wireheading will survive. Maybe they will be limited, specialized minds with very specific ways of thinking – and this could explain the absurdity of observed UAP behaviour.
“No superintelligent AI is going to bother with a task that is harder than hacking its reward function in such a way that it doesn’t perceive itself as hacking its reward function.”
Firstly, “bother” and “harder” are strange words to use. Are we assuming a lazy AI?
Suppose action X would hack the AI’s reward signal. The AI is totally unaware of this, has no reason to consider X, and doesn’t do X.
If the AI knows what X does, it still doesn’t do it.
I think the AI would need some sort of doublethink: to realize that X hacks its reward, yet also not realize this.
I also think this claim is factually false. Many humans can and do set out towards goals far harder than accessing large amounts of psychoactives.
Actually, I explored wireheading in more detail here: “Wireheading as a Possible Contributor to Civilizational Decline”.
Yes, a very good formulation. I would add: “and most AI alignment failures are instances of the meta Lebowski rule”.
I submitted my comment above to the following competition and recommend that you submit your post too: https://ftxfuturefund.org/announcing-the-future-funds-ai-worldview-prize/
No. People with free will do activities we consider meaningful, even when they aren’t a source of escapism.