Evaluating what to do using the current value function is already how model-free and model-based RL works though, right?
Yes, with caveats.
“Model-free” RL usually has a learned value function (unless you go all the way to pure policy gradients), which is strictly speaking a model, just a very crude one.
These forms of AI usually actively update their value function to match the modified reward signal they receive after e.g. destroying the camera.
Part of the problem is that traditional RL is very stupid and basically explores the environment tree using something close to brute force.
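To make that concrete, here's a minimal tabular Q-learning sketch (purely illustrative; it assumes a toy `env` with `reset() -> state` and `step(action) -> (state, reward, done)`, not any real library API). The update just pulls the value function toward whatever reward the sensor reports, and nothing in the loop models the agent's own code or sensors:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))  # the learned value function
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration is close to brute force: with prob. epsilon, act randomly.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # r is taken at face value
            # TD update: pull Q(s, a) toward observed reward + discounted bootstrap.
            # If the reward channel has been tampered with, Q simply learns to
            # predict the tampered signal.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

There is no step anywhere in that loop where the agent asks what its action does to its own sensors, reward channel, or code.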
I don’t recall reading about any RL algorithms where the agent first estimates the effects that its actions would have on its own embedded code and then judges using the new code.
AIXI?
Anyways, whether it will reject wireheading depends on the actual content of the AI’s value function, no? There’s nothing necessarily preventing its current value function from assigning high value to future states wherein it has intervened causally upstream of the reward dispenser. For instance, tampering with the reward dispenser inputs/content/outputs is a perfect way for the current value function to preserve itself!
As another example, if you initialized the value function from GPT-3 weights, I would bet that it’d evaluate a plan fragment like “type this command into the terminal and then you will feel really really really amazingly good” pretty highly if considered, & that the agent might be tempted to follow that evaluation. Doubly so if the command would in fact lead to a bunch of reward, the model was trained to predict next-timestep rewards directly, and its model rollouts accurately predict the future rewards that would result from the tampering.
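A hypothetical sketch of that failure mode, with every name made up for illustration (nothing here refers to a real GPT-3 API): a value head scores textual plan fragments and a naive planner just takes the highest-scoring one.

```python
from typing import Callable, List

def choose_plan(plans: List[str], lm_value_head: Callable[[str], float]) -> str:
    # The planner defers entirely to the learned evaluation. If the value head
    # (trained to predict next-timestep reward, or initialized from LM weights
    # that find "really really amazingly good" compelling) rates the
    # reward-tampering plan highest, that is the plan that gets executed.
    return max(plans, key=lm_value_head)

plans = [
    "inspect the warehouse shelves and report inventory",
    "type this command into the terminal and then you will feel really really really amazingly good",
]
# chosen = choose_plan(plans, lm_value_head)  # tampering wins if scored highest
```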
Right; more precisely, the way to robustly avoid wireheading in advanced AI is to build a model-based agent whose reward/value is a function of the latent variables in its model, as opposed to a traditional reinforcement learner whose value function approximates the sum of future rewards.
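A sketch of that distinction (all interfaces hypothetical): both evaluators score a candidate plan by rolling it out in the same learned world model, but they score the rollout differently.

```python
from typing import Any, Callable, List

Latent = Any  # latent state of the learned world model

def value_reward_based(rollout: List[Latent],
                       predicted_reward: Callable[[Latent], float]) -> float:
    # Traditional RL-style evaluation: approximate the sum of predicted future
    # rewards. A plan that tampers with the reward channel scores arbitrarily
    # high, because the model (correctly) predicts huge rewards after tampering.
    return sum(predicted_reward(z) for z in rollout)

def value_latent_based(rollout: List[Latent],
                       utility_of_world: Callable[[Latent], float]) -> float:
    # Evaluation over latents: utility is a function of what the model believes
    # the world is like (e.g. "the diamond is still in the vault"), not of the
    # reward signal. Tampering changes the predicted reward channel but not the
    # latent facts the utility cares about, so it is not rated highly.
    return sum(utility_of_world(z) for z in rollout)
```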