Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.
Here’s a hacky patch that doesn’t entirely solve it, but might help:
Presumably for humans, the RPE/reward signal is somehow wired into the world-model, since we have a clear awareness of it. But you could simply not give it as an input to the AI's world-model to begin with.
As long as it doesn’t start hacking into its own runtime and peeking at the variables, this can mean that it doesn’t have a variable corresponding to its reward in it’s world-model, which would prevent it from wanting to use it for wireheading.
Of course, relying on the AI never peeking at its own reward signal is unstable, so we probably wouldn't want to count on it. The stable approach would be what we discussed in the other thread: manually coding the value function. That would protect against wireheading in fundamentally the same way, by eliminating the need for a separate "reward" variable in the world-model.
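For contrast, here's an equally rough sketch of what I take the "manually coded value function" variant to look like: the value is a hand-written function of world-model features, so there's no reward signal anywhere in the loop for a wireheading drive to latch onto. The feature names here are, of course, made up.

```python
# Rough sketch of a hand-coded value function (made-up feature names):
# value is computed directly from world-model features, with no learned
# reward target anywhere in the system.
def hand_coded_value(world_model_features: dict) -> float:
    return (
        2.0 * world_model_features.get("task_progress", 0.0)
        - 5.0 * world_model_features.get("predicted_harm", 0.0)
    )


if __name__ == "__main__":
    print(hand_coded_value({"task_progress": 0.6, "predicted_harm": 0.1}))
```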