Thanks for taking the time to answer; I no longer endorse most of what I wrote.
> I think that if the AGI has a perfect motivation system then we win, there’s no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
and from the post:
> And if we get to a point where we can design reward signals that sculpt an AGI’s motivation with surgical precision, that’s fine!
This is mostly where I went wrong: I assumed a perfect reward signal coming from some external oracle in examples (e.g. wireheading) where your entire point was that we don’t have a perfect reward signal.
So basically, I think we agree: a perfect reward signal may be enough in principle, but in practice the reward signal will not be perfect and may not be enough, at least not a single unified reward.
Sorry for my very late reply!