A steered optimizer has an incentive to remove all steering control...
Well, not necessarily. We could steer it into a motivational system in which it happily accepts steering signals, hopefully, right?
...as fast as possible… Perhaps this means steered optimizers are...likely to clumsily attempt to wrest control before they’re strong enough?
That would be nice! One situation where it might fail is that it takes a while for the system to develop an understanding of its situation, and by the time it understands what the steering signals are and how they work, it is already competent enough to skillfully plan around them. More generally, I have low confidence about the relative difficulties and learning curves of a future AGI, and don’t want to rely on anything like that, even if it seems intuitively probable.
After thinking about it for a minute, it’s not obvious to me whether mesa-optimizers or steered optimizers are better or worse in terms of the likelihood of clumsy failed attempts at treacherous turns...
Like, I might think that some crazy edge case sounds great (endlessly eating a hypercake in an endless forest of more and more interesting plants), but I always reserve some probability mass for the possibility that I would in fact find it empty and meaningless and not what I value
What if the hypercake was laced with a special nanobot that would travel around your brain and deactivate the “this is empty and meaningless” gut feeling and replace it with a “this is deeply fulfilling” feeling? Would you eat it then?
For me, for some of my goals, I feel a strong pull of goal preservation—like, I would commit today to a vow that, if “making the world a better place” ceased to feel fulfilling for me, and started to feel empty and pointless, I will alter my brain however necessary to make “making the world a better place” feel fulfilling again. Other goals I don’t feel like I need to preserve: for example, I enjoy chocolate today, but I am not particularly disturbed by the thought that I might stop enjoying chocolate someday in the future, and start enjoying something else instead. I think the difference is outward-facing goals are in the first category, and goals that mainly impact myself are in the second category. Or maybe “socially-praiseworthy goals important to my self-image” are in the first category. Or something else. I don’t know… :-)
We could steer it into a motivational system in which it happily accepts steering signals, hopefully, right?
That’s true. I should have said “a misaligned steered optimizer”
don’t want to rely on [things like AGI learning curves], even if it seems intuitively probable.
Strongly agree
What if the hypercake was laced with a special nanobot that would travel around your brain and deactivate the “this is empty and meaningless” gut feeling and replace it with a “this is deeply fulfilling” feeling? Would you eat it then?
Indeed not! I’m not sure if this is obvious (because the example wasn’t especially well chosen), but I meant to suggest something like “if I had to choose my best guess at a thing that would be selfishly good for me in the future, I would care more about my actual experience of it (and the subcortically-generated reward) than about my current guess of what I would feel”.
I think the difference is outward-facing goals are in the first category, and goals that mainly impact myself are in the second category
That was my first guess when reading your “making the world a better place” example. But I don’t think it quite works. If I have an outward-facing goal of ensuring more people enter long-lasting meaningful relationships, I want that goal to be able to shift in the face of data from reality. But perhaps my imagination is misfiring because that’s not actually a very important goal to me.
Thanks for the comment!