A thought: it seems to me like the algorithm you’re describing here is highly non-robust to relative scale, since if the neocortex became a lot stronger it could probably just find some way to deceive/trick/circumvent the subcortex to get more reward and/or avoid future updates. I think I’d be pretty worried about that failure case if anything like this algorithm were ever to be actually implemented in an AI.
Thanks! Yes, I am also definitely worried about that.
I 100% agree that the default result, in the absence of careful effort, would be value lock-in at some point in time, when the neocortex part grows clever enough to undermine the subcortex part, and then you'd better hope that the locked-in values are what you want!
On the optimistic side:
1. There’s no law that says the subcortex part has to be super dumb and simple; we can have less-powerful AIs steering more powerful AIs, helped by intrusive interpretability tools, running faster and in multiple instances, etc. (as has been discussed in other contexts of course);
2. We can try to instill a motivation system from the start that doesn’t want to undermine the subcortex part—in particular, corrigible motivation. This basically relies on the “corrigibility is a broad basin of attraction” argument being correct, I think.
On the pessimistic side, I’m not at all confident that either of those things would work. For (2) in particular, I remain concerned about ontological crises (or other types of goal instability upon learning and reflection) undermining corrigibility after an indeterminate amount of time. (This remains my go-to example of a possibly-unsolvable safety problem, or at least I have no idea how to solve it.)
So yeah, maybe we would be doomed in this scenario (or at least doomed to roll the dice). Or maybe we just need to keep working on it :-)