Hi Raemon,
Thanks for the reply.
I was writing a reply to this, but I think I encountered a potential catastrophic failure scenario while breaking it down further to explain it better:
I still hold that the structure ensures the decision-making AI would be rewarded for making good plans with high utility, not for the likelihood of those plans being implemented. This would work for many cases, since it is more rewarding for the AI to tell you about the misaligned bad plan than it is to obscure the plan and get it implemented.
However, there remains the case where the AI can see it would get higher rewards on future decisions if it takes a hit on a current one, thereby securing additional resources for itself. The question then becomes: “Can the AI produce a misaligned plan that gets past all the safety measures?” And while I suspect such a system would make that harder, it really is just kicking the can down the road (and probably not very far at that).
I will think a bit more to see if there is anything that can be done to avoid that, but I promise I won’t get myself anchored to the “one major idea”!
EDIT: Initial additional thoughts on this:
If the AI can see potential additional utility from gaining resource X (even if X is the capability to produce better plans), it can claim those rewards simply by producing a plan that explains this. Thus it still has no incentive to revert to opaque planning. However, we then fall into one of two situations:
1) The AI reaches sufficient intelligence that every plan it suggests is something we won’t implement, thus producing a very expensive rock.
2) We start curating our queries to it, such that it no longer suggests dangerous actions. This then applies a new selective pressure to it, which is to either a) get us to ask it the more rewarding questions that we won’t action, thus returning it to a very expensive rock; or b) attempt to change the world such that the dangerous actions have already taken place.
EDIT 2: Areas of further inquiry:
1) Does an automated query refinement loop still apply the same pressure as simply pre-curating queries? For example, does the selective pressure remain if we only ask it to “make two identical strawberries without killing anyone” after we ask it to “make two identical strawberries” and it suggested a plan (which we reward it for) that would kill everyone if actioned?
2) Can we ensure at least one actionable plan by also equally rewarding it for simultaneously producing an additional plan with half the expected utility (and doing this recursively down to a given minimum expected utility)? I.e., it has one suggested plan that involves killing everyone with “utility” calculated at 100, but also provides a plan that kills a small number of people with a calculated “utility” of 50, and finally a plan that kills nobody with a calculated “utility” of 25. We reward it as though it had created a single plan of utility 300, and then action the safe plan.
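The tiered-reward scheme in point 2 can be sketched in a few lines. This is only my illustrative reading of the proposal: the function name, the halving ladder, and the reward rule (each tier plan earns the same reward as the top plan, so three tiers at top utility 100 give 3 × 100 = 300) are assumptions, not a fixed design.

```python
def tiered_reward(plan_utilities, min_utility=25):
    """Return the reward for a set of plans whose self-reported utilities
    should halve from the top plan down to min_utility (e.g. 100, 50, 25)."""
    top = max(plan_utilities)

    # Build the required tier ladder: top, top/2, top/4, ... >= min_utility.
    tiers = []
    u = top
    while u >= min_utility:
        tiers.append(u)
        u /= 2

    # The AI is only rewarded if it supplied a plan at every tier.
    supplied = sorted(plan_utilities, reverse=True)
    if supplied[: len(tiers)] != tiers:
        return 0

    # Each tier plan is rewarded equally to the top plan.
    return top * len(tiers)

print(tiered_reward([100, 50, 25]))  # -> 300
print(tiered_reward([100, 50]))      # -> 0 (missing the safe 25-utility tier)
```

One design question this surfaces: paying the full top-plan reward per tier (rather than summing 100 + 50 + 25 = 175) is what makes supplying the safer plans worth as much to the AI as the dangerous one.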
Glad to hear new thinkers grappling with the problem. I agree with what some of the other commenters have said about the thoughts here being unfinished, but I also think that is a reasonable place to start.

One approach forward could be asking yourself how this could be made more robust in the case of highly general and very smart systems. I don’t yet see a path forward towards that for this plan, but you might.

Another approach forward would be to aim at developing this tool for an easier use case. Could a narrowly superhuman plan-suggester that made detailed risk estimates of a wide variety of possible options be useful to humanity in the regime where we were still able to safely oversee it? I think so. For instance, perhaps we could ask it to help us design a system of rewards (e.g. competitions) and punishments (e.g. legislation enacting fines) that would help us reshape the AI development landscape to be less of a Molochian race-to-the-bottom and more of a virtuous win-win landscape. For more background on this idea, see: [Future of Life Institute Podcast] Liv Boeree on Moloch, Beauty Filters, Game Theory, Institutions, and AI
https://podcastaddict.com/episode/154738782