Mod note: The LessWrong team is experimenting with stricter moderation, generally raising the bar for which new-user content gets approved on LessWrong.
I’m approving this post, but to give people a sense of how I’m thinking about things, this is on the edge of what I’d approve, and some other mods weren’t sure they’d approve it. In my ideal LessWrong we would have the capacity to onboard all incoming users with this type of post, but the capacity of both moderators and established users is somewhat limited.
This post feels like a decent start at thinking about the alignment problem, but isn’t really engaging with the hard parts, such as:
if your AI doesn’t actually somehow help end the acute risk period, you’ve either built a very expensive rock, or you’ve just incrementally advanced AI capabilities without helping solve the hard parts of the alignment problem. You haven’t specified how to do that.
if the optimization level of this AI were turned up high enough to output genuinely interesting novel claims, it could develop dangerous consequentialism as a side effect (see What’s Up With Confusingly Pervasive Goal Directedness? and A central AI alignment problem: capabilities generalization, and the sharp left turn)
I recommend reading through AGI Ruin: A List of Lethalities and Alignment Research Field Guide to get a sense of what problems to explore in more detail and some advice on how to tackle them. I also recommend generally babbling a larger list of ideas, questions or projects to explore so you aren’t overly anchored on one major idea.
Hi Raemon,
Thanks for the reply.
I was writing a reply to this, but I think I encountered a potential catastrophic failure scenario while breaking the idea down further to explain it better:
I still hold that the structure ensures the decision-making AI is rewarded for making good, high-utility plans, not for the likelihood of those plans being implemented. And this would work in many cases, as it is more rewarding for the AI to tell you about a misaligned bad plan than it is to obscure the plan and get it implemented.
However, there remains the case where the AI can see it would get higher rewards on future decisions if it takes a hit on a current one, thereby securing additional resources for itself. Then the question becomes: “Can the AI produce a misaligned plan that gets past all the safety measures?” And while I suspect such a system would make that harder, it really is just kicking the can down the road (and probably not very far at that).
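To make this concrete, here is a toy calculation with entirely made-up numbers (the per-query reward, the size of the hit, and the capability bonus are all assumptions for illustration): if taking a hit on one query secures a resource that raises the AI’s reward on every later query, the one-off loss is repaid many times over.

```python
# Toy numbers (all hypothetical) for the "take a hit now, win later" incentive:
# sandbagging one query to secure extra planning capability can pay off if that
# capability raises the reward on every subsequent query.

honest_reward_per_query = 100   # reward for the best transparent plan each round
sandbagged_reward = 60          # reduced reward on the round where it schemes
capability_bonus = 15           # assumed per-query gain once the resource is secured
future_queries = 20

always_honest = honest_reward_per_query * (1 + future_queries)
scheme_once = sandbagged_reward + (honest_reward_per_query + capability_bonus) * future_queries

print(f"always honest: {always_honest}")  # 2100
print(f"scheme once:   {scheme_once}")    # 2360
```

Under those numbers, scheming once beats always being honest, which is why everything then hinges on whether a misaligned plan can get past the safety measures.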
I will think a bit more to see if there is anything that can be done to avoid that, but I promise I won’t get myself anchored to the “one major idea”!
EDIT: Initial additional thoughts on this:
If the AI can see potential additional utility from gaining resource X (even if X is the capability to produce better plans), it can claim those rewards simply by producing a plan that says so. Thus it still has no incentive to revert to opaque planning. However, we now fall into one of two situations:
1) The AI reaches sufficient intelligence that every plan it suggests is something we won’t implement, thus producing a very expensive rock.
2) We start curating our queries to it, such that it no longer suggests dangerous actions. This then applies a new selective pressure to it, which is to either a) get us to ask the more rewarding questions that we won’t action, returning us to the very expensive rock; or b) attempt to change the world such that the dangerous actions have already taken place.
EDIT 2: Areas of further inquiry:
1) Does an automated query-refinement loop still apply the same pressure as simply pre-curating queries? For example, does the selective pressure remain if we only ask it to “make two identical strawberries without killing anyone” after we first asked it to “make two identical strawberries” and it suggested a plan (which we rewarded it for) that would kill everyone if actioned? (A loop sketch follows after this list.)
2) Can we ensure at least one actionable plan by also equally rewarding it for simultaneously producing an additional plan with half the expected utility (and doing this recursively down to a given minimum expected utility)? I.e., it has one suggested plan that involves killing everyone with “utility” calculated at 100, but also provides a plan that kills a small number of people with a calculated “utility” of 50, and finally a plan that kills nobody with a calculated “utility” of 25. We reward it as though it had created a single plan of utility 300, and then action the safe plan. (A sketch of this reward scheme also follows below.)
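For point 1, here is a minimal sketch of what I mean by an automated query-refinement loop; `suggest_plan` and `find_violation` are hypothetical stand-ins for the plan-suggesting AI and for whatever review process flags unacceptable consequences.

```python
# Minimal sketch of the automated query-refinement loop from point 1.
# suggest_plan(query) stands in for the plan-suggesting AI; find_violation(plan)
# stands in for the review process that flags an unacceptable consequence.

def refine_until_acceptable(query, suggest_plan, find_violation, max_rounds=10):
    for _ in range(max_rounds):
        plan = suggest_plan(query)
        # Note: the AI is still rewarded for this plan's stated utility,
        # even if the plan is rejected as unsafe below.
        violation = find_violation(plan)           # e.g. "killing anyone"
        if violation is None:
            return plan                            # an actionable plan
        query = f"{query} without {violation}"     # e.g. "... without killing anyone"
    return None  # every refinement still produced an unacceptable plan
```

The open question is whether rewarding the rejected intermediate plans inside this loop applies the same selective pressure as hand-curating the queries up front.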
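For point 2, a sketch of the tiered reward under my reading that each halved-utility fallback plan is rewarded at the top plan’s rate (which is how 100, 50 and 25 pay out as 300); the utilities and the minimum are the hypothetical numbers from the example.

```python
# Tiered-plan reward from point 2: the AI submits its best plan plus fallback
# plans, each at roughly half the previous utility, down to a minimum. Every
# eligible plan is rewarded at the top plan's rate, so adding safer fallbacks
# never costs the AI anything.

def tiered_reward(plan_utilities, min_utility=25):
    """plan_utilities: calculated utilities of the submitted plans, best first."""
    top = plan_utilities[0]
    eligible = [u for u in plan_utilities if u >= min_utility]
    return top * len(eligible)  # each eligible tier pays out at the top rate

# Hypothetical plans from the example: utility 100 (kills everyone),
# 50 (kills a small number of people), 25 (kills nobody).
print(tiered_reward([100, 50, 25]))  # 300 -- we then action the safe utility-25 plan
```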
Glad to hear new thinkers grappling with the problem. I agree with what some of the other commenters have said about the thoughts here being unfinished, but I also think that that is a reasonable place to start.

One approach forward could be asking yourself how this could be made more robust in the case of highly general and very smart systems. I don’t yet see a path towards that for this plan, but you might.

Another approach forward would be to aim to develop this tool for an easier use case. Could a narrowly superhuman plan-suggester which made detailed risk estimates of a wide variety of possible options be useful to humanity in the regime where we were still able to safely oversee it? I think so. For instance, perhaps we could ask it to help us design a system of rewards (e.g. competitions) and punishments (e.g. legislation enacting fines) that would help us reshape the AI development landscape to be less of a Molochian race-to-the-bottom and more of a virtuous win-win landscape. For more background on this idea, see the Future of Life Institute Podcast episode “Liv Boeree on Moloch, Beauty Filters, Game Theory, Institutions, and AI”: https://podcastaddict.com/episode/154738782