[A couple of (to me seemingly fairly obvious) points about value uncertainty, which it still seems like a lot of people here may not have been discussing:]
Our agent needs to be able to act in the face of value uncertainty. That means that each possible action the agent is choosing between has a distribution of possible values, for two reasons: 1) the universe is stochastic, or at least the agent doesn’t have a complete model of it, so it cannot fully predict what state of the universe an action will produce—with infinite computing power this problem is gradually solvable via Solomonoff induction, just as was considered for AIXI [and Solomonoff induction has passable computable approximations that, when combined with goal-oriented consequentialism, are generally called “doing science”]; 2) the correct function mapping from states of the universe to human utility is also unknown, and also has uncertainty. These two uncertainties combine to produce a probability distribution of possible true-human-utility values for each action.
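As a picture of how these two uncertainties combine, here is a minimal Monte Carlo sketch in Python (the toy world model and utility-function posterior are invented stand-ins, purely for illustration): sample an outcome from the world model, sample a candidate utility function from the posterior, score the outcome, and repeat; the resulting samples are the distribution of possible true-human-utility values for that action.

```python
import random

# Toy stand-ins (assumptions for illustration only, not anyone's real models):
# a stochastic world model plus a posterior over candidate utility functions.

def sample_outcome(action: float, rng: random.Random) -> float:
    """World-model uncertainty: the same action can land in different states."""
    return action + rng.gauss(0.0, 1.0)

def sample_utility_fn(rng: random.Random):
    """Value uncertainty: we are unsure which state -> utility mapping is correct."""
    weight = rng.gauss(1.0, 0.3)
    return lambda state: weight * state

def value_distribution(action: float, n_samples: int = 10_000, seed: int = 0) -> list:
    """Combine both uncertainties into a distribution of possible true utilities."""
    rng = random.Random(seed)
    return [sample_utility_fn(rng)(sample_outcome(action, rng)) for _ in range(n_samples)]

samples = value_distribution(action=1.0)
print(f"mean {sum(samples)/len(samples):.2f}; the spread comes from both sources of uncertainty")
```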
1) We know a lot about reasonable priors for utility functions, and should encode this into the agent. The agent is starting in an environment that has already been heavily optimized by humans, who were in turn previously optimized for life on this planet by natural selection. So this environment’s utility for humans is astonishingly high, by the standards of randomly selected patches of the universe or random arrangements of matter. Making large or random changes to it thus has an extremely high probability of decreasing human utility. Secondly, any change that takes the state of the universe far outside what the agent has previously observed puts it into a region where the agent has very little idea what humans will think that state’s utility is—the agent has almost no non-prior knowledge about states far outside its previously observed distribution. If the agent is a GAI, there are going to be actions it could take that can render the human race extinct—a good estimate of the enormous negative utility of that possibility should be encoded into its priors for human utility functions over very unknown states, so that it acts with due rational caution about this possibility. It needs to be very, very, very certain that an act cannot have that result before it risks taking it. Also, if the state being considered is also far outside the prior distribution for the humans it has encountered, they may have little accurate data with which to estimate its utility either—even if it sounds pretty good to them now, they won’t really know whether they like it until they’ve tried living with it, and they’re not fully rational and have limited processing power and information.

So in general, if a state is far out of the distribution of previously observed states, it’s a very reasonable prior that its utility is much lower, that the uncertainty of its utility is high, that that uncertainty is almost all on the downside (the utility distribution has a fat lower tail, but not a fat upper one: it could be bad, or it could be really, really bad), and that the downside has non-zero weight all the way down to the region of “extinction of the human race”—so overall, the odds of the state actually being better than the ones in the previously observed state distribution are extremely low. [What actually matters here is not whether you’ve observed the state, but whether your Solomonoff-induction-like process has given you high enough confidence that you can predict its utility well enough to overcome this prior, bearing in mind the sad fact that you’re not actually running the infinite-computing-power version of Solomonoff induction.] This is also true even if the state sounds good offhand, and even if it sounds good to a human you ask, if it’s well outside their distribution of previously observed states—especially if it’s a state that they might have cognitive biases about, or insufficient information or processing power to accurately evaluate. If it were a state that they could predict was good and arrange to reach, they would already have done so, after all. So either don’t optimize over those states at all, or at least use priors in your utility-function distribution for them that encode all of these reasonable assumptions and will make these states reliably get ignored by the optimizer. If you consider a plan leading to such a state at all, probably the first thing you should be doing is safe investigations to further pin down both its achievability and a better estimate of its true utility.
So, before deciding to transport humans to Mars, investigate not just rocketry and building a self-sustaining city there, but also whether humans would actually be happy in a city on Mars (at a confidence level much higher than just asking them “Does going to Mars sound fun to you?”). Incidentally, another very reasonable prior to give the agent is “the smarter the smartest GAI in the world is, relative to a smart human, the higher the risk of an AI apocalypse”. This is just a special case of “going outside the previously observed range of states is almost always bad”, but it’s an extremely important one, and I’d suggest pre-encoding it in the priors, along with other similar best current thinking on existential risks.
This is how you avoid optimizers frequently finding optimal-looking states way outside their training distribution—you teach them the true fact that they live in a very unusual place where, for the specific thing that they’re supposed to be figuring out how to optimize for, almost all such states are bad, and some are abysmal, because the environment is already heavily optimized for that value. So you write “here there could be dragons, and probably at least bears” all over the rest of the map.
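To make this kind of prior concrete, here is a small hedged sketch in Python (the functional form, the constants, and the scalar “novelty” measure are all invented for illustration, not a claim about what the right prior is): the further a state is from the previously observed distribution, the lower the prior mean, the wider the uncertainty, and the fatter the lower tail.

```python
import random

def utility_prior_samples(novelty: float, n: int = 10_000, seed: int = 0) -> list:
    """Illustrative prior over a state's utility as a function of how far
    out-of-distribution it is (novelty in [0, 1]).  In-distribution states get a
    tight, mildly positive prior; novel states get a lower mean, much wider
    uncertainty, and a fat lower tail.  All constants are arbitrary illustrative choices."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = rng.gauss(1.0 - 2.0 * novelty, 0.2 + 2.0 * novelty)
        # With probability growing in novelty, the true utility is catastrophic:
        # the non-zero weight reaching far down the lower tail.
        if rng.random() < 0.05 * novelty:
            u -= rng.expovariate(1.0 / 100.0)
        samples.append(u)
    return samples

for nov in (0.0, 0.5, 1.0):
    s = sorted(utility_prior_samples(nov))
    print(f"novelty={nov}: mean={sum(s)/len(s):+.1f}, 1st percentile={s[len(s)//100]:+.1f}")
```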
[Note that a very plausible result of building a rational system with these priors, at least without preloading a lot of training data into it to give it a significant set of previously observed states that it has high confidence in the safety of, is that it either on startup tells you “please turn me off immediately—I am far too dangerous to be allowed to exist”, or else goes catatonic.]
2) So, given those cautious priors, should we just have the agent optimize the average of the utility distribution of each state it considers, so that we’re optimizing a single currently-estimated utility function that changes as the agent uses a Bayesian/Solomonoff-like process to learn more about the true value? No, that’s also a bad idea—it leads to what might be called over-optimizing [though I’d prefer to call it “looking elsewhere”]. The distribution contains more information than just its average, and that information is useful for avoiding over-optimization/looking elsewhere.
Even with this set of priors, i.e. even if the agent (effectively or actually) optimizes only over states that are in or near your distribution of previously observed states, there is a predictable statistical tendency for the process of optimizing over a very large number of states to produce not the state with the largest true utility, but rather one whose true utility is in fact somewhat lower but just happens to have been badly misestimated on the high side—this is basically the “look-elsewhere effect” from statistics, and is closely related to “p-hacking”. If we were playing a multi-armed bandit problem and the stakes were small (so none of the bandit arms have “AI apocalypse” or similar human-extinction-level events on their wheel), this could be viewed as a rather dumb exploration strategy for sequentially locating all such states and learning that they’re actually not so great after all, by just trying them one after another and being repeatedly disappointed. If all you’re doing is bringing a human a new flavor of coffee to see if they like it, this might even be not that dreadful a strategy, if perhaps annoying for the human after the third or fourth try (so the more flavors the coffee shop has, the worse this strategy gets). But the universe is a lot more observable and structured than a multi-armed bandit, and there are generally much better/safer ways to find out whether a world state would be good for humans than just trying it on them (you could ask the human if they like mocha, for example).
So what the agent should be doing is acting cautiously, and allowing for the size of the space it is optimizing over. For simplicity of statistical exposition, I’m temporarily going to assume that all our utility distributions, here for states in or near the previously-observed-states distribution, are well enough understood and multi-factored that we’re modeling them as normal distributions (rather than as distributions that are fat-tailed on the downside, or frequently complex and multimodal, both of which are much more plausible), and also that all the normal distributions have comparable standard deviations. Under those unreasonably simplifying assumptions, here is what I believe is an appropriately cautious optimization algorithm that suppresses over-optimization:
Make a rough estimate of the number of statistically-independent-in-relative-utility states/actions/factors in the utility calculation that you are optimizing across, whichever count is lowest (so, if there is a continuum of states, but you somehow only have one normally-distributed uncertainty involved in deducing their relative utility, that number would be one).
Calculate how many standard deviations above the mean the highest sample will on average be if you draw that many independent samples from a single standard normal distribution, and call this L (for “look-elsewhere factor”). [In fact, you should be using a set of normal distributions for the individual independent variables you found above, which may have varying standard deviations, but for simplicity of exposition I assumed above that these were all comparable.]
Then what you should be optimizing for each state is its mean utility minus L times the standard deviation of its utility. [Of course, my earlier assumption that the standard deviations were all fairly similar also means this doesn’t have much effect—but the argument generalizes to cases where the standard deviations vary more widely, while also generally reducing the calculated value of L.]
Note that L can get largish—it’s fairly easy to have, say, millions of uncorrelated-in-relative-utility states, in which case L would be around 5, so then you’re optimizing for the mean minus five standard deviations, i.e. looking for a solution at a 5-sigma confidence level. [For the over-simplified normal-distribution assumption I gave above to still hold, this also requires that your normal distributions really are normal, not fat-tailed, all the way out to 1-in-several-million odds—which almost never happens: you’re putting an awful lot of weight on the central limit theorem here, and if your assumed-normal distribution is in fact only, say, the sum of 1,000,000 equally-weighted coin flips, it has already failed.] So in practice your agent needs to do more complicated extreme-quantile statistics without the normal-distribution assumption, likely involving examining something like a weighted ensemble of candidate Bayes nets for contributions to the relative-utility distribution—and in particular it needs to have an extremely good model of the lower end of the utility probability distribution for each state, i.e. to pay a lot of detailed attention to the question “is there even a really small chance that I could be significantly overestimating the utility of this particular state, relative to all the others I’m optimizing between?”
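To make the normal-assumption version concrete, here is a minimal Python sketch (my own illustrative code, with made-up candidate actions and utility estimates): it numerically computes L as the expected maximum of N independent standard-normal draws, then ranks candidates by mean minus L standard deviations.

```python
import math

def normal_pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def look_elsewhere_factor(n_independent: int, lo: float = -10.0, hi: float = 10.0,
                          steps: int = 20_000) -> float:
    """Expected maximum of n_independent i.i.d. standard normal draws:
    E[max] = integral of x * n * pdf(x) * cdf(x)^(n-1) dx, by trapezoidal integration."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * n_independent * normal_pdf(x) * normal_cdf(x) ** (n_independent - 1)
    return total * dx

def cautious_score(mean_utility: float, std_utility: float, L: float) -> float:
    """The penalized objective: mean utility minus L standard deviations."""
    return mean_utility - L * std_utility

# With about a million effectively independent candidates, L comes out a bit
# under 5 (roughly 4.9), i.e. the "5-sigma" level mentioned above.
L = look_elsewhere_factor(1_000_000)

# Entirely made-up (mean, std) utility estimates for three candidate actions:
candidates = [
    ("bring the usual latte",        1.0, 0.1),   # well understood
    ("try the new mocha unprompted", 1.3, 0.5),   # looks better, but uncertain
    ("redesign the kitchen",         2.0, 2.0),   # looks best, wildly uncertain
]
best = max(candidates, key=lambda c: cautious_score(c[1], c[2], L))
print(f"L = {L:.2f}, cautious choice: {best[0]}")
```

Under the penalty, the well-understood option wins even though its mean estimate is the lowest, which is the intended suppression of over-optimization.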
So the net effect of this is that, the larger the effective dimension of the space you’re optimizing over, the more you should prefer the states whose utility you’re most confident is good, so you tend to reduce the states you’re looking at to ones where your set of previous observations gave you very high confidence that their utility is high enough to be worth considering.
Now, as an exploration strategy, it might be interesting to also do a search optimizing, say, just the mean of the distributions, while also optimizing across states a bit further from the observed-state distribution (which that set of priors will automatically do for you, since their average is a lot less pessimistic than their fat-tailed downside), to see what state/action that suggests, but don’t actually try the action (not even with probability epsilon—the world is not a multi-armed bandit, and if you treat it like one, it has arms that can return the extinction of the human species): instead, consider spawning a subgoal of your “get better at fetching coffee” subgoal to cautiously investigate the uncertainties in that state’s true utility further. So, for example, you might ask the human “They also have mocha there, would you prefer that?” (on the hypothesis that the human didn’t know that, and would have asked for mocha if they had).
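As a toy illustration of “ask rather than try it” (a sketch with invented numbers, reusing the mean-minus-L-sigma score from above): a cheap clarifying question that collapses most of the uncertainty about the out-of-distribution option ends up scoring above both blindly trying it and ignoring it.

```python
def cautious_score(mean: float, std: float, L: float = 5.0) -> float:
    """Mean utility minus L standard deviations (the penalized objective above)."""
    return mean - L * std

# Invented numbers for a coffee-fetching agent weighing an unfamiliar flavor.
options = {
    "just try the mocha":     cautious_score(mean=1.3, std=0.8),   # tempting but very uncertain
    "fetch the usual coffee": cautious_score(mean=1.0, std=0.1),   # well understood
    # Asking "would you prefer mocha?" costs a moment, but the answer collapses
    # most of the uncertainty, so both the mean and the confidence improve.
    "ask about mocha first":  cautious_score(mean=1.25, std=0.1),
}
for name, score in sorted(options.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:+.2f}")
```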
This algorithm innately gives you behavior that looks a lot like the agent is modeling both a soft version of the Impact of its actions (don’t go far outside the previously-observed distribution, unless you’re somehow really, really sure it’s a good idea) and also like a quantilizer (don’t optimize too hard, with a bias towards the best-understood outcomes, again unless you’re really, really sure it’s a good idea). It also pretty much immediately attaches a “learn more about human preferences in this area” sub-goal to any sub-goal the agent already has, and thus forces the AI to cautiously apply goal-oriented approximate Solomonoff induction (i.e. science) to learning more about human preferences. So any sufficiently intelligent/rational AI agent is forced to become an alignment researcher and solve the alignment problem for you, preferably before fetching you coffee, and also to be extremely cautious until it has done so. Or, at the very least, to ask you, any time the coffee shop has a new flavor, whether you prefer it or want to try it.
[Note that this is what is sometimes called an “advanced” control strategy, i.e. it doesn’t really start working well until your agent is approaching GAI, and is capable of both reasoning and acting in a goal-oriented way about the world, and your desires, and how to find out more about them more safely than by just treating the world like a multi-armed bandit, and can instead act more like a rational Bayesian-thinking alignment researcher. So the full version of it has a distinct “you only get one try at making this work” element to it. Admittedly, you can safely fail repeatedly on the “I didn’t give it enough training data for its appropriate level of caution, so it shut itself down” side, as long as you respond to that by collecting a lot more training data rather than by making the next version’s priors less cautious—though maybe what you should actually do is ask it to publicly explain on TV or in a TED talk that it’s too dangerous to exist before it shuts down? However, elements of this, like the prior that doing something significantly out-of-observed-distribution in an environment that has already been heavily optimized is almost certain to be bad, or the rule that if you’re optimizing over many actions/states/theories of value you should be optimizing not the mean utility but a very cautious lower bound of it, can be used on much dumber systems. Something dumber, which isn’t a GAI and can’t actually cause the extinction of the human race (outside contexts containing nuclear weapons or bio-containment labs that it shouldn’t be put in), also doesn’t need priors that go quite that far negative: its reasonable low-end prior utility value is probably somewhere more in the region of “I set the building on fire and killed many humans”. That is still a huge negative value compared to the positive value of “I fetched a human coffee”, so it should still be very cautious, but its estimate of “just how implausible is it that I could actually be wrong about this?” is going to be as dumb as it is, so its judgment of when it can stop being cautious will be bad. So an agent actually needs to be pretty close to GAI for this to be workable.]
Nice comment!
The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn’t know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we get it to actually reason in that way.
On your actual proposals, talking just about “how the agent should reason” (and not how we actually get it to reason that way):
1) Yeah I really like this idea—it was the motivation for my work on inferring human preferences from the world state, which eventually turned into my dissertation. (The framing we used was that humans optimized the environment, but we also thought about the fact that humans were optimized to like the environment.) I still basically agree that this is a great way to learn about human preferences (particularly about what things humans prefer you not change), if somehow that ends up being the bottleneck.
2) I think you might be conflating a few different mechanisms here.
First, there’s the optimizer’s curse, where the estimated value of the apparently-best action will tend to be an overestimate of its actual value. As you note, one natural solution is to apply a correction based on an estimate of how large the overestimate is. For this to make a difference, your estimates of the overestimates have to differ across actions; I don’t have great ideas on how this should be done. (You mention having different standard deviations plus different numbers of statistically-independent variables, but it’s not clear where those come from.)
Second, there’s information value, where the agent should ask about utilities in states that it is uncertain about, rather than charging in blindly. You seem to be thinking of this as something we have to program into the AI system, but it actually emerges naturally from reward uncertainty by itself. See this paper for more details and examples—Appendix D also talks about the connection to impact regularization.
Third, there’s risk aversion, where you explicitly program the AI system to be conservative (instead of maximizing expected utility). I tend to think that in principle this shouldn’t be necessary and you can get the same benefits from other mechanisms, but maybe we’d want to do it anyway for safety margins. I don’t think it’s necessary for any of the other claims you’re making, except perhaps quantilization (but I don’t really see how any of these mechanisms lead to acting like a quantilizer except in a loose sense).
I agree, this is only a proposal for a solution to the outer alignment problem.
On the optimizer’s-curse, information-value, and risk-aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn’t going to live very long, and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).
Optimizing without allowing for the optimizer’s curse is also treating the universe like a multi-armed bandit, and not even with a mere probability epsilon of exploring: you’re running a cheap all-exploration strategy on your utility-uncertainty estimates, which will cause you to sequentially pull the handles on all of your overestimates until you discover the hard way that they were all just overestimates. This is not rational behavior for a powerful optimizer, at least in the presence of the possibility of a really bad outcome, so not doing it should be convergent, and we shouldn’t build a near-human AI that is still making that mistake.
Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1