If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.
To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can’t easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.
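To make the net-tightening picture concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the three candidate utility functions, the actions, and the pairwise "do you prefer x to y?" question format are made up, and hard elimination stands in for a proper Bayesian update. It only shows the stopping rule described above: keep asking until the surviving hypotheses all recommend the same action.

```python
# Toy illustration only: hypotheses, actions, and questions are hypothetical.
HYPOTHESES = {
    "u_a": {"tea": 1.0, "coffee": 0.2, "nothing": 0.0},
    "u_b": {"tea": 0.9, "coffee": 0.3, "nothing": 0.0},
    "u_c": {"tea": 0.1, "coffee": 1.0, "nothing": 0.0},
}


def best_action(u):
    """Action that a single utility function ranks highest."""
    return max(u, key=u.get)


def prefers(u, question):
    """How an overseer with utility u answers 'do you prefer x to y?'."""
    x, y = question
    return u[x] > u[y]


def narrow(hypotheses, true_u, questions):
    """Ask clarifying questions until the survivors agree on what to do."""
    remaining = dict(hypotheses)
    asked = []
    for q in questions:
        if len({best_action(u) for u in remaining.values()}) == 1:
            break  # all remaining hypotheses recommend the same action
        if len({prefers(u, q) for u in remaining.values()}) == 1:
            continue  # this question cannot distinguish the survivors
        reply = prefers(true_u, q)
        remaining = {k: u for k, u in remaining.items() if prefers(u, q) == reply}
        asked.append(q)
    return remaining, asked


remaining, asked = narrow(
    HYPOTHESES,
    HYPOTHESES["u_b"],  # pretend this is the overseer's true utility function
    [("tea", "nothing"), ("tea", "coffee")],
)
print(sorted(remaining), asked)
```

In this toy run the first question is skipped because every hypothesis answers it the same way, and the second one removes `u_c`. The two survivors are still different functions, but they agree on the action of near-term interest, so questioning stops.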
I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?
I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won’t this run into siren worlds and so on, by default?
Technically, you don’t need this assumption. As I wrote in this comment: “it’s not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there’s some model in the ensemble that also makes that veto.”
Yeah, it seems possible and interesting to formalize an argument like that.
(I haven’t read a lot about quantilization so I can’t say much about that. However, a superintelligent adversary seems like something to avoid.)
The “adversary” can be something like a mesa-optimizer arising from a search which the system runs in order to solve a problem. If you’ve got a rich enough hypothesis space (due to using a rich hypothesis space of world-models, or a rich set of possible human utility functions, etc.), then you’ll have some of those lurking in the hypothesis space. Reasoning in an appropriate way about that possibility, even if you manage to avoid mesa-optimizers in reality, could require game-theoretic reasoning.
OTOH, although quantilization can be justified by a story involving an actual adversary, that’s not necessarily the best way to think about what it is really doing. Robustness properties tend to involve some kind of universal quantifier over a bunch of possibilities. Maintaining a property under such a universal quantification is like adversarial game theory; you’re trying to do well no matter what strategy the other player uses. So, robustness properties tend to be conveniently described in adversarial terms. That’s basically what’s going on in the case of quantilization.
Similarly, “adversarial Goodhart” doesn’t have to be about superintelligent adversaries, in general. It can be about cases where we want stronger guarantees, and so are willing to compromise some decision-theoretic optimality in return for better worst-case guarantees.
I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?
I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won’t this run into siren worlds and so on, by default?
The siren world scenario posits an AI that is “actually evil” and is an agent which makes plans to manipulate the user.
If the AI assigns decent credence to a utility function that assigns massive negative utility to “evil and unmitigated suffering”, that will cause its subjective expected utility estimate of the siren world to take a big hit. It would be better off implementing the exact same world, minus the evil and unmitigated suffering. The only way it would think that world was actually better with the evil and unmitigated suffering in it is if something went very wrong during the data-gathering process.
I also don’t think we should create an agent which makes plans to manipulate the user. The only question it should ever ask the user is the one that maximizes its subjective value of information.
The marketing world problem is very related to the discussion I had with Paul Christiano here. The problem is that the overseer has insufficient time to reflect on their true values. I don’t think there is any way of getting around this issue in general: Creating FAI is time-sensitive, which means we won’t have enough time to reflect on our true values to be 100% sure that all the input we give the AI is good. In addition to the things I mentioned in that discussion, I think we should:
Make a system that’s capable of changing its values “online” in response to our input. Corrigibility lets us procrastinate on moral philosophy.
Instead of trying to build eutopia right off the bat, build an “optimal ivory tower” for doing moral philosophy in. Essentially, implement coherent extrapolated volition in the real world.
Anyway, the reason the Goodhart-shaped concerns go away is because the thing that maximizes the mixture is likely to be something that is approved of by a diverse range of utility functions that are all semi-compatible with the input the user has provided. If there’s even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high. For a worked example, see “Smile maximization case study” in this essay.
As I said, I think Goodhart’s law is largely about distributional shift. My scheme incentivizes the AI to mostly take “on-distribution” plans: plans it is confident are good, because many different ways of looking at the data all point to them being good. “Off-distribution” plans will tend to benefit from clarification first: Some ways of extrapolating the data say they are good, others say they are bad, so VoI is high.
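A rough sketch of the value-of-information claim, under assumptions that are mine rather than part of the scheme: a finite set of weighted hypotheses, a finite set of plans, and a clarifying question that reveals the true hypothesis perfectly. The names `WEIGHTS`, `UTILS`, and the plan labels are hypothetical.

```python
def expected_utility(plan, weights, utils):
    """Subjective expected utility of a plan under the mixture."""
    return sum(w * utils[h][plan] for h, w in weights.items())


def value_of_information(weights, utils, plans):
    """Expected gain from learning the true hypothesis before choosing a plan."""
    act_now = max(expected_utility(p, weights, utils) for p in plans)
    ask_first = sum(
        w * max(utils[h][p] for p in plans) for h, w in weights.items()
    )
    return ask_first - act_now


WEIGHTS = {"h1": 0.5, "h2": 0.4, "h3": 0.1}
PLANS = ["familiar", "unusual"]

# All hypotheses agree on the familiar plan; h3 strongly disapproves of the
# unusual one.
UTILS = {
    "h1": {"familiar": 1.0, "unusual": 3.0},
    "h2": {"familiar": 1.0, "unusual": 3.0},
    "h3": {"familiar": 1.0, "unusual": -10.0},
}

voi_with_unusual = value_of_information(WEIGHTS, UTILS, PLANS)
voi_familiar_only = value_of_information(WEIGHTS, UTILS, ["familiar"])
print(voi_with_unusual, voi_familiar_only)
```

With only the familiar plan on the table, asking gains nothing. Once the unusual plan is available, the single dissenting hypothesis makes clarification worth roughly 1.1 units of utility, so the AI asks first whenever a question costs less than that. The sketch says nothing about what happens when questions are more expensive than the VoI or cannot separate the hypotheses at all.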
the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather
Thanks for bringing this up, I’ll think about it. Part of me wants to say “if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!” Or: “If the overseer isn’t capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!” But maybe there’s an elegant way to implement a more conservative design. (You could, for example, disallow the execution of any plan to which the AI assigns at least a 5% chance of falling below some utility threshold. But that involves two arbitrary parameters, the probability cutoff and the utility threshold, which seems inelegant.)
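The conservative rule in the parenthetical can be sketched in a few lines. This is only an illustration: the function name `allowed`, the hypothesis labels, and the numbers are hypothetical, and the two arguments `threshold` and `max_risk` are exactly the two arbitrary parameters mentioned above.

```python
def allowed(plan_utils, weights, threshold, max_risk=0.05):
    """Permit a plan only if the credence that it falls below `threshold`
    is under `max_risk`.

    plan_utils maps each hypothesis to the plan's utility under it;
    weights maps each hypothesis to the AI's credence in it.
    """
    risk = sum(w for h, w in weights.items() if plan_utils[h] < threshold)
    return risk < max_risk


unusual_plan = {"h1": 3.0, "h2": 3.0, "h3": -10.0}

# A 10% credence in the dissenting hypothesis blocks the plan...
blocked = allowed(unusual_plan, {"h1": 0.5, "h2": 0.4, "h3": 0.1}, threshold=0.0)
# ...but a 4% credence does not, which shows how much hangs on the cutoff.
permitted = allowed(unusual_plan, {"h1": 0.5, "h2": 0.46, "h3": 0.04}, threshold=0.0)
print(blocked, permitted)
```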
I am a little frustrated with your reply (particularly the first half), but I’m not sure if you’re really missing my point (perhaps I’ll have to think of a different way of explaining it) vs addressing it, but not giving me enough of an argument for me to connect the dots. I’ll have to think more about some of your points.
Many of your statements seem true for moderately-intelligent systems of the sort you describe, but, don’t clearly hold up when a lot of optimization pressure is applied.
If there’s even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high.
The VOI incentive can’t be so strong that the AI is willing to pay arbitrarily high costs (commit the resources of the whole galaxy to investigating ever-finer details of human preferences, deconstruct each human atom by atom, etc...). So, at some point, it can be worthwhile to entirely compromise one somewhat-plausible $u_i$ for the sake of others.
This would be untrue if, for example, the system maximized the weighted product (the weight $w_i$ is used as an exponent of the hypothesis $u_i$). It would then actually never be worth it to entirely zero out one possible utility function for the sake of optimizing others. That proposal likely has its own issues, but I mention it just to make clear that I’m not bemoaning an inevitable fact of decision theory; there are alternatives.
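A small numerical sketch of the contrast, assuming utilities are normalized to be non-negative so that the product is well defined (that normalization is my assumption, not something specified above). The weighted sum is the usual expected utility, sum of w_i * u_i; the weighted product is the product of u_i ** w_i. The plan names and numbers are hypothetical.

```python
import math


def weighted_sum(utils, weights):
    """Expected utility under the mixture: sum_i w_i * u_i."""
    return sum(w * u for u, w in zip(utils, weights))


def weighted_product(utils, weights):
    """Weighted product: prod_i u_i ** w_i (assumes every u_i >= 0)."""
    return math.prod(u ** w for u, w in zip(utils, weights))


weights = [0.5, 0.4, 0.1]
balanced = [1.0, 1.0, 1.0]    # moderately good under every hypothesis
sacrifice = [3.0, 3.0, 0.0]   # great under two, zeroes out the third

print(weighted_sum(balanced, weights), weighted_sum(sacrifice, weights))
print(weighted_product(balanced, weights), weighted_product(sacrifice, weights))
```

Under the sum, the plan that zeroes out the third hypothesis wins. Under the product, any plan that drives a positively weighted u_i to zero scores zero overall, so it loses to every plan that keeps all hypotheses above zero.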
As I said, I think Goodhart’s law is largely about distributional shift. My scheme incentivizes the AI to mostly take “on-distribution” plans: plans it is confident are good, because many different ways of looking at the data all point to them being good.
This is one of the assertions which seems generally true of moderately intelligent systems optimizing under value uncertainty, but doesn’t seem to hold up as a lot of optimization pressure is applied. Good plans will tend to be on-distribution, because that’s a good way to reap the gains of many different remaining hypotheses which agree for on-distribution things but disagree elsewhere. Why would the best plans tend to be on-distribution? Why wouldn’t they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?
Part of me wants to say “if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!” Or: “If the overseer isn’t capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!” But maybe there’s an elegant way to implement a more conservative design.
Yeah, that’s the direction I’m thinking in. By the way—I’m not even trying to say that maximizing subjective expected utility is actually the wrong thing to do (particularly if you’ve got calibration properties, or knows-what-it-knows properties, or some other learning-theoretic properties which we haven’t realized we want yet). I’m just saying that the case is not clear, and it seems like we’d want the case to be clear.
Why wouldn’t they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
I know I’m being a little fuzzy about realizability. Let’s consider how humans solve these problems. Suppose you had a pet alien, with alien values, which is capable of limited communication regarding its preferences. The goal of corrigibility is to formalize your good-faith efforts to take care of your alien to the best of your ability into an algorithm that a computer can follow. Suppose you think of some very unusual idea for taking care of your alien which, according to a few hypotheses you’ve come up with for what it likes, would make it extremely happy. If you were reasonably paranoid, you might address the issue of unrealized hypotheses on the spot, and attempt to craft a new hypothesis which is compatible with most/all of the data you’ve seen and also has your unusual idea inadvertently killing the alien. (This is a bit like “murphyjitsu” from CFAR.) If you aren’t able to generate such a hypothesis, but such a hypothesis does in fact exist, and is the correct hypothesis, and the alien dies after your idea… then you probably aren’t super smart.
I’m just saying that the case is not clear, and it seems like we’d want the case to be clear.
You have to start somewhere. Discussions like this can help make things clear :) I’m getting value from it… you’ve given me some things to think about, and I think the murphyjitsu idea is something I hadn’t thought of previously :)
I think it often makes sense to reason at an informal level before proceeding to a formal one.
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
That’s not the case I’m considering. I’m imagining there are hypotheses which strongly dislike the corner cases. They just happen to be out-voted.
Think of it like this. There are a bunch of hypotheses. All of them agree fairly closely with high probability on plans which are “on-distribution”, i.e., similar to what the AI has been able to get feedback from humans about (however it does that). The variation is much higher for “off-distribution” plans.
There will be some on-distribution plans which achieve somewhat-high values for all hypotheses which have significant probability. However, the AI will look for ways to achieve even higher expected utility if possible. Unless there are on-distribution plans which max out utility, it may look off-distribution. This seems plausible because the space of on-distribution plans is “smaller”; there’s room for a lot to happen in the off-distribution space. That’s why it reaches weird corner cases.
And, since the variation is higher in off-distribution space, there may be some options that really look quite good, but which achieve very low value under some of the plausible hypotheses. In fact, because the different remaining hypotheses are different, it seems quite plausible that highly optimized plans have to start making trade-offs which compromise one value for another. (I admit it is possible the search finds a way to just make everything better according to every hypothesis. But that is not what the search is told to do, not exactly. We can design systems which do something more like that, instead, if that is what we want.)
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution. Of course, we could explicitly try to build a system with the goal of remaining on-distribution. Quantilization follows fairly directly from that :)
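Here is a toy version of the out-voting argument, with everything made up for illustration: plans are a single number x, the on-distribution region is assumed to be 0 <= x <= 1, and the three hypotheses and their weights are hypothetical. All three agree exactly on-distribution and diverge outside it.

```python
WEIGHTS = [0.45, 0.45, 0.10]


def u1(x):
    """Keeps rising at the same rate off-distribution."""
    return x


def u2(x):
    """Agrees with u1 on [0, 1], likes off-distribution plans even more."""
    return x if x <= 1.0 else 1.0 + 2.0 * (x - 1.0)


def u3(x):
    """Agrees with u1 on [0, 1], strongly dislikes anything beyond it."""
    return x if x <= 1.0 else 1.0 - 5.0 * (x - 1.0)


HYPOTHESES = [u1, u2, u3]


def mixture(x):
    """Expected utility of plan x under the weighted hypotheses."""
    return sum(w * u(x) for w, u in zip(WEIGHTS, HYPOTHESES))


on_distribution = [i / 10 for i in range(11)]   # plans in [0, 1]
all_plans = [i / 10 for i in range(31)]         # plans in [0, 3]

best_on = max(on_distribution, key=mixture)
best_overall = max(all_plans, key=mixture)
print(best_on, mixture(best_on))
print(best_overall, mixture(best_overall), u3(best_overall))
```

The unrestricted search goes to the edge of the off-distribution region because the two majority hypotheses outweigh the dissenting one, even though `u3` is still in the mixture and rates the chosen plan far below anything on-distribution. Whether a real system behaves this way depends on how the hypotheses and weights actually look; the sketch only shows that the mixture by itself does not rule it out.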
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution.
I realize I’m playing fast and loose with realizability again, but it seems to me that a system which is capable of being “calibrated”, in the sense I defined calibration above, should be able to reason for itself that it is less knowledgeable about off-distribution points. It should have some kind of prior belief that the score for any particular off-distribution point is equal to the mean score for the entire (off-distribution?) space, and it should need a fair amount of evidence to shift this prior. I’m not necessarily specifying how concretely to achieve this, just saying that it seems like a desideratum for a “calibrated” ML system in the sense that I’m using the term.
Maybe effects like this could be achieved partially through e.g. having different hypotheses be defined on different subsets of the input space, and always including a baseline hypothesis which is just equal to the mean of the entire space.
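One very rough way to sketch the baseline-hypothesis idea, purely as an assumption about what such a design might look like: each hypothesis carries a `covers` predicate saying where it is defined, and a baseline that always predicts the mean observed score is mixed in everywhere. The function `make_estimator` and all the numbers are hypothetical.

```python
from statistics import mean


def make_estimator(observed_scores, local_hypotheses, baseline_weight=1.0):
    """Build a score estimator from locally defined hypotheses plus a baseline.

    local_hypotheses is a list of (weight, covers, score) triples, where
    covers(x) says whether the hypothesis is defined at x and score(x) is
    its estimate there.
    """
    baseline = mean(observed_scores)  # baseline hypothesis: the mean everywhere

    def estimate(x):
        total = baseline_weight * baseline
        norm = baseline_weight
        for weight, covers, score in local_hypotheses:
            if covers(x):  # hypotheses abstain outside their own domain
                total += weight * score(x)
                norm += weight
        return total / norm

    return estimate


estimate = make_estimator(
    observed_scores=[1.0, 2.0, 3.0],
    local_hypotheses=[(3.0, lambda x: 0.0 <= x <= 1.0, lambda x: 10.0 * x)],
)
print(estimate(0.5), estimate(100.0))
```

Inside the region where the local hypothesis is defined, the estimate is pulled toward its prediction. Far outside, only the baseline speaks, so the estimate falls back to the mean instead of extrapolating the local hypothesis to an extreme value.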
If you want a backup system that also attempts to flag & veto any action that looks off-distribution for the sake of redundancy, that’s fine by me too. I think some safety-critical software systems for e.g. space shuttles have been known to do this (do a computation in multiple different ways & aggregate them somehow to mitigate errors in any particular subsystem).
Quantilization follows fairly directly from that :)
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
The base distribution you take the top X% of is supposed to be related to the “on-distribution” distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer’s own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently “off-distribution” possibilities for there to be a concern. (I’m not saying these are entirely reasonable assumptions; I’m just saying that this is one way of thinking about quantilization.)
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
The base distribution quantilization samples from is about actions, or plans, or policies, or things like that—not about configurations of atoms.
So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
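A minimal sketch of a sampling-based quantilizer, to make the role of the base distribution explicit. The Gaussian base distribution over a one-dimensional "motor command" and the proxy utility are stand-ins I chose for illustration, and drawing a finite number of samples is only an approximation to taking the top fraction of the base distribution.

```python
import random


def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Sample actions from the base distribution, then pick uniformly at
    random among the top q fraction as ranked by the utility estimate."""
    if rng is None:
        rng = random.Random()
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n_samples))]
    return rng.choice(top)


def typical_command(rng):
    """Hypothetical base distribution: ordinary motor commands near zero."""
    return rng.gauss(0.0, 1.0)


def proxy_utility(action):
    """Hypothetical proxy that always prefers larger commands."""
    return action


action = quantilize(typical_command, proxy_utility, q=0.1, rng=random.Random(0))
print(action)
```

An unrestricted maximizer of this proxy would push the command as high as it possibly could. The quantilizer can only return something the base distribution would plausibly have produced, which is the sense in which it stays close to on-distribution behavior, provided the base distribution itself is mostly on-distribution.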
Edit: related discussion here.