However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.
Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?
Calibration seems like it does, in fact, significantly address regressional Goodhart. You can’t have seen a lot of instances of an estimate being too high, and still accept that too-high estimate. It doesn’t address extremal Goodhart, because calibrated learning can only guarantee that you eventually calibrate, or converge at some rate, or something like that; extreme values that you’ve rarely encountered would remain a concern.
In any case… I’m not exactly sure what you mean by “calibration”, but when I say “calibration”, I refer to “knowing what you know”. For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when I said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I’m reasonably “well-calibrated”; that is, I have a sense of what I do and don’t know.
A calibrated AI system, to me, is one that correctly says “this thing I’m looking at is an unusual thing I’ve never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high”.
Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven’t yet seen an FAI problem which seems like it can’t somehow be reduced to calibrated learning.
I’m not super hung up on statistical guarantees, as I haven’t yet seen a way to make them in general which doesn’t require making some sort of unreasonable or impractical assumption about the world (and I’m skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.
If there’s a true utility function which is assigned some weight, and we apply a whole lot of optimization pressure to the overall mixture distribution, then it is perfectly possible that the true utility function gets compromised for the sake of satisfying a large number of other possible utility functions.
If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.
To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can’t easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.
Why should we think that there’s a “true” utility function which captures our preferences? And, if there is, why should we assume that it has an explicit representation in the hypothesis space?
Technically, you don’t need this assumption. As I wrote in this comment: “it’s not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there’s some model in the ensemble that also makes that veto.”
(I haven’t read a lot about quantilization so I can’t say much about that. However, a superintelligent adversary seems like something to avoid.)
However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.
Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?
If there’s a systematic bias in the score, the thing with the highest score may not always be the best action to take. Calibrating the estimates may change the ranking of options.
For example, it could be that expected values above 0.99 are almost always significant overestimates, with an average true value of 0.5. A calibrated learner would observe this and systematically correct such items downwards. The new top choices would probably have values like 0.989 (if that’s the only correction applied).
This provides something of a guarantee that systematic Goodhart-type problems will eventually be recognized and corrected, to the extent that they occur.
A meta-rule like that, which corrects observed biases in the aggregate scores, isn’t easy to represent as a direct object-level hypothesis about the data. That’s why calibrated learning may not be Bayesian. And, without a calibration guarantee, you’d need some other argument as to why representing uncertainty helps to avoid Goodhart.
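To make the 0.99 example above concrete, here is a minimal sketch of how a learned correction can change which option ranks first. The plan names, the raw estimates, and the `recalibrate` function are all made up for illustration; a real calibrated learner would estimate the correction from observed outcomes rather than hard-code it.

```python
def recalibrate(estimate, threshold=0.99, observed_average=0.5):
    """Hypothetical learned correction: estimates above `threshold` have
    historically averaged `observed_average`, so map them down to that."""
    return observed_average if estimate > threshold else estimate


# Illustrative raw expected-value estimates (not from any real system).
raw_estimates = {"plan_a": 0.995, "plan_b": 0.989, "plan_c": 0.7}

best_raw = max(raw_estimates, key=raw_estimates.get)

calibrated = {name: recalibrate(value) for name, value in raw_estimates.items()}
best_calibrated = max(calibrated, key=calibrated.get)

print(best_raw, best_calibrated)
```

With these numbers the raw ranking prefers the 0.995 option, while the corrected ranking prefers the 0.989 option, which is the kind of reordering described above.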
However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.
Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?
As a very high-level, first-pass approximation, I think the right way to think of this is as a sort of unit test; even if we can’t directly see a reason why systematically incorrect estimates would cause problems in an AI design, this is an obvious enough desideratum that we should by default assume a system which breaks it is bad, unless we can prove otherwise.
Closer to the object level—yes, the highest-scoring action is the correct action to take, and if you model miscalibration as a single, monotonic function applied as the last step before deciding, then it can’t change any decisions. But if miscalibration can affect any intermediate steps, then this doesn’t hold. As a simple example: suppose the AI is deciding whether to pay to preserve its access to a category of options which it knows are highly subject to Regressional Goodhart.
If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.
To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can’t easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.
I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?
I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won’t this run into siren worlds and so on, by default?
Technically, you don’t need this assumption. As I wrote in this comment: “it’s not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there’s some model in the ensemble that also makes that veto.”
Yeah, it seems possible and interesting to formalize an argument like that.
(I haven’t read a lot about quantilization so I can’t say much about that. However, a superintelligent adversary seems like something to avoid.)
The “adversary” can be something like a mesa-optimizer arising from a search which the system runs in order to solve a problem. If you’ve got a rich enough hypothesis space (due to using a rich hypothesis space of world-models, or a rich set of possible human utility functions, etc.), then you’ll have some of those lurking in the hypothesis space. Reasoning in an appropriate way about the possibility, even if you manage to avoid mesa-optimizers in reality, could require game-theoretic reasoning.
OTOH, although quantilization can be justified by a story involving an actual adversary, that’s not necessarily the best way to think about what it is really doing. Robustness properties tend to involve some kind of universal quantifier over a bunch of possibilities. Maintaining a property under such a universal quantification is like adversarial game theory; you’re trying to do well no matter what strategy the other player uses. So, robustness properties tend to be conveniently described in adversarial terms. That’s basically what’s going on in the case of quantilization.
Similarly, “adversarial Goodhart” doesn’t have to be about superintelligent adversaries, in general. It can be about cases where we want stronger guarantees, and so, are willing to compromise some decision-theoretic optimality in return for better worst-case guarantees.
I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?
I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won’t this run into siren worlds and so on, by default?
The siren world scenario posits an AI that is “actually evil” and is an agent which makes plans to manipulate the user.
If the AI assigns decent credence to a utility function that assigns massive negative utility to “evil and unmitigated suffering”, that will cause its subjective expected utility estimate of the siren world to take a big hit. It would be better off implementing the exact same world, minus the evil and unmitigated suffering. The only way it would think that world was actually better with the evil and unmitigated suffering in it is if something went very wrong during the data-gathering process.
I also don’t think we should create an agent which makes plans to manipulate the user. The only question it should ever ask the user is the one that maximizes its subjective value of information.
The marketing world problem is very related to the discussion I had with Paul Christiano here. The problem is that the overseer has insufficient time to reflect on their true values. I don’t think there is any way of getting around this issue in general: Creating FAI is time-sensitive, which means we won’t have enough time to reflect on our true values to be 100% sure that all the input we give the AI is good. In addition to the things I mentioned in that discussion, I think we should:
Make a system that’s capable of changing its values “online” in response to our input. Corrigibility lets us procrastinate on moral philosophy.
Instead of trying to build eutopia right off the bat, build an “optimal ivory tower” for doing moral philosophy in. Essentially, implement coherent extrapolated volition in the real world.
Anyway, the reason the Goodhart-shaped concerns go away is because the thing that maximizes the mixture is likely to be something that is approved of by a diverse range of utility functions that are all semi-compatible with the input the user has provided. If there’s even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high. For a worked example, see “Smile maximization case study” in this essay.
As I said, I think Goodhart’s law is largely about distributional shift. My scheme incentivizes the AI to mostly take “on-distribution” plans: plans it is confident are good, because many different ways of looking at the data all point to them being good. “Off-distribution” plans will tend to benefit from clarification first: Some ways of extrapolating the data say they are good, others say they are bad, so VoI is high.
the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather
Thanks for bringing this up, I’ll think about it. Part of me wants to say “if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!” Or: “If the overseer isn’t capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!” But maybe there’s an elegant way to implement a more conservative design. (You could, for example, disallow the execution of any plan that the AI thought had at least a 5% chance of falling below some utility threshold. But that involves the use of two arbitrary parameters, which seems inelegant.)
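The parenthetical rule above (veto any plan with at least a 5% chance of falling below a utility threshold) can be sketched as follows, assuming the AI's uncertainty is represented as a finite weighted ensemble of utility hypotheses. The weights, plan scores, and function names are hypothetical, and the two arbitrary parameters show up as `utility_threshold` and `max_risk`.

```python
def risk_of_bad_outcome(plan_utilities, weights, utility_threshold):
    """Posterior mass on hypotheses that score the plan below the threshold."""
    return sum(w for u, w in zip(plan_utilities, weights) if u < utility_threshold)


def is_allowed(plan_utilities, weights, utility_threshold=0.5, max_risk=0.05):
    """Veto the plan if the chance of a below-threshold outcome reaches max_risk."""
    return risk_of_bad_outcome(plan_utilities, weights, utility_threshold) < max_risk


# Illustrative posterior weights over three utility hypotheses.
weights = [0.90, 0.06, 0.04]

# Utility of each plan under each hypothesis (made-up numbers).
plan_a = [0.9, 0.1, 0.8]  # 6% of the mass says this plan is bad
plan_b = [0.7, 0.7, 0.1]  # 4% of the mass says this plan is bad

print(is_allowed(plan_a, weights), is_allowed(plan_b, weights))
```

Note that `plan_a` has the higher expected utility of the two and would still be vetoed, which is the conservative behaviour the rule is meant to buy.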
I am a little frustrated with your reply (particularly the first half), but I’m not sure if you’re really missing my point (perhaps I’ll have to think of a different way of explaining it) vs addressing it, but not giving me enough of an argument for me to connect the dots. I’ll have to think more about some of your points.
Many of your statements seem true for moderately-intelligent systems of the sort you describe, but, don’t clearly hold up when a lot of optimization pressure is applied.
If there’s even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high.
The VOI incentive can’t be so strong that the AI is willing to pay arbitrarily high costs (commit the resources of the whole galaxy to investigating ever-finer details of human preferences, deconstruct each human atom by atom, etc...). So, at some point, it can be worthwhile to entirely compromise one somewhat-plausible u_i for the sake of others.
This would be untrue if, for example, the system maximized the weighted product (the weight w_i is used as an exponent of the hypothesis u_i). It would then actually never be worth it to entirely zero out one possible utility function for the sake of optimizing others. That proposal likely has its own issues, but I mention it just to make clear that I’m not bemoaning an inevitable fact of decision theory; there are alternatives.
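A small numerical sketch of the contrast between the weighted sum and the weighted product, under the added assumption that utilities are scaled to lie in [0, 1] (the product only behaves this way for nonnegative values). The two plans and the equal weights are invented for illustration.

```python
import math


def weighted_sum(utilities, weights):
    """Expected utility of a plan under the mixture of hypotheses."""
    return sum(w * u for u, w in zip(utilities, weights))


def weighted_product(utilities, weights):
    """Product of utilities, each raised to the power of its weight.
    Assumes every utility is nonnegative."""
    return math.prod(u ** w for u, w in zip(utilities, weights))


weights = [1 / 3, 1 / 3, 1 / 3]

plans = {
    "compromise": [0.6, 0.6, 0.6],      # decent under every hypothesis
    "sacrifice_u3": [1.0, 1.0, 0.0],    # maxes two hypotheses, zeroes the third
}

best_by_sum = max(plans, key=lambda name: weighted_sum(plans[name], weights))
best_by_product = max(plans, key=lambda name: weighted_product(plans[name], weights))

print(best_by_sum, best_by_product)
```

Under the sum, zeroing out the third hypothesis is worth it because the other two gain enough. Under the product, any plan that drives one hypothesis to zero scores zero overall, so it can never beat a plan that keeps every hypothesis above zero.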
As I said, I think Goodhart’s law is largely about distributional shift. My scheme incentivizes the AI to mostly take “on-distribution” plans: plans it is confident are good, because many different ways of looking at the data all point to them being good.
This is one of the assertions which seems generally true of moderately intelligent systems optimizing under value uncertainty, but doesn’t seem to hold up as a lot of optimization pressure is applied. Good plans will tend to be on-distribution, because that’s a good way to reap the gains of many different remaining hypotheses which agree for on-distribution things but disagree elsewhere. Why would the best plans tend to be on-distribution? Why wouldn’t they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?
Part of me wants to say “if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!” Or: “If the overseer isn’t capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!” But maybe there’s an elegant way to implement a more conservative design.
Yeah, that’s the direction I’m thinking in. By the way—I’m not even trying to say that maximizing subjective expected utility is actually the wrong thing to do (particularly if you’ve got calibration properties, or knows-what-it-knows properties, or some other learning-theoretic properties which we haven’t realized we want yet). I’m just saying that the case is not clear, and it seems like we’d want the case to be clear.
Why wouldn’t they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
I know I’m being a little fuzzy about realizability. Let’s consider how humans solve these problems. Suppose you had a pet alien, with alien values, which is capable of limited communication regarding its preferences. The goal of corrigibility is to formalize your good-faith efforts to take care of your alien to the best of your ability into an algorithm that a computer can follow. Suppose you think of some very unusual idea for taking care of your alien which, according to a few hypotheses you’ve come up with for what it likes, would make it extremely happy. If you were reasonably paranoid, you might address the issue of unrealized hypotheses on the spot, and attempt to craft a new hypothesis which is compatible with most/all of the data you’ve seen and also has your unusual idea inadvertently killing the alien. (This is a bit like “murphyjitsu” from CFAR.) If you aren’t able to generate such a hypothesis, but such a hypothesis does in fact exist, and is the correct hypothesis, and the alien dies after your idea… then you probably aren’t super smart.
I’m just saying that the case is not clear, and it seems like we’d want the case to be clear.
You have to start somewhere. Discussions like this can help make things clear :) I’m getting value from it… you’ve given me some things to think about, and I think the murphyjitsu idea is something I hadn’t thought of previously :)
I think it often makes sense to reason at an informal level before proceeding to a formal one.
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
That’s not the case I’m considering. I’m imagining there are hypotheses which strongly dislike the corner cases. They just happen to be out-voted.
Think of it like this. There are a bunch of hypotheses. All of them agree fairly closely with high probability on plans which are “on-distribution”, ie, similar to what it has been able to get feedback from humans about (however it does that). The variation is much higher for “off-distribution” plans.
There will be some on-distribution plans which achieve somewhat-high values for all hypotheses which have significant probability. However, the AI will look for ways to achieve even higher expected utility if possible. Unless there are on-distribution plans which max out utility, it may look off-distribution. This seems plausible because the space of on-distribution plans is “smaller”; there’s room for a lot to happen in the off-distribution space. That’s why it reaches weird corner cases.
And, since the variation is higher in off-distribution space, there may be some options that really look quite good, but which achieve very low value under some of the plausible hypotheses. In fact, because the different remaining hypotheses are different, it seems quite plausible that highly optimized plans have to start making trade-offs which compromise one value for another. (I admit it is possible the search finds a way to just make everything better according to every hypothesis. But that is not what the search is told to do, not exactly. We can design systems which do something more like that, instead, if that is what we want.)
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution. Of course, we could explicitly try to build a system with the goal of remaining on-distribution. Quantilization follows fairly directly from that :)
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution.
I realize I’m playing fast and loose with realizability again, but it seems to me that a system which is capable of being “calibrated”, in the sense I defined calibration above, should be able to reason for itself that it is less knowledgeable about off-distribution points and have some kind of prior belief that the score for any particular off-distribution point is equal to the mean score for the entire (off-distribution?) space, and it should need a fair amount of evidence to shift this prior. I’m not necessarily specifying how concretely to achieve this, just saying that it seems like a desideratum for a “calibrated” ML system in the sense that I’m using the term.
Maybe effects like this could be achieved partially through e.g. having different hypotheses be defined on different subsets of the input space, and always including a baseline hypothesis which is just equal to the mean of the entire space.
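One hypothetical way to picture "a prior equal to the mean score, which needs a fair amount of evidence to shift" is a simple shrinkage estimate. This is only a sketch: the function name, the `prior_strength` pseudo-count, and the numbers are assumptions, not a claim about how such a system would actually be built.

```python
def shrunk_score(observed_mean, n_observations, prior_mean, prior_strength=10.0):
    """Blend the observed score with the baseline mean score.

    `prior_strength` acts like a count of pseudo-observations at the baseline,
    so a point with little supporting evidence stays close to `prior_mean`.
    """
    total = n_observations + prior_strength
    return (n_observations * observed_mean + prior_strength * prior_mean) / total


baseline = 0.2  # assumed mean score over the whole space

# A plan that looks great but has no supporting evidence stays at the baseline.
print(shrunk_score(observed_mean=1.0, n_observations=0, prior_mean=baseline))

# The same apparent score moves the estimate only as evidence accumulates.
print(shrunk_score(observed_mean=1.0, n_observations=10, prior_mean=baseline))
print(shrunk_score(observed_mean=1.0, n_observations=1000, prior_mean=baseline))
```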
If you want a backup system that also attempts to flag & veto any action that looks off-distribution for the sake of redundancy, that’s fine by me too. I think some safety-critical software systems for e.g. space shuttles have been known to do this (do a computation in multiple different ways & aggregate them somehow to mitigate errors in any particular subsystem).
Quantilization follows fairly directly from that :)
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
The base distribution you take the top X% of is supposed to be related to the “on-distribution” distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer’s own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently “off-distribution” possibilities for there to be a concern. (I’m not saying these are entirely reasonable assumptions; I’m just saying that this is one way of thinking about quantilization.)
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
The base distribution quantilization samples from is about actions, or plans, or policies, or things like that—not about configurations of atoms.
So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
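As a rough sketch of the picture above: a quantilizer draws candidate actions from a base distribution, ranks them by estimated utility, and picks uniformly among the top fraction. The function below is an illustrative sample-based approximation with made-up names and a toy Gaussian base distribution, not a reference implementation.

```python
import random


def quantilize(base_sampler, utility, q, n_samples, rng):
    """Pick uniformly at random among the top q fraction of actions
    sampled from the base distribution, ranked by estimated utility."""
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    top_k = max(1, int(q * n_samples))
    return rng.choice(samples[:top_k])


def toy_base_sampler(rng):
    # Toy base distribution over one-dimensional "actions".
    return rng.gauss(0.0, 1.0)


def toy_utility(action):
    # Toy proxy utility that always prefers more extreme actions.
    return action


rng = random.Random(0)
chosen_action = quantilize(toy_base_sampler, toy_utility, q=0.05, n_samples=1000, rng=rng)
print(chosen_action)
```

Because the chosen action is always one of the base samples, it can only be as extreme as the base distribution makes likely, even though the proxy utility here rewards unbounded extremes. A plain maximizer over the whole action space would have no such limit.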
If I understand correctly, extremal Goodhart is essentially the same as distributional shift from the Concrete Problems in AI Safety paper.
I think that’s right. Perhaps there is a small distinction to be brought out, but basically, extremal Goodhart is distributional shift brought about by the fact that the AI is optimizing hard.
In any case… I’m not exactly sure what you mean by “calibration”, but when I say “calibration”, I refer to “knowing what you know”. For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when I said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I’m reasonably “well-calibrated”; that is, I have a sense of what I do and don’t know.
A calibrated AI system, to me, is one that correctly says “this thing I’m looking at is an unusual thing I’ve never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high”.
Here’s what I mean by calibration: there’s a function from the probability you give to the frequency observed (or from the expected value you give to the average value observed), and the function approaches a straight x=y line as you learn. That’s basically what you describe in the example of the online test. However, in ML, there’s a difference between knows-what-it-knows learning (KWIK learning) and calibrated learning. KWIK learning is more like what you describe in the second paragraph above. Calibrated learning is focused on the idea that a system should learn when it is systematically over/under confident, correcting such predictable biases. KWIK learning is more focused on not making claims when you have insufficient evidence to pinpoint the right answer.
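The "function from the probability you give to the frequency observed" can be sketched as an empirical calibration curve. The stated confidence levels below (0.6 for "a little confident", 0.99 for "extremely confident") and the outcomes are invented to mirror the quiz example; the quiz's actual numbers are not given.

```python
from collections import defaultdict


def calibration_curve(stated_probabilities, outcomes):
    """Map each stated probability to the observed frequency of being right."""
    grouped = defaultdict(list)
    for p, correct in zip(stated_probabilities, outcomes):
        grouped[p].append(correct)
    return {p: sum(results) / len(results) for p, results in sorted(grouped.items())}


def worst_gap(curve):
    """Largest distance between the curve and the x=y line."""
    return max(abs(p - freq) for p, freq in curve.items())


# Hypothetical quiz answers: (stated probability, 1 if the answer was right).
stated = [0.6, 0.6, 0.6, 0.99, 0.99, 0.99, 0.99]
right = [1, 1, 0, 1, 1, 1, 1]

curve = calibration_curve(stated, right)
print(curve, worst_gap(curve))
```

A calibrated learner in the sense above is one for which `worst_gap` shrinks toward zero as data accumulates. This says nothing about KWIK-style abstention, which is a separate property.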
Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven’t yet seen an FAI problem which seems like it can’t somehow be reduced to calibrated learning.
I don’t think the inner alignment problem, or the unintended optimization problem, reduce to calibrated learning (or KWIK learning). However, my reasons are somewhat complex. I think it is reasonable to try to make those reductions, so long as you grapple with the real issues.
I’m not super hung up on statistical guarantees, as I haven’t yet seen a way to make them in general which doesn’t require making some sort of unreasonable or impractical assumption about the world (and I’m skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.
Statistical guarantees are just a way to be able to say something with confidence. I agree that they’re often impractical, and therefore only a toy model of how things can work (at best). However, I’m not very sympathetic to attempts to solve the problem without actually-quite-strong arguments for alignment-relevant properties being made somehow. The question is, how?
Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?
If I understand correctly, extremal Goodhart is essentially the same as distributional shift from the Concrete Problems in AI Safety paper.
In any case… I’m not exactly sure what you mean by “calibration”, but when I say “calibration”, I refer to “knowing what you know”. For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I’m reasonably “well-calibrated”; that is, I have a sense of what I do and don’t know.
A calibrated AI system, to me, is one that correctly says “this thing I’m looking at is an unusual thing I’ve never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high”.
Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven’t yet seen an FAI problem which seems like it can’t somehow be reduced to calibrated learning.
I’m not super hung up on statistical guarantees, as I haven’t yet seen a way to make them in general which doesn’t require making some sort of unreasonable or impractical assumption about the world (and I’m skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.
If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.
To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can’t easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.
Technically, you don’t need this assumption. As I wrote in this comment: “it’s not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there’s some model in the ensemble that also makes that veto.”
(I haven’t read a lot about quantilization so I can’t say much about that. However, a superintelligent adversary seems like something to avoid.)
If there’s a systematic bias in the score, the thing with the highest score may not always be the best action to take. Calibrating the estimates may change the ranking of options.
For example, it could be that expected values above 0.99 are almost always significant overestimates, with an average true value of 0.5. A calibrated learner would observe this and systematically correct such items downwards. The new top choices would probably have values like 0.989 (if that’s the only correction applied).
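To make the ranking change concrete, here is a minimal sketch of a binned correction. The track record, the bin count, and the option names are made-up assumptions for illustration; this is not meant as a real calibrated-learning algorithm, just the smallest thing that shows the effect.

```python
from collections import defaultdict

N_BINS = 100  # assumed bin resolution; each bin covers 0.01 of the score range


def bin_of(estimate):
    """Index of the score bin an estimate in [0, 1] falls into."""
    return min(int(estimate * N_BINS), N_BINS - 1)


def fit_corrections(history):
    """Map each bin to the mean true value observed for estimates in that bin."""
    sums, counts = defaultdict(float), defaultdict(int)
    for estimate, observed in history:
        sums[bin_of(estimate)] += observed
        counts[bin_of(estimate)] += 1
    return {b: sums[b] / counts[b] for b in sums}


def calibrate(estimate, corrections):
    """Replace an estimate by its bin's observed mean; leave unseen bins alone."""
    return corrections.get(bin_of(estimate), estimate)


# Hypothetical track record: estimates above 0.99 averaged a true value of 0.5,
# while estimates just below 0.99 were roughly accurate.
history = [(0.995, 0.4), (0.992, 0.6), (0.999, 0.5), (0.981, 0.98), (0.985, 0.98)]
corrections = fit_corrections(history)

options = {"a": 0.997, "b": 0.989, "c": 0.7}
raw_best = max(options, key=options.get)
calibrated_best = max(options, key=lambda k: calibrate(options[k], corrections))
print(raw_best, calibrated_best)
```

The raw ranking picks option a; after correction, a drops to about 0.5 and b becomes the top choice. Unlike the example in the text, this sketch corrects every bin it has data for, so b's value is also nudged to its bin's observed mean.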
This provides something of a guarantee that systematic Goodhart-type problems will eventually be recognized and corrected, to the extent which they occur.
A meta-rule like that, which corrects observed biases in the aggregate scores, isn’t easy to represent as a direct object-level hypothesis about the data. That’s why calibrated learning may not be Bayesian. And, without a calibration guarantee, you’d need some other argument as to why representing uncertainty helps to avoid Goodhart.
At a very high level, as a first-pass approximation, I think the right way to think of this is as a sort of unit test; even if we can’t directly see a reason why systematically incorrect estimates would cause problems in an AI design, this is an obvious enough desideratum that we should by default assume a system which breaks it is bad, unless we can prove otherwise.
Closer to the object level—yes, the highest-scoring action is the correct action to take, and if you model miscalibration as a single, monotonic function applied as the last step before deciding, then it can’t change any decisions. But if miscalibration can affect any intermediate steps, then this doesn’t hold. As a simple example: suppose the AI is deciding whether to pay to preserve its access to a category of options which it knows are highly subject to Regressional Goodhart.
I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?
I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can’t be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won’t this run into siren worlds and so on, by default?
Yeah, it seems possible and interesting to formalize an argument like that.
The “adversary” can be something like a mesa-optimizer arising from a search which the system runs in order to solve a problem. If you’ve got a rich enough hypothesis space (due to using a rich hypothesis space of world-models, or a rich set of possible human utility functions, etc.), then you’ll have some of those lurking in the hypothesis space. Reasoning in an appropriate way about the possibility, even if you manage to avoid mesa-optimizers in reality, could require game-theoretic reasoning.
OTOH, although quantilization can be justified by a story involving an actual adversary, that’s not necessarily the best way to think about what it is really doing. Robustness properties tend to involve some kind of universal quantifier over a bunch of possibilities. Maintaining a property under such a universal quantification is like adversarial game theory; you’re trying to do well no matter what strategy the other player uses. So, robustness properties tend to be conveniently described in adversarial terms. That’s basically what’s going on in the case of quantilization.
Similarly, “adversarial Goodhart” doesn’t have to be about superintelligent adversaries, in general. It can be about cases where we want stronger guarantees, and so are willing to compromise some decision-theoretic optimality in return for better worst-case guarantees.
The siren world scenario posits an AI that is “actually evil” and is an agent which makes plans to manipulate the user.
If the AI assigns decent credence to a utility function that assigns massive negative utility to “evil and unmitigated suffering”, that will cause its subjective expected utility estimate of the siren world to take a big hit. It would be better off implementing the exact same world, minus the evil and unmitigated suffering. The only way it would think that world was actually better with the evil and unmitigated suffering in it is if something went very wrong during the data-gathering process.
I also don’t think we should create an agent which makes plans to manipulate the user. The only question it should ever ask the user is the one that maximizes its subjective value of information.
The marketing world problem is very related to the discussion I had with Paul Christiano here. The problem is that the overseer has insufficient time to reflect on their true values. I don’t think there is any way of getting around this issue in general: Creating FAI is time-sensitive, which means we won’t have enough time to reflect on our true values to be 100% sure that all the input we give the AI is good. In addition to the things I mentioned in that discussion, I think we should:
Make a system that’s capable of changing its values “online” in response to our input. Corrigibility lets us procrastinate on moral philosophy.
Instead of trying to build eutopia right off the bat, build an “optimal ivory tower” for doing moral philosophy in. Essentially, implement coherent extrapolated volition in the real world.
Anyway, the reason the Goodhart-shaped concerns go away is because the thing that maximizes the mixture is likely to be something that is approved of by a diverse range of utility functions that are all semi-compatible with the input the user has provided. If there’s even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high. For a worked example, see “Smile maximization case study” in this essay.
As I said, I think Goodhart’s law is largely about distributional shift. My scheme incentivizes the AI to mostly take “on-distribution” plans: plans it is confident are good, because many different ways of looking at the data all point to them being good. “Off-distribution” plans will tend to benefit from clarification first: Some ways of extrapolating the data say they are good, others say they are bad, so VoI is high.
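Here is a rough sketch of why disagreement between hypotheses translates into high VoI. It computes the value of perfect information (the expected gain from learning which utility function is true before choosing), which is an upper bound on what any single clarifying question could be worth. The two hypotheses, their credences, and the plan names are invented for the example.

```python
def expected_utility(plan, hypotheses):
    """hypotheses: list of (credence, {plan_name: utility}) pairs."""
    return sum(w * utils[plan] for w, utils in hypotheses)


def value_of_perfect_information(plans, hypotheses):
    """Expected gain from learning which hypothesis is true before choosing."""
    best_without_asking = max(expected_utility(p, hypotheses) for p in plans)
    best_after_asking = sum(w * max(utils[p] for p in plans) for w, utils in hypotheses)
    return best_after_asking - best_without_asking


# Two hypothetical utility functions that agree on the familiar plan
# and disagree sharply on the novel one.
hypotheses = [
    (0.5, {"familiar": 0.6, "novel": 1.0}),
    (0.5, {"familiar": 0.6, "novel": -1.0}),
]

voi_on = value_of_perfect_information(["familiar"], hypotheses)
voi_off = value_of_perfect_information(["familiar", "novel"], hypotheses)
print(voi_on, voi_off)
```

When only the plan that every hypothesis agrees on is available, asking is worth nothing; once the contested plan is on the table, asking becomes worth roughly 0.2 units of utility in this toy setup.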
Thanks for bringing this up, I’ll think about it. Part of me wants to say “if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!” Or: “If the overseer isn’t capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!” But maybe there’s an elegant way to implement a more conservative design. (You could, for example, disallow the execution of any plan to which the AI assigns at least a 5% chance of falling below some utility threshold. But that involves the use of two arbitrary parameters, which seems inelegant.)
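For what it’s worth, the conservative rule in the parenthetical is easy to sketch. The version below assumes a finite set of utility hypotheses with credences, and treats the “chance of being below the threshold” as the total credence on hypotheses that score the plan below it. The function names, plans, and numbers are all hypothetical.

```python
def risk_below(plan_utils, credences, threshold):
    """Total credence on hypotheses under which the plan scores below threshold."""
    return sum(w for u, w in zip(plan_utils, credences) if u < threshold)


def choose_conservatively(plans, credences, threshold=0.0, max_risk=0.05):
    """Pick the highest expected-utility plan among those passing the veto.

    plans: {name: [utility under each hypothesis]}. Returns None if all are vetoed.
    """
    allowed = {
        name: utils
        for name, utils in plans.items()
        if risk_below(utils, credences, threshold) < max_risk
    }
    if not allowed:
        return None
    return max(allowed, key=lambda n: sum(w * u for u, w in zip(allowed[n], credences)))


credences = [0.6, 0.3, 0.1]  # hypothetical credences over three utility functions
plans = {
    "bold": [10.0, 10.0, -5.0],  # best on average, but 10% chance of a bad outcome
    "modest": [2.0, 2.0, 1.0],
}
print(choose_conservatively(plans, credences))
```

The two arbitrary parameters show up directly as `threshold` and `max_risk`; loosening `max_risk` to 0.2 lets the bold plan through again.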
I am a little frustrated with your reply (particularly the first half), but I’m not sure if you’re really missing my point (perhaps I’ll have to think of a different way of explaining it) vs addressing it, but not giving me enough of an argument for me to connect the dots. I’ll have to think more about some of your points.
Many of your statements seem true for moderately-intelligent systems of the sort you describe, but, don’t clearly hold up when a lot of optimization pressure is applied.
The VOI incentive can’t be so strong that the AI is willing to pay arbitrarily high costs (commit the resources of the whole galaxy to investigating ever-finer details of human preferences, deconstruct each human atom by atom, etc...). So, at some point, it can be worthwhile to entirely compromise one somewhat-plausible u_i for the sake of others.
This would be untrue if, for example, the system maximized the weighted product (the weight w_i is used as an exponent of the hypothesis u_i). It would then actually never be worth it to entirely zero out one possible utility function for the sake of optimizing others. That proposal likely has its own issues, but I mention it just to make clear that I’m not bemoaning an inevitable fact of decision theory: there are alternatives.
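A small sketch of the contrast between the two aggregation rules. It assumes utilities have been normalized to be non-negative (otherwise the exponentiation isn’t meaningful), and the weights and plans are invented for illustration.

```python
import math


def weighted_sum(utils, weights):
    """Expected utility under the mixture: sum_i w_i * u_i."""
    return sum(w * u for u, w in zip(utils, weights))


def weighted_product(utils, weights):
    """Weighted geometric aggregate: prod_i u_i ** w_i (utilities must be >= 0)."""
    return math.prod(u ** w for u, w in zip(utils, weights))


weights = [0.45, 0.45, 0.10]  # hypothetical credences
plans = {
    "balanced": [0.6, 0.6, 0.6],
    "sacrifice": [1.0, 1.0, 0.0],  # zeroes out the third hypothesis
}

best_by_sum = max(plans, key=lambda p: weighted_sum(plans[p], weights))
best_by_product = max(plans, key=lambda p: weighted_product(plans[p], weights))
print(best_by_sum, best_by_product)
```

The weighted sum prefers the plan that sacrifices the low-credence hypothesis, since 0.9 beats 0.6. The weighted product scores that plan at zero no matter how small the sacrificed hypothesis’s weight is, so it prefers the balanced plan.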
This is one of the assertions which seems generally true of moderately intelligent systems optimizing under value uncertainty, but doesn’t seem to hold up as a lot of optimization pressure is applied. Good plans will tend to be on-distribution, because that’s a good way to reap the gains of many different remaining hypotheses which agree for on-distribution things but disagree elsewhere. Why would the best plans tend to be on-distribution? Why wouldn’t they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?
Yeah, that’s the direction I’m thinking in. By the way—I’m not even trying to say that maximizing subjective expected utility is actually the wrong thing to do (particularly if you’ve got calibration properties, or knows-what-it-knows properties, or some other learning-theoretic properties which we haven’t realized we want yet). I’m just saying that the case is not clear, and it seems like we’d want the case to be clear.
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
I know I’m being a little fuzzy about realizability. Let’s consider how humans solve these problems. Suppose you had a pet alien, with alien values, which is capable of limited communication regarding its preferences. The goal of corrigibility is to formalize your good-faith efforts to take care of your alien to the best of your ability into an algorithm that a computer can follow. Suppose you think of some very unusual idea for taking care of your alien which, according to a few hypotheses you’ve come up with for what it likes, would make it extremely happy. If you were reasonably paranoid, you might address the issue of unrealized hypotheses on the spot, and attempt to craft a new hypothesis which is compatible with most/all of the data you’ve seen and also has your unusual idea inadvertently killing the alien. (This is a bit like “murphyjitsu” from CFAR.) If you aren’t able to generate such a hypothesis, but such a hypothesis does in fact exist, and is the correct hypothesis, and the alien dies after your idea… then you probably aren’t super smart.
You have to start somewhere. Discussions like this can help make things clear :) I’m getting value from it… you’ve given me some things to think about, and I think the murphyjitsu idea is something I hadn’t thought of previously :)
I think it often makes sense to reason at an informal level before proceeding to a formal one.
Edit: related discussion here.
That’s not the case I’m considering. I’m imagining there are hypotheses which strongly dislike the corner cases. They just happen to be out-voted.
Think of it like this. There are a bunch of hypotheses. All of them agree fairly closely with high probability on plans which are “on-distribution”, ie, similar to what it has been able to get feedback from humans about (however it does that). The variation is much higher for “off-distribution” plans.
There will be some on-distribution plans which achieve somewhat-high values for all hypotheses which have significant probability. However, the AI will look for ways to achieve even higher expected utility if possible. Unless there are on-distribution plans which max out utility, it may look off-distribution. This seems plausible because the space of on-distribution plans is “smaller”; there’s room for a lot to happen in the off-distribution space. That’s why it reaches weird corner cases.
And, since the variation is higher in off-distribution space, there may be some options that really look quite good, but which achieve very low value under some of the plausible hypotheses. In fact, because the different remaining hypotheses are different, it seems quite plausible that highly optimized plans have to start making trade-offs which compromise one value for another. (I admit it is possible the search finds a way to just make everything better according to every hypothesis. But that is not what the search is told to do, not exactly. We can design systems which do something more like that, instead, if that is what we want.)
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution. Of course, we could explicitly try to build a system with the goal of remaining on-distribution. Quantilization follows fairly directly from that :)
I realize I’m playing fast and loose with realizability again, but it seems to me that a system which is capable of being “calibrated”, in the sense I defined calibration above, should be able to reason for itself that it is less knowledgeable about off-distribution points and have some kind of prior belief that the score for any particular off-distribution point is equal to the mean score for the entire (off-distribution?) space, and it should need a fair amount of evidence to shift this prior. I’m not necessarily specifying how concretely to achieve this, just saying that it seems like a desideratum for a “calibrated” ML system in the sense that I’m using the term.
Maybe effects like this could be achieved partially through e.g. having different hypotheses be defined on different subsets of the input space, and always including a baseline hypothesis which is just equal to the mean of the entire space.
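One very simple way to get this kind of behavior, offered only as an illustration, is a shrinkage estimate: treat the global mean as a prior worth some number of pseudo-observations, and blend it with whatever evidence exists near the point in question. The `prior_strength` parameter and the numbers below are assumptions of the sketch, not part of the proposal above.

```python
def shrunk_estimate(local_scores, global_mean, prior_strength=10.0):
    """Blend nearby observed scores with the global mean.

    prior_strength acts like a count of pseudo-observations at the global mean,
    so a point with little nearby evidence stays close to that mean.
    """
    n = len(local_scores)
    return (sum(local_scores) + prior_strength * global_mean) / (n + prior_strength)


global_mean = 0.2  # hypothetical mean score over the whole space

off_distribution = shrunk_estimate([0.95], global_mean)        # one lucky observation
on_distribution = shrunk_estimate([0.95] * 200, global_mean)   # lots of evidence
print(round(off_distribution, 3), round(on_distribution, 3))
```

A single impressive observation barely moves the off-distribution estimate away from the mean, while the same score backed by two hundred observations is taken almost at face value.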
If you want a backup system that also attempts to flag & veto any action that looks off-distribution for the sake of redundancy, that’s fine by me too. I think some safety-critical software systems for e.g. space shuttles have been known to do this (do a computation in multiple different ways & aggregate them somehow to mitigate errors in any particular subsystem).
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
The base distribution you take the top X% of is supposed to be related to the “on-distribution” distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer’s own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently “off-distribution” possibilities for there to be a concern. (I’m not saying these are entirely reasonable assumptions; I’m just saying that this is one way of thinking about quantilization.)
The base distribution quantilization samples from is about actions, or plans, or policies, or things like that—not about configurations of atoms.
So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
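A minimal sampling-based sketch of that picture: draw actions from the base distribution, rank them by the (possibly flawed) utility estimate, and pick uniformly at random from the top fraction. The uniform motor-command sampler and the toy utility are stand-ins invented for the example; a real base distribution would be something like a model of ordinary, on-distribution behavior.

```python
import random


def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Sample actions from the base distribution, keep the top q fraction by
    estimated utility, and return one of those uniformly at random."""
    rng = rng or random.Random()
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n_samples))]
    return rng.choice(top)


# Hypothetical setup: actions are motor commands 0..99 drawn uniformly at random,
# and the (possibly flawed) utility estimate simply prefers larger commands.
def random_motor_command(rng):
    return rng.randrange(100)


action = quantilize(random_motor_command, utility=lambda a: a, q=0.1,
                    rng=random.Random(0))
print(action)
```

Because every candidate is drawn from the base distribution, the chosen action can never be something the base distribution assigns zero probability to, however highly the utility estimate would have scored it.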
I think that’s right. Perhaps there is a small distinction to be brought out, but basically, extremal Goodhart is distributional shift brought about by the fact that the AI is optimizing hard.
Here’s what I mean by calibration: there’s a function from the probability you give to the frequency observed (or from the expected value you give to the average value observed), and the function approaches a straight x=y line as you learn. That’s basically what you describe in the example of the online test. However, in ML, there’s a difference between knows-what-it-knows learning (KWIK learning) and calibrated learning. KWIK learning is more like what you describe in the second paragraph above. Calibrated learning is focused on the idea that a system should learn when it is systematically over/under confident, correcting such predictable biases. KWIK learning is more focused on not making claims when you have insufficient evidence to pinpoint the right answer.
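As a rough illustration of the “function from the probability you give to the frequency observed”, the sketch below groups forecasts by stated probability and reports the hit rate in each group, plus the largest gap from the x=y line. The forecasts are made up, and rounding to one decimal place is an arbitrary binning choice.

```python
from collections import defaultdict


def calibration_curve(forecasts):
    """Map each stated probability (rounded to 0.1) to the observed frequency.

    forecasts: iterable of (stated_probability, outcome), outcome being 0 or 1.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for p, outcome in forecasts:
        key = round(p, 1)
        hits[key] += outcome
        totals[key] += 1
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def max_gap(curve):
    """Largest distance between the curve and the x=y line."""
    return max(abs(p - freq) for p, freq in curve.items())


# Hypothetical forecasts: well calibrated at 0.6, overconfident at 0.9.
forecasts = [(0.6, 1)] * 3 + [(0.6, 0)] * 2 + [(0.9, 1)] * 5 + [(0.9, 0)] * 5
curve = calibration_curve(forecasts)
print(curve, max_gap(curve))
```

A calibrated learner in the sense above is one for which this gap shrinks toward zero as data accumulates. Note that this says nothing about KWIK-style abstention: a system can have a perfect curve here while still making claims on inputs it has never seen.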
I don’t think the inner alignment problem, or the unintended optimization problem, reduce to calibrated learning (or KWIK learning). However, my reasons are somewhat complex. I think it is reasonable to try to make those reductions, so long as you grapple with the real issues.
Statistical guarantees are just a way to be able to say something with confidence. I agree that they’re often impractical, and therefore only a toy model of how things can work (at best). However, I’m not very sympathetic to attempts to solve the problem without actually-quite-strong arguments for alignment-relevant properties being made somehow. The question is, how?