Why wouldn’t they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
I know I’m being a little fuzzy about realizability. Let’s consider how humans solve these problems. Suppose you had a pet alien, with alien values, which is capable of limited communication regarding its preferences. The goal of corrigibility is to formalize your good-faith efforts to take care of your alien to the best of your ability into an algorithm that a computer can follow. Suppose you think of some very unusual idea for taking care of your alien which, according to a few hypotheses you’ve come up with for what it likes, would make it extremely happy. If you were reasonably paranoid, you might address the issue of unrealized hypotheses on the spot, and attempt to craft a new hypothesis which is compatible with most/all of the data you’ve seen and also has your unusual idea inadvertently killing the alien. (This is a bit like “murphyjitsu” from CFAR.) If you aren’t able to generate such a hypothesis, but such a hypothesis does in fact exist, and is the correct hypothesis, and the alien dies after your idea… then you probably aren’t super smart.
I’m just saying that the case is not clear, and it seems like we’d want the case to be clear.
You have to start somewhere. Discussions like this can help make things clear :) I’m getting value from it… you’ve given me some things to think about, and I think the murphyjitsu idea is something I hadn’t thought of previously :)
I think it often makes sense to reason at an informal level before proceeding to a formal one.
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.
That’s not the case I’m considering. I’m imagining there are hypotheses which strongly dislike the corner cases. They just happen to be out-voted.
Think of it like this. There are a bunch of hypotheses. All of them agree fairly closely with high probability on plans which are “on-distribution”, i.e., similar to the plans the AI has been able to get feedback from humans about (however it does that). The variation is much higher for “off-distribution” plans.
There will be some on-distribution plans which achieve somewhat-high values for all hypotheses which have significant probability. However, the AI will look for ways to achieve even higher expected utility if possible. Unless there are on-distribution plans which max out utility, it may look off-distribution. This seems plausible because the space of on-distribution plans is “smaller”; there’s room for a lot to happen in the off-distribution space. That’s why it reaches weird corner cases.
And, since the variation is higher in off-distribution space, there may be some options that really look quite good in expectation, but which achieve very low value under some of the plausible hypotheses. In fact, because the remaining hypotheses differ from one another, it seems quite plausible that highly optimized plans have to start making trade-offs which compromise one value for another. (I admit it is possible the search finds a way to just make everything better according to every hypothesis. But that is not what the search is told to do, not exactly. We can design systems which do something more like that, instead, if that is what we want.)
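To make the “out-voted” picture concrete, here is a toy sketch. Everything in it is an assumption made up for illustration: the three hypotheses, their weights, the plan names, and the scores. The worst-case rule at the end is just one stand-in for “make things better according to every hypothesis”, not something proposed in the comment above.

```python
# Toy illustration only; all names and numbers are invented for this example.

# Posterior weights over three hypothetical value hypotheses.
weights = {"h1": 0.4, "h2": 0.4, "h3": 0.2}

# Utility each hypothesis assigns to each plan. The hypotheses agree closely
# on the on-distribution plans and disagree sharply on the off-distribution one.
plans = {
    "on_dist_a": {"h1": 0.60, "h2": 0.62, "h3": 0.58},
    "on_dist_b": {"h1": 0.65, "h2": 0.63, "h3": 0.66},
    "off_dist_x": {"h1": 0.99, "h2": 0.98, "h3": 0.05},
}


def expected_utility(scores, weights):
    """Probability-weighted average of the hypotheses' scores."""
    return sum(weights[h] * scores[h] for h in weights)


def worst_case(scores):
    """Score under the most pessimistic hypothesis."""
    return min(scores.values())


# An expected-utility maximizer lets h1 and h2 out-vote h3.
best_by_expectation = max(plans, key=lambda p: expected_utility(plans[p], weights))

# A rule that requires every hypothesis to be reasonably happy picks differently.
best_by_worst_case = max(plans, key=lambda p: worst_case(plans[p]))

print(best_by_expectation, best_by_worst_case)
```

Here h3 strongly dislikes `off_dist_x` and has not been ruled out, yet the expected-utility search still selects that plan because the other two hypotheses carry more weight.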
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution. Of course, we could explicitly try to build a system with the goal of remaining on-distribution. Quantilization follows fairly directly from that :)
When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they’re off-distribution.
I realize I’m playing fast and loose with realizability again, but it seems to me that a system which is capable of being “calibrated”, in the sense I defined calibration above, should be able to reason for itself that it is less knowledgeable about off-distribution points. It should have some kind of prior belief that the score for any particular off-distribution point is equal to the mean score for the entire (off-distribution?) space, and it should need a fair amount of evidence to shift this prior. I’m not necessarily specifying how concretely to achieve this, just saying that it seems like a desideratum for a “calibrated” ML system in the sense that I’m using the term.
Maybe effects like this could be achieved partially through e.g. having different hypotheses be defined on different subsets of the input space, and always including a baseline hypothesis which is just equal to the mean of the entire space.
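One very rough way to picture the “prior equal to the mean” idea is simple shrinkage: blend a hypothesis’s predicted score with a baseline mean, trusting the prediction more as nearby evidence accumulates. This is only a sketch of one possible reading; `shrunk_score`, `n_nearby`, and `prior_strength` are hypothetical names, and the blending rule is an assumption rather than anything specified above.

```python
def shrunk_score(predicted, global_mean, n_nearby, prior_strength=5.0):
    """Pull a predicted score toward the global mean when evidence is thin.

    predicted      -- score a hypothesis assigns to the point
    global_mean    -- baseline score (the mean over the whole space)
    n_nearby       -- how many feedback examples resemble this point
    prior_strength -- pseudo-count: how much evidence it takes to move the prior
    """
    trust = n_nearby / (n_nearby + prior_strength)
    return trust * predicted + (1.0 - trust) * global_mean


# Off-distribution point: no similar feedback, so the estimate is the baseline.
print(shrunk_score(predicted=0.99, global_mean=0.5, n_nearby=0))

# Well-covered point: the prediction is mostly trusted.
print(shrunk_score(predicted=0.99, global_mean=0.5, n_nearby=95))
```

With zero nearby examples the attractive off-distribution score collapses to the baseline, so it no longer beats a well-supported on-distribution plan just because a few hypotheses happen to rate it highly.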
If you want a backup system that also attempts to flag & veto any action that looks off-distribution for the sake of redundancy, that’s fine by me too. I think some safety-critical software systems for e.g. space shuttles have been known to do this (do a computation in multiple different ways & aggregate them somehow to mitigate errors in any particular subsystem).
Quantilization follows fairly directly from that :)
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
My current understanding of quantilization is “choose randomly from the top X% of actions”. I don’t see how this helps very much with staying on-distribution… as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.
The base distribution you take the top X% of is supposed to be related to the “on-distribution” distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer’s own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently “off-distribution” possibilities for there to be a concern. (I’m not saying these are entirely reasonable assumptions; I’m just saying that this is one way of thinking about quantilization.)
In any case, quantilization seems like it shouldn’t work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth’s atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren’t very valuable.
The base distribution quantilization samples from is about actions, or plans, or policies, or things like that—not about configurations of atoms.
So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
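For concreteness, here is a minimal sketch of a quantilizer in that sense: draw candidate actions from a base distribution, rank them by the utility estimate, and pick uniformly at random from the top fraction. The function name, its parameters, and the Gaussian “motor command” base distribution are all assumptions for the sake of the example.

```python
import random


def quantilize(sample_base, utility, q=0.1, n_samples=1000, rng=None):
    """Pick uniformly from the top q fraction of actions drawn from a base distribution.

    sample_base -- function taking an RNG and returning one action from the base distribution
    utility     -- function scoring an action (the possibly-flawed proxy)
    q           -- fraction of sampled actions to keep
    """
    if rng is None:
        rng = random.Random()
    candidates = [sample_base(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_k = max(1, int(q * n_samples))
    return rng.choice(candidates[:top_k])


# Hypothetical base distribution: small random motor commands centered on zero.
def random_motor_command(rng):
    return rng.gauss(0.0, 1.0)


# Hypothetical proxy utility that rewards ever-larger commands without limit.
def proxy_utility(action):
    return action


action = quantilize(random_motor_command, proxy_utility, q=0.1, n_samples=1000,
                    rng=random.Random(0))
print(action)
```

An unconstrained maximizer of `proxy_utility` would push the command as large as it can; the quantilizer can only return something the base distribution actually produced, which is why the choice of base distribution carries so much of the weight in this argument.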
Edit: related discussion here.