I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn’t make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user’s beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).
More formally, we consider (some extension of) the delegative IRL setting (i.e. there is a single set of input/output channels, control of which can be toggled between the user and the AI by the AI). Let $\pi^u_\upsilon$ be the user’s policy in universe $\upsilon$ and $\pi^a$ the AI’s policy. Let $T$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let $V_\upsilon$ be the value of a state from the user’s subjective POV in universe $\upsilon$, and let $\mu_\upsilon$ be the environment in universe $\upsilon$. Finally, let $\zeta$ be the AI’s prior over universes and $\epsilon$ some sufficiently small bound. We require
$$\forall T:\ \mathbb{E}_{\upsilon \sim \zeta}\left[\left(\mathbb{E}^{\pi^u_\upsilon}_{\mu_\upsilon}[V_\upsilon(T)] - \mathbb{E}^{\pi^a}_{\mu_\upsilon}[V_\upsilon(T)]\right)_+\right] \le \epsilon$$
Here, $V_\upsilon(T)$ designates the value after the event $T$ happens, and $(x)_+$ is defined to be $0$ for $x<0$ and $x$ otherwise.
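As a concreteness check, here is a minimal Monte Carlo sketch of this constraint, assuming (hypothetically) that we can sample universes from $\zeta$ and roll out either policy to the termination event; the function names are placeholders introduced for illustration, not part of any existing formalism.

```python
def hippocratic_gap(sample_universe, rollout_value, user_policy, ai_policy,
                    n_universes=1000, n_rollouts=32):
    """Estimate E_{v ~ zeta}[ (E^{pi_u}[V_v(T)] - E^{pi_a}[V_v(T)])_+ ].

    sample_universe(): draws a universe v from the AI's prior zeta.
    rollout_value(universe, policy): one rollout of `policy` in `universe`,
        returning the user's subjective value V_v(T) at the termination event T.
    Both are hypothetical stand-ins for whatever the real setting provides.
    """
    total = 0.0
    for _ in range(n_universes):
        universe = sample_universe()
        # Approximate the two inner expectations by averaging rollouts.
        v_user = sum(rollout_value(universe, user_policy) for _ in range(n_rollouts)) / n_rollouts
        v_ai = sum(rollout_value(universe, ai_policy) for _ in range(n_rollouts)) / n_rollouts
        total += max(v_user - v_ai, 0.0)  # the positive part (x)_+
    return total / n_universes

# The Hippocratic principle then requires hippocratic_gap(...) <= epsilon.
```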
I now outline an example of a Hippocratic protocol, which I call Hippocratic timeline-driven learning (HTDL).
Suppose, for simplicity, that the goal is accomplishing some binary task (i.e. we can either succeed or fail). We can imagine the task to be something like “construct a defense system against malign AI”. At any given moment, the user reports their subjective timeline: i.e., for any $t$, the probabilities of success and failure after time $t$, assuming the user continues working on the task by themselves. We assume the AI can learn a good model of the user’s policy and of the user’s nominal report, but this nominal report might deviate from the true timeline because the system can be corrupted (i.e. the user manipulated into entering false information, or the input channel hacked). Moreover, we assume that such corruption counts as failing the task.
The AI operates by maximizing the AI-expectation of the user’s (unknown) utility function $\mathbb{E}_{\upsilon \sim \zeta}\left[\mathbb{E}^{\pi}_{\mu_\upsilon}[U_\upsilon]\right]$, subject to the Hippocratic principle as a constraint. Since we’re working in a delegative setting, the AI can trivially achieve the constraint by always delegating. However, it can do much better than that, by short-term quantilizing w.r.t. the user’s reported success probability (with the user’s policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user’s reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation. Then, after completing this short-term maneuver, the AI can do another short-term quantilization, and continue in this vein. Essentially, the AI is gradually bending the user’s trajectory towards success.
Of course, we can improve the guarantees of such a protocol either by adding more information or by using stronger assumptions (e.g. assume the user is quasi-rational). But timeline-driven short-term quantilization serves as a useful feasibility proof.
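To make the quantilization step concrete, here is a rough sketch in code. The interfaces (user_policy_sampler, predicted_report) are hypothetical, and best-of-k sampling is used as a crude stand-in for a quantilizer with fraction roughly 1/k; this is not the formal protocol, just an illustration of the short-term maneuver.

```python
def short_term_quantilize(user_policy_sampler, predicted_report, k=20):
    """One short-term maneuver: sample k short-term plans from (a model of) the
    user's policy and pick the one with the best predicted end-of-maneuver report.

    user_policy_sampler(): samples a short-term plan the user might plausibly
        execute (the baseline distribution).
    predicted_report(plan): the AI's prediction of the success probability the
        user will report at the end of the maneuver if this plan is executed.

    If the baseline probability of corruption during the maneuver is eps, this
    procedure raises it to at most k * eps (union bound), which stays small as
    long as the user does not expect imminent failure.
    """
    candidates = [user_policy_sampler() for _ in range(k)]
    return max(candidates, key=predicted_report)
```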
(Update: I don’t think this was 100% right, see here for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she’ll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H’s own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H’s perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself “gee you dunce you’re gonna regret that”, and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on.
But we don’t want to do that naive thing. There are two problems:
The first problem is “traps” (a.k.a. catastrophes). Let’s say action 0 is Press The History Eraser Button. H never takes that action. The AI shouldn’t either. What happens is: AI has no idea (wide confidence interval) about what the consequence of action 0 would be, so it doesn’t take it. This is the delegative RL thing—in the explore/exploit dilemma, the AI kinda sits back and lets H do the exploring, at least when a trap might be involved.
But here we want to go further. Let’s say that there’s some action which H might do, where the AI can predict the consequence pretty well, and the consequence is “H will eventually feel more optimistic about success in the task”. Then the AI can go do that action, right? Not necessarily! The action might still be bad, because it involved corrupting the feedback channel.
So let’s say “action 789 = H will eat a pot brownie”. Taking that action makes H both much more optimistic about success and much less likely to actually succeed in the task. (It makes her assessments inaccurate.)
If the AI has no idea what pot brownies do, it won’t take action 789 because of the previous thing—maybe that action is a trap. But what if the AI has (indirectly) figured out, with 100% confidence, exactly what the consequence of action 789 would be?
In that case, since the AI’s decision criterion is all about H’s future reported odds of success, a naive algorithm would have the AI take action 789. To avoid that, we have this quantilization thing. How does that work?
Well, let’s say that H right now assigns 99% chance of eventual success. Therefore—assuming that H is well-calibrated and knows the consequences of action 789—it follows that H’s policy is to take that action with ≤1% probability. Basically, H is optimistic about success in the task, partly because H knows that she’s almost definitely not going to eat that pot brownie. I mean, she might eat it, if she gets really bored and frustrated and hungry. But probably not! So she’s feeling very optimistic.
OK, now the AI quantilizes by drawing 5 random samples from (its understanding of) H’s policy, and taking the action that (it believes) will lead to the highest future-optimism-about-success. Assume its beliefs are accurate. Compared to just doing what H would have done, this has downsides and upsides. The downside is, since action 789 scores really well by the AI’s decision criterion, the probability of action 789 is no longer ≤1%, but rather ≤5%. The upside is, if we’re not taking action 789, i.e. the other ≥95% of the probability mass, we’re probably doing something quite a bit better than what H would have done.
These could plausibly balance out such that the expected probability of success strictly increases, compared to what H would have done.
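A toy simulation of this trade-off, under made-up numbers (a 1% chance of the brownie action, accurate reports for uncorrupted actions); it is only meant to illustrate how the expected success probability can strictly increase while the corruption probability stays bounded.

```python
import random

def simulate(num_trials=100_000, k=5):
    """Toy check of the best-of-5 quantilization trade-off described above.
    All numbers here are invented for illustration."""
    def sample_action():
        # H's policy: 1% chance of the corrupting action ("789"), otherwise an
        # ordinary action whose true success chance is uniform in [0.5, 1.0].
        if random.random() < 0.01:
            return ("brownie", 0.0, 1.0)  # (name, true success prob, reported optimism)
        p = random.uniform(0.5, 1.0)
        return ("ordinary", p, p)          # reports are accurate when uncorrupted

    baseline, quantilized, corrupt = 0.0, 0.0, 0.0
    for _ in range(num_trials):
        baseline += sample_action()[1]                 # H acts on her own
        picks = [sample_action() for _ in range(k)]    # AI: best-of-k by reported optimism
        best = max(picks, key=lambda a: a[2])
        quantilized += best[1]
        corrupt += (best[0] == "brownie")
    print(f"baseline success    ~{baseline / num_trials:.3f}")
    print(f"quantilized success ~{quantilized / num_trials:.3f}")
    print(f"corruption rate     ~{corrupt / num_trials:.3f}  (union bound: <= {k}%)")

simulate()
```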
Please correct me if I’m misunderstanding anything.
This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means that without the AI’s help, H’s probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but doesn’t expect to fail soon. Therefore, the AI can safely consider policies of the form “in the short term, do something H would do with marginal probability; in the long term, go back to H’s policy”. If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it’s possible that in the new prognosis H still doesn’t expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.
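Putting the pieces together, the iteration might look roughly like the following loop. All the interfaces here are hypothetical placeholders, and the real protocol is defined by constrained expected-utility maximization rather than by this literal loop.

```python
def timeline_driven_loop(get_prognosis, quantilize_maneuver, delegate, execute,
                         task_done, fail_soon_threshold=0.05):
    """Iterated short-term maneuvers (a sketch, with invented interfaces).

    get_prognosis(): H's current report of the probability of failing within the
        upcoming short-term window.
    quantilize_maneuver(): a short-term plan chosen by quantilizing against H's policy.
    delegate(): let H act for the upcoming window.
    execute(plan): carry out the chosen short-term plan, then return control.
    """
    while not task_done():
        if get_prognosis() > fail_soon_threshold:
            delegate()                       # corruption risk too high to bound: let H act
        else:
            execute(quantilize_maneuver())   # bounded-risk improvement over H's baseline
```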
The Hippocratic principle seems similar to my concept of non-obstruction (https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility), but subjective from the human’s beliefs instead of the AI’s.
Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. By “continuously” I mean that we are allowed to compare keeping the AI on versus turning it off at any given moment.
“Corrigibility” is usually defined as the property of AIs who don’t resist modifications by their designers. Why would we want to perform such modifications? Mainly it’s because we made errors in the initial implementation, and in particular the initial implementation is not aligned. But, this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?
In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user’s policy and preferences (be it through timelines, revealed preferences or anything else).
Importantly, this information can be incomplete, which can motivate altering the agent along the way. And, the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable, and the alteration is beneficial, the AI cannot reach such a conclusion.
For example, consider an HTDL agent getting upgraded to “Hippocratic CIRL” (assuming some sophisticated model of the relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn’t deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.
I think the people most interested in corrigibility are imagining a situation where we know what we’re doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don’t even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we “figure out alignment.”
Maybe this is a strawman, because the thing I’m describing doesn’t make strategic sense, but I think it does have some model of why we might end up with something unaligned but corrigible (for at least a short period).
The concept of corrigibility was introduced by MIRI, and I don’t think that’s their motivation? On my model of MIRI’s model, we won’t have time to poke at a slightly subhuman AI; we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is “we won’t know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure”. Which, sure, but I don’t see what it has to do with corrigibility.
Corrigibility is neither necessary nor sufficient for safety. It’s not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it’s not sufficient since an AI can be “corrigible” but cause catastrophic harm before someone notices and fixes it.
What we’re supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don’t say along which dimensions or how big the margin is. If it’s infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there’s no reason to talk about the former.
Very interesting—I’m sad I saw this 6 months late.
After thinking a bit, I’m still not sure if I want this desideratum. It seems to require a sort of monotonicity, where we can get superhuman performance just by going through states that humans recognize as good, and not by going through states that humans would think are weird or scary or unevaluable.
One case where this might come up is in competitive games. Chess AI beats humans in part because it makes moves that many humans evaluate as bad, but are actually good. But maybe this example actually supports your proposal—it seems entirely plausible to make a chess engine that only makes moves that some given population of humans recognize as good, but is better than any human from that population.
On the other hand, the humans might be wrong about the reason the move is good, so that the game is made of a bunch of moves that seem good to humans, but where the humans are actually wrong about why they’re good (from the human perspective, this looks like regularly having “happy surprises”). We might hope that such human misevaluations are rare enough that quantilization would lead to moves on average being well-evaluated by humans, but for chess I think that might be false! Computers are so much better than humans at chess that a very large chunk of the best moves according to both humans and the computer will be ones that humans misevaluate.
Maybe that’s more a criticism of quantilizers, not a criticism of this desideratum. So maybe the chess example supports this being a good thing to want? But let me keep critiquing quantilizers then :P
If what a powerful AI thinks is best (by an exponential amount) is to turn off the stars until the universe is colder, but humans think it’s scary and ban the AI from doing scary things, the AI will still try to turn off the stars in one of the edge-case ways that humans wouldn’t find scary. And if we think being manipulated like that is bad and quantilize over actions to make the optimization milder, turning off the stars is still so important that a big chunk of the best moves according to both humans and the computer are going to be ones that humans misevaluate, and the computer knows will lead to a “happy surprise” of turning off the stars not being scary. Quantilization avoids policies that precisely exploit tiny features of the world, and it avoids off-distribution behavior, but it still lets the AI get what it wants if it totally outsmarts the humans.
The other thing this makes me think of is Lagrange multipliers. I bet there’s a duality between applying this constraint to the optimization process, and adding a bias (I mean, a useful prior) to the AI’s process for modeling U.
When I’m deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?
One potentially useful model is: I’m good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn’t account for situations in which I seem to be bad at evaluating. Moreover, if the AI’s prior takes an intentional stance towards the user (in order to help it learn their preferences), then the user must be regarded as good at searching.
A better model is: I’m good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an “oracle”. This allows the AI to have additional information which is purely “logical”. We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle’s outputs entangled with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it’s a good move, even without having the skill to understand why the move is good in any more detail.
Now let’s examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user’s. On the other hand, this play is not as good as what the AI would achieve if it just optimized for winning at chess without any constraints. So, our AI might not be competitive with an unconstrained unaligned AI. But, this might be good enough.
I’m not sure what you’re saying in the “turning off the stars example”. If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.
Agree with the first section, though I would like to register my sentiment that although “good at selecting but missing logical facts” is a better model, it’s still not one I’d want an AI to use when inferring my values.
I think my point is that if “turn off the stars” is not a primitive action, but is a set of states of the world that the AI would overwhelmingly like to get to, then the actual primitive actions will get evaluated based on how well they end up leading to that goal state. And since the AI is better at evaluating than us, we’re probably going there.
Another way of looking at this claim is that I’m telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.
I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn’t deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user’s evaluation at the end of that short-term interval.
Ah. I indeed misunderstood, thanks :) I’d read “short-term quantilization” as quantilizing over short-term policies evaluated according to their expected utility. My story doesn’t make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).
I don’t understand what you mean here by quantilizing. The meaning I know is to take a random action from the top α fraction of actions under a given base distribution. But I don’t see a distribution here, or even a clear ordering over actions (given that we don’t have access to the utility function).
I’m probably missing something obvious, but more details would really help.
The distribution is the user’s policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report) at the end of the “maneuver”. More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it; for example, I did it for MDPs.
Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that we’re choosing in expectation an action that doesn’t have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?
I also wonder if using the user policy to sample actions isn’t limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?
Yes, that’s the general idea, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ϵ and your quantilization fraction is ϕ, then the AI’s probability of corruption is bounded by ϵ/ϕ.
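For reference, a sketch of the standard argument behind a bound of this form: a quantilizer with fraction ϕ never assigns more than 1/ϕ times the base probability to any outcome, so for the corruption event $C$,

$$\Pr_{\pi^a}[C] \;=\; \mathbb{E}_{\pi^a}[\mathbf{1}_C] \;\le\; \frac{1}{\phi}\,\mathbb{E}_{\pi^u}[\mathbf{1}_C] \;=\; \frac{\epsilon}{\phi}.$$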
As for using the user policy to sample actions: obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn’t specify the AI by quantilization; I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
I like this because it’s simple and obviously correct. Also I can see at least one way you could implement it:
a. Suppose the AI is ‘shadowing’ a human worker doing a critical task. Say it is ‘shadowing’ a human physician.
b. Each time the AI observes the same patient as the physician, it regresses [data from the patient] against [predicted decision a ‘good’ physician would make, predicted outcome for the ‘good’ decision]. Once the physician makes a decision and communicates it, the AI regresses [decision the physician made] against [predicted outcome for that decision].
c. The machine must also output a confidence estimate, or this won’t work.
With large numbers of observations, and given outright errors made by the physician, it’s then possible to detect the cases where the [decision the physician made] has a substantially worse predicted outcome than the [predicted decision a ‘good’ physician would make]. When the AI has high confidence in this [which requires many observations of similar situations], it’s time to call for a second opinion.
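A minimal sketch of the flagging rule described above; the class, field, and threshold names are invented for illustration, and in practice the confidence estimates would come from whatever regression models the AI maintains.

```python
from dataclasses import dataclass

@dataclass
class OutcomePrediction:
    mean: float    # predicted outcome (higher = better for the patient)
    stderr: float  # model's uncertainty about that prediction

def should_request_second_opinion(good_physician_pred: OutcomePrediction,
                                  actual_decision_pred: OutcomePrediction,
                                  min_margin: float = 0.1,
                                  max_uncertainty: float = 0.05) -> bool:
    """Flag the case when the predicted 'good physician' decision looks
    substantially better than the decision actually made, and the model is
    confident in both predictions (i.e. it has seen many similar cases)."""
    confident = (good_physician_pred.stderr < max_uncertainty
                 and actual_decision_pred.stderr < max_uncertainty)
    substantially_worse = (good_physician_pred.mean
                           - actual_decision_pred.mean) > min_margin
    return confident and substantially_worse
```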
In the long run, of course, there will be a point where the [predicted decision a ‘good’ physician would make] is better than the [information gain from a second human opinion], and you really would do best by firing the physician and having the AI make the decisions from then on, trusting it to call for a second opinion when it is not confident.
(As an example, AlphaGo Zero likely doesn’t benefit from asking another master Go player for a ‘second opinion’ when it sees the player it is advising make a bad call.)