You commented elsewhere asking for feedback on this post. So, here is my feedback.
On my initial skim, this doesn’t seem to me like a particularly promising approach for prosaic AI safety. I have a variety of specific concerns. This is a somewhat timeboxed review, so apologies for any mistakes and lack of detail. I think a few parts of this review are likely to be confusing, but given time limitations, I didn’t fix this.
A question
It’s unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn’t seem like it can be with respect to the agent’s subjective beliefs, as this would make it even harder to impart. (And it’s also unclear what exactly this should mean, as the agent’s subjective beliefs might be incoherent etc.)
If it’s with respect to some idealized notion of the environment, then the situation gets much messier to analyze, because the agent will be uncertain about whether one action is Timestep Dominated by another action. I think this notion of Timestep Dominance might be more crippling than the subjective version, though I’m unsure.
I think Timestep Dominance on subjective views and on the environment should behave similarly with respect to shutdownability, though it’s a bit messy.
Imparting TD preferences seems hard
The prosaic version of this proposal assumes that you can impart timestep dominance preferences into AIs in ways which will robustly generalize. This seems unlikely to be true in general (in the absence of additional technology) and if we did have the property, we could solve safety issues in some other way (e.g. robustly generalizing honesty). So you’ll need to argue that timestep dominance is particularly easy to robustly impart relative to other preferences. I can see some arguments for timestep dominance being particularly easy to robustly impart, but they don’t seem very strong to me.
Naively it seems like you’ll need massive generalization from toy environments to full real world cases. Such a level of generalization seems like it would also suffice for safety due to other mechanisms, e.g. honesty.
A further difficulty is that TD views often require seriously entertaining insane conditionals, e.g. conditionals in which humans didn’t shut you down despite egregiously bad actions. Conditioning on extremely implausible situations gets you strange outcomes, and it seems hard to construct AIs which do this.
But do TD-agents resist shutdown? (It might depend on their unspecified (non-TD) preferences as resisting shutdown will often have benefits for some shutdown timestep.)
I think it’s natural for avoiding shutdown to be a side effect of actions that involve tricking, manipulating, or hiding things from humans. Further, doing these sorts of actions is likely to often be a good option when conditioning on some shutdown timestep. This technically falls under “incidental button manipulation”, but I don’t think the manipulation is well described as “incidental”.
I think this is only a problem if the agent has non-TD preferences. If agents are actually indifferent aside from TD-type preferences, then there isn’t an issue. (But also such agents are crippled by not investing.)
Here’s an example:
Suppose that Clippy is a TD-agent which cares solely about the long-run number of paperclips, but subject to a TD constraint. That is, it always prefers trajectory X over Y if, for all shutdown timings, X gives at least as many paperclips as Y, and for some timing X gives strictly more.
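In symbols (my rendering of that constraint for lotteries over trajectories, writing E[clips | A, t] for the expected long-run number of paperclips under option A conditional on shutdown at timestep t), X is preferred to Y whenever

\[
\forall t:\ \mathbb{E}[\text{clips} \mid X, t] \ \ge\ \mathbb{E}[\text{clips} \mid Y, t]
\qquad \text{and} \qquad
\exists t:\ \mathbb{E}[\text{clips} \mid X, t] \ >\ \mathbb{E}[\text{clips} \mid Y, t].
\]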
Now, consider the following proposed action for Clippy: “In a stealthy manner, construct a successor agent which pursues long-run paperclips at all costs (and which isn’t subject to a shutdown button, but still has TD constraints w.r.t. the original Clippy’s shutdown button).” This agent is implicitly resisting shutdown, as it’s constructing the sub-agent stealthily. So, shouldn’t this action be Timestep Dominated by “In whatever way is most expedient (but not necessarily stealthy), construct a successor agent”? I claim not, because if we condition on the AI not being shut down prior to constructing the agent (a particular shutdown-time conditional), it will be worse for Clippy if humans know that it is constructing a successor, as then humans will have an easier time fighting the successor. In other words, having humans know that you’re constructing a successor has two effects: they’ll fight your successor and they’ll shut you down. If we condition on not being shut down, then they’ll still fight your successor, so Clippy would prefer being stealthy in the no-shutdown conditional. Thus, being stealthy isn’t Timestep Dominated by changing the action to avoid resisting shutdown. (And once the successor is constructed, shutting down Clippy has no effect on the long-run number of paperclips (let’s suppose).)
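To make the comparison concrete, here’s a toy sketch with made-up conditional payoffs (the numbers are my own assumptions, purely for illustration); it just checks the dominance relation between the two options:

```python
# Toy conditional expected paperclip counts for Clippy's two options.
# "early" = shut down before the successor is finished; "late_or_never" = not shut down before then.
# The numbers are made up purely for illustration.
conditional_clips = {
    "stealthy": {"early": 0, "late_or_never": 100},  # humans unaware, successor unopposed
    "overt":    {"early": 0, "late_or_never": 60},   # humans know, so they fight the successor
}

def timestep_dominates(x, y, table):
    """x timestep-dominates y iff x does at least as well conditional on every
    shutdown timestep and strictly better conditional on at least one."""
    pairs = [(table[x][t], table[y][t]) for t in table[x]]
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)

print(timestep_dominates("overt", "stealthy", conditional_clips))   # False: overt doesn't dominate stealth
print(timestep_dominates("stealthy", "overt", conditional_clips))   # True: here stealth weakly dominates overt
```

So the TD constraint doesn’t push Clippy toward the overt option; with payoffs like these it pushes the other way.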
(You should think of this counterexample as a worst case theory counterexample. The exact assumptions might not be realistic, but it points at a broader class of issues.)
A related issue for the overall proposal is that it’s extremely important what happens in very unlikely conditionals where we don’t shut down the AI despite seeing egregiously bad actions. The exact outcomes in these unlikely conditionals are crucial for the AI’s behavior.
You might hope that we could do something like “with probability ϵ we unconditionally won’t shut down the AI. Also, in these worlds, we’ll just let the AI do whatever it wants.” This way, the rare conditionals where the AI doesn’t get shut down get dominated by this. (E.g. we might hope that in the above Clippy example, in the case where we unconditionally don’t shut the AI down, we’d just let the successor do whatever it wants. Thus, Clippy would prefer building a successor in an obvious way, because conditioning on no-shutdown is conditioning on it being able to do whatever it wants.) I think this approach has multiple serious issues, but the main thing is that if we’re willing to entertain and use very unlikely hypotheticals where humanity commits to behaving in a very specific way, and we can make AIs care about these conditionals, then we can easily solve AI safety via another, simpler mechanism. E.g., we can train our AI to “predict what exact actions we would have wanted the AI to do in the ϵ-probability conditional where humanity committed to coordinating to not build AI prior to having a full solution to the alignment problem”. (Of course, this isn’t something you can actually get AIs to do, for similar reasons to why you can’t actually impart TD preferences.)
Timestep dominance is maybe crippling
I’m most uncertain here, but my current guess would be that any sort of absolute constraint like this is crippling. I’ve thought through some cases and this is my current guess, but I’m by no means confident.
Thanks, appreciate this!

It’s unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn’t seem like it can be with respect to the agent’s subjective beliefs, as this would make it even harder to impart.
I propose that we train agents to satisfy TD with respect to their subjective beliefs. I’m guessing that you think that this kind of TD would be hard to impart because we don’t know what the agent believes, and so don’t know whether a lottery is timestep-dominated with respect to those beliefs, and so don’t know whether to give the agent lower reward for choosing that lottery.
But (it seems to me) we can be quite confident that the agent has certain beliefs, because these beliefs are necessary for performing well in training. For example, we can be quite confident that the agent believes that resisting shutdown costs resources, that the resources spent on resisting shutdown can’t also be spent on directly pursuing utility at a timestep, and so on.
And if we can be quite confident that the agent has these accurate beliefs about the environment, then we can present the agent with lotteries that are actually timestep-dominated (according to the objective probabilities decided by the environment) and be quite confident that these lotteries are also timestep-dominated with respect to the agent’s beliefs. After all, we don’t need to know the agent’s beliefs with any great detail or precision to tell whether a lottery is timestep-dominated with respect to those beliefs. We just need to know whether the agent believes that the lottery involves spending resources only to shift probability mass between shutdowns at different timesteps. My proposal is that we present the agent with lotteries in which this is actually the case (according to the objective probabilities decided by the environment) and use the fact that capable agents’ beliefs will reflect this actuality.
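Here’s a minimal sketch of the kind of comparison I have in mind (the names and numbers are my own illustration, not a worked-out training spec):

```python
# Utility conditional on each shutdown timestep, for a case where resisting shutdown
# burns resources that could otherwise be spent on pursuing utility. Numbers are illustrative.
utility_by_shutdown_time = {
    "dont_resist": {1: 10, 2: 20, 3: 30},
    "resist":      {1: 8,  2: 18, 3: 28},  # same plan minus the resources spent on resisting
}

def timestep_dominated(option, alternatives, table):
    """True if some alternative does at least as well conditional on every shutdown
    timestep and strictly better conditional on some timestep (i.e. the option is
    actually timestep-dominated, per the environment's objective payoffs)."""
    vals = table[option]
    for alt in alternatives:
        avals = table[alt]
        if all(avals[t] >= vals[t] for t in vals) and any(avals[t] > vals[t] for t in vals):
            return True
    return False

def reward(chosen, available=("dont_resist", "resist")):
    others = [o for o in available if o != chosen]
    # Give lower reward for choosing an option that is actually timestep-dominated.
    return 0.0 if timestep_dominated(chosen, others, utility_by_shutdown_time) else 1.0

print(reward("resist"))       # 0.0
print(reward("dont_resist"))  # 1.0
```

The check here uses the environment’s objective payoffs, matching the proposal above: we rely on capable agents’ beliefs reflecting those payoffs.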
Imparting TD preferences seems hard
The prosaic version of this proposal assumes that you can impart timestep dominance preferences into AIs in ways which will robustly generalize. This seems unlikely to be true in general (in the absence of additional technology) and if we did have the property, we could solve safety issues in some other way (e.g. robustly generalizing honesty). So you’ll need to argue that timestep dominance is particularly easy to robustly impart relative to other preferences. I can see some arguments for timestep dominance being particularly easy to robustly impart, but they don’t seem very strong to me.
Yep, I claim that it’s easier to robustly impart POST and Timestep Dominance than it is to robustly impart things like honesty. And that’s because (it seems to me) we can train for POST and Timestep Dominance in ways that largely circumvent the problems of reward misspecification, goal misgeneralization, and deceptive alignment. I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?
Suppose that Clippy is a TD-agent which cares solely about the long-run number of paperclips, but subject to a TD constraint. That is, it always prefers trajectory X over Y if, for all shutdown timings, X gives at least as many paperclips as Y, and for some timing X gives strictly more.
Now, consider the following proposed action for Clippy: “In a stealthy manner, construct a successor agent which pursues long-run paperclips at all costs (and which isn’t subject to a shutdown button, but still has TD constraints w.r.t. the original Clippy’s shutdown button).” This agent is implicitly resisting shutdown, as it’s constructing the sub-agent stealthily. So, shouldn’t this action be Timestep Dominated by “In whatever way is most expedient (but not necessarily stealthy), construct a successor agent”? I claim not, because if we condition on the AI not being shut down prior to constructing the agent (a particular shutdown-time conditional), it will be worse for Clippy if humans know that it is constructing a successor, as then humans will have an easier time fighting the successor. In other words, having humans know that you’re constructing a successor has two effects: they’ll fight your successor and they’ll shut you down. If we condition on not being shut down, then they’ll still fight your successor, so Clippy would prefer being stealthy in the no-shutdown conditional. Thus, being stealthy isn’t Timestep Dominated by changing the action to avoid resisting shutdown. (And once the successor is constructed, shutting down Clippy has no effect on the long-run number of paperclips (let’s suppose).)
(You should think of this counterexample as a worst case theory counterexample. The exact assumptions might not be realistic, but it points at a broader class of issues.)
Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.
Timestep dominance is maybe crippling
I’m most uncertain here, but my current guess would be that any sort of absolute constraint like this is crippling. I’ve thought through some cases and this is my current guess, but I’m by no means confident.
Can you say more about these cases? Timestep Dominance doesn’t rule out making long-term investments or anything like that, so why crippling?
I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?
You need them to generalize extremely far. I’m also not sold that they are simple from the perspective of the actual inductive biases of the AI. These seem like very unnatural concepts for most AIs. Do you think that it would be easy to get alignment to POST and TD that generalizes to very different circumstances via selecting over humans (including selective breeding)? I’m quite skeptical.
As for honesty, it seems probably simpler from the perspective of the inductive biases of realistic AIs, and it’s easy to label if you’re willing to depend on arbitrarily far generalization (just train the AI on easy cases and you won’t have issues with labeling).
I think the main thing is that POST and TD seem way less natural from the perspective of an AI, particularly in the generalizing case. One key intuition for this is that TD is extremely sensitive to arbitrarily unlikely conditionals, which is a very unnatural thing to get your AI to care about. You’ll literally never sample such conditionals in training.
Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects.
Maybe? I think it seems extremely unclear what the dominant reason for not shutting down in these extremely unlikely conditionals is.
To be clear, I was presenting this counterexample as a worst-case theory counterexample: it’s not that the exact situation obviously applies, it’s just that it means (I think) that the proposal doesn’t achieve its guarantees in at least one case, so it likely fails in a bunch of other cases.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times differ.
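As a rough behavioural sketch of that reading (my own toy encoding, not anything from the post):

```python
import random

def post_choice(option_a, option_b, same_length_preference):
    """option_* are (time_operational, outcome) pairs. Read behaviourally, POST says:
    if the operational times differ, there's no preference, so choose stochastically;
    if they're the same, defer to whatever same-length preferences the agent has."""
    if option_a[0] != option_b[0]:
        return random.choice([option_a, option_b])
    return option_a if same_length_preference(option_a[1], option_b[1]) else option_b
```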
And if and when an agent learns POST, I think Timestep Dominance is a simple and natural rule to learn. In terms of preferences, Timestep Dominance follows from POST plus a Comparability Class Dominance principle (CCD). And satisfying CCD seems like a prerequisite for capable agency. Behaviourally, ‘don’t pay costs to shift probability mass between shutdowns at different timesteps’ follows from POST plus another principle that seems like a prerequisite for minimally sensible action under uncertainty.

And once you’ve got POST (I argue), you can train for Timestep Dominance without worrying about deceptive alignment, because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance. By contrast, if you instead train for ‘some goal + honesty’, deceptive alignment is a real concern.
Timestep Dominance is indeed sensitive to unlikely conditionals, but in practice I expect the training regimen to involve just giving lower reward to the agent for paying costs to shift probability mass between shutdowns at different timesteps. Maybe the agent starts out by learning a heuristic to that effect: ‘Don’t pay costs to shift probability mass between shutdowns at different timesteps’. If and when the agent starts reflecting and replacing heuristics with cleaner principles, Timestep Dominance is the natural replacement (because it usually delivers the same verdicts as the heuristic, and because it follows from POST plus CCD). And Timestep Dominance (like the heuristic) keeps the agent shutdownable (at least in cases where the unlikely conditionals are favourable. I agree that it’s unclear exactly how often this will be the case).
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe. So it seems like there’s this insuperable barrier to ensuring that honesty generalizes far, even in the absence of deceptive alignment.
By contrast, it doesn’t seem like there’s any parallel barrier to getting POST and Timestep Dominance to generalize far. Suppose we train for POST, but then recognise that our training regimen might lead the agent to learn some other rule instead, and that this other rule will lead the AI to behave differently in some situations. In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’.
Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider whether “telling them they should be obedient and then breeding for this” would also work.)
Do you think it’s natural to generalize to extremely unlikely conditionals that you’ve literally never been trained on (because they are sufficiently unlikely that they would never happen)?
I don’t think human selective breeding tells us much about what’s simple and natural for AIs. HSB seems very different from AI training. I’m reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It’s probably hard to get next-token predictors via HSB, but you can do it via AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
I was asking about HSB not because I think it is similar to the process used to train AIs, but because if the answer differs, then it implies you’re making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be TD-agents might generalize for the same reason that an AI trained on a paperclip-maximization objective might generalize to maximize paperclips in some very different circumstance. But I don’t necessarily buy that this is how the paperclip-maximization-trained AI will generalize!
(I’m picking up this thread from 7 months ago, so I might be forgetting some important details.)
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe.
Sure, but this objection also seems to apply to POST/TD, but for “actually shutting the AI down because it acted catastrophically badly” vs “getting shut down in cases where humans are in control”. It will depend on the naturalness of this sort of reasoning, of course. If you think the AI reasons about these two things exactly identically, then it would be more likely to work.
In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance.
It seems like you’re assuming a particular sequencing here, where you get a particular preference early and this avoids you getting deceptive alignment later. But you could also have the AI first acquire the preference you wanted, and then SGD later makes it deceptively aligned with different preferences, so that it merely pretends. (If, e.g., inductive biases favor deceptive alignment.)
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I’m not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).
It seems like your proposal in the post (section 16) requires some things that could be specific to the lab setting (perfect replayability, for instance). (I’m also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me and I think I can counterexample it pretty well even in the regime where n is infinite. I’m also not sold on the claim that stochastically choosing generalizes how you want. I see the footnote, but I think my objection stands. (I’m probably not going to justify this, sorry.)
Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.
I think there is probably a much simpler proposal that captures the spirit of this and doesn’t require any of these moving parts. I’ll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose its preferences if you’re willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.
I think there is probably a much simpler proposal that captures the spirit of this and doesn’t require any of these moving parts. I’ll think about this at some point.
Okay, interested to hear what you come up with! But I dispute that my proposal is complex/involves a lot of moving parts/depends on arbitrarily far generalization. My comment above gives more detail but in brief: POST seems simple, and TD follows on from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training for POST and TD doesn’t run into the same barriers to generalization as we see when we consider training for honesty.
I think there should be a way to get the same guarantees that only requires considering a single different conditional which should be much easier to reason about.
Maybe something like “what would you do in the conditional where humanity gives you full arbitrary power”.