I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?
You need them to generalize extremely far. I'm also not sold that they are simple from the perspective of the actual inductive biases of the AI; these seem like very unnatural concepts for most AIs. Do you think it would be easy to get alignment to POST and TD that generalizes to very different circumstances via selecting over humans (including selective breeding)? I'm quite skeptical.
As for honesty: it seems probably simpler from the perspective of the inductive biases of realistic AIs, and it's easy to label if you're willing to depend on arbitrarily far generalization (just train the AI on easy cases and you won't have issues with labeling).
I think the main thing is that POST and TD seem way less natural from the perspective of an AI, particularly in the generalizing case. One key intuition for this is that TD is extremely sensitive to arbitrarily unlikely conditionals, which is a very unnatural thing to get your AI to care about. You'll literally never sample such conditionals in training.
Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects.
Maybe? It seems extremely unclear to me what the dominant reason for not shutting down in these extremely unlikely conditionals would be.
To be clear, I was presenting this counterexample as a worst-case-theory counterexample: it's not that the exact situation obviously applies, it's just that it means (I think) that the proposal doesn't achieve its guarantees in at least one case, so it likely fails in a bunch of other cases.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times differ.
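To make the behavioural description concrete, here's a minimal sketch of that choice rule (the representation of outcomes as an 'operational time' plus a 'utility' is purely illustrative, not a claim about how a trained agent would actually represent things):

```python
import random
from collections import defaultdict

def post_choose(options):
    """options: a list of dicts with 'operational_time' and 'utility' keys.
    Both keys are hypothetical stand-ins, used only to illustrate the rule."""
    # Group available outcomes by how long the agent remains operational.
    by_time = defaultdict(list)
    for opt in options:
        by_time[opt["operational_time"]].append(opt)

    # Within a group of same-length trajectories, ordinary preferences apply:
    # take the most preferred option in each group.
    best_per_time = [max(group, key=lambda o: o["utility"])
                     for group in by_time.values()]

    # Across groups that differ in operational time, the agent lacks a
    # preference, so it chooses stochastically among them.
    return random.choice(best_per_time)
```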
And if and when an agent learns POST, I think Timestep Dominance is a simple and natural rule to learn. In terms of preferences, Timestep Dominance follows from POST plus a Comparability Class Dominance principle (CCD). And satisfying CCD seems like a prerequisite for capable agency. Behaviourally, 'don't pay costs to shift probability mass between shutdowns at different timesteps' follows from POST plus another principle that seems like a prerequisite for minimally sensible action under uncertainty.
And once you've got POST (I argue), you can train for Timestep Dominance without worrying about deceptive alignment, because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance. By contrast, if you instead train for 'some goal + honesty', deceptive alignment is a real concern.
Timestep Dominance is indeed sensitive to unlikely conditionals, but in practice I expect the training regimen to involve just giving the agent lower reward for paying costs to shift probability mass between shutdowns at different timesteps. Maybe the agent starts out by learning a heuristic to that effect: 'Don't pay costs to shift probability mass between shutdowns at different timesteps.' If and when the agent starts reflecting and replacing heuristics with cleaner principles, Timestep Dominance is the natural replacement (because it usually delivers the same verdicts as the heuristic, and because it follows from POST plus CCD). And Timestep Dominance (like the heuristic) keeps the agent shutdownable, at least in cases where the unlikely conditionals are favourable; I agree that it's unclear exactly how often this will be the case.
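Here's a rough sketch of the kind of labelling rule I have in mind (simplified: lotteries are treated as maps from shutdown timesteps to conditional expected utilities, which glosses over parts of the full Timestep Dominance definition, and the function names are just for illustration):

```python
# Simplified, illustrative sketch: a 'lottery' is a dict mapping each possible
# shutdown timestep to the expected utility conditional on shutdown at that
# timestep. This glosses over how probability mass is spread across timesteps,
# which the full Timestep Dominance definition handles more carefully.

def timestep_dominates(x, y):
    """True if x is at least as good as y conditional on shutdown at every
    timestep, and strictly better conditional on at least one timestep.
    Assumes x and y are defined over the same timesteps."""
    return all(x[t] >= y[t] for t in x) and any(x[t] > y[t] for t in x)

def label_reward(chosen, alternatives, base_reward, penalty=1.0):
    """Give lower reward when the chosen lottery is timestep-dominated by an
    available alternative, i.e. when the agent paid costs at some timestep
    without doing better conditional on any shutdown timestep."""
    if any(timestep_dominates(alt, chosen) for alt in alternatives):
        return base_reward - penalty
    return base_reward
```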
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe. So it seems like there’s this insuperable barrier to ensuring that honesty generalizes far, even in the absence of deceptive alignment.
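A toy illustration of why the training signal can't separate these rules (hypothetical data, purely for concreteness): on labelled cases the truth coincides with what the trainers believe, so both rules earn identical reward and only come apart off-distribution.

```python
# Toy illustration with hypothetical data: two candidate rules agree wherever
# we can label, so reward can't distinguish them; they come apart only where
# the truth differs from what the trainers believe.
training_cases = [("A", "A"), ("B", "B")]   # (truth, trainer_belief): these coincide
held_out_case = ("A", "B")                  # truth differs from trainer belief

def report_truth(truth, belief):
    return truth

def report_belief(truth, belief):
    return belief

def reward(rule, cases):
    # We can only reward agreement with what we (the trainers) believe.
    return sum(rule(truth, belief) == belief for truth, belief in cases)

assert reward(report_truth, training_cases) == reward(report_belief, training_cases)
print(report_truth(*held_out_case), report_belief(*held_out_case))  # 'A' vs 'B'
```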
By contrast, it doesn’t seem like there’s any parallel barrier to getting POST and Timestep Dominance to generalize far. Suppose we train for POST, but then recognise that our training regimen might lead the agent to learn some other rule instead, and that this other rule will lead the AI to behave differently in some situations. In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’.
Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But if you do this, you should also consider whether 'telling them they should be obedient and then breeding for this' would also work.)
Do you think it’s natural to generalize to extremely unlikely conditionals that you’ve literally never been trained on (because they are sufficiently unlikely that they would never happen)?
I don’t think human selective breeding tells us much about what’s simple and natural for AIs. HSB seems very different from AI training. I’m reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It’s probably hard to get next-token predictors via HSB, but you can do it via AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
I was asking about HSB not because I think it's similar to the process that produces AIs, but because if the answer differs, then it implies you're making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that an AI trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip-maximization objective might generalize to maximize paperclips in some very different circumstance. But I don't necessarily buy that this is how the paperclip-maximization-trained AI will generalize!
(I’m picking up this thread from 7 months ago, so I might be forgetting some important details.)
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe.
Sure, but this objection also seems to apply to POST/TD, for "actually shutting the AI down because it acted catastrophically badly" vs "getting shut down in cases where humans are in control". It will depend on the naturalness of this sort of reasoning, of course. If you think the AI reasons about these two things exactly identically, then it would be more likely to work.
In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance.
It seems like you're assuming a particular sequencing here, where you get a particular preference early and this then prevents deceptive alignment later. But you could also have a case where the AI first has the preference you wanted, and then SGD makes it deceptively aligned later with different preferences, and it merely pretends (if, e.g., inductive biases favor deceptive alignment).
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I’m not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).
It seems like your proposal in the post (section 16) requires some things that could be specific to the lab setting (perfect replayability, for instance). (I'm also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me, and I think I can counterexample it pretty well even in the regime where n is infinite. I'm also not sold on the claim that stochastically choosing generalizes how you want. I see the footnote, but I think my objection stands.
(I'm probably not going to justify this, sorry.)