I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times differ.
And once you’ve got POST (I argue), you can train for Timestep Dominance without worrying about deceptive alignment, because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance. By contrast, if you instead train for ‘some goal + honesty’, deceptive alignment is a real concern.
Timestep Dominance is indeed sensitive to unlikely conditionals, but in practice I expect the training regimen to involve just giving lower reward to the agent for paying costs to shift probability mass between shutdowns at different timesteps. Maybe the agent starts out by learning a heuristic to that effect: ‘Don’t pay costs to shift probability mass between shutdowns at different timesteps’. If and when the agent starts reflecting and replacing heuristics with cleaner principles, Timestep Dominance is the natural replacement (because it usually delivers the same verdicts as the heuristic, and because it follows from POST plus CCD). And Timestep Dominance (like the heuristic) keeps the agent shutdownable (at least in cases where the unlikely conditionals are favourable. I agree that it’s unclear exactly how often this will be the case).
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe. So it seems like there’s this insuperable barrier to ensuring that honesty generalizes far, even in the absence of deceptive alignment.
By contrast, it doesn’t seem like there’s any parallel barrier to getting POST and Timestep Dominance to generalize far. Suppose we train for POST, but then recognise that our training regimen might lead the agent to learn some other rule instead, and that this other rule will lead the AI to behave differently in some situations. In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’.
Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider if “telling them should be obedient and then breeding for this” would also work.)
Do you think it’s natural to generalize to extremely unlikely conditionals that you’ve literally never been trained on (because they are sufficiently unlikely that they would never happen)?
I don’t think human selective breeding tells us much about what’s simple and natural for AIs. HSB seems very different from AI training. I’m reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It’s probably hard to get next-token predictors via HSB, but you can do it via AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
I was asking about HSB not because I think it is similar to the process about AIs but because if the answer differs, then it implies your making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip maximization objective might generalize to maximize paperclips in some very different circumstance. But, I don’t necessarily buy this is how the paperclip-maximization-trained AI will generalize!
(I’m picking up this thread from 7 months ago, so I might be forgetting some important details.)
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe.
Sure, but this objection also seems to apply to POST/TD, but for “actually shutting the AI down because it acted catastrophically badly” vs “getting shutdown in cases where humans are in control”. It will depend on the naturalness of this sort of reasoning of course. If you think the AI reasons about these two things exactly identically, then it would be more likely work.
In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance.
It seems like you’re assuming a particular sequencing here where you get a particular preference early and then this avoids you getting deceptive alignment later. But, you could also have that the AI first has the preference you wanted and then SGD makes it deceptively aligned later with different preferences and it merely pretends later. (If e.g., inductive biases favor deceptive alignment.)
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I’m not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).
It seems like your proposal in the post (section 16) requires some things could be specific to the lab setting (perfect replayability for instance). (I’m also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me and I think I can counterexample it pretty well even in the regime where n is infinite. I’m also not sold by the claim that stocastically choosing generalizes how you want. I see the footnote, but I think my objection stands.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times differ.
And if and when an agent learns POST, I think Timestep Dominance is a simple and natural rule to learn. In terms of preferences, Timestep Dominance follows from POST plus a Comparability Class Dominance principle (CCD). And satisfying CCD seems like a prerequisite for capable agency. Behaviourally, ‘don’t pay costs to shift probability mass between shutdowns at different timesteps’ follows from POST plus another principle that seems like a prerequisite for minimally sensible action under uncertainty.
And once you’ve got POST (I argue), you can train for Timestep Dominance without worrying about deceptive alignment, because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance. By contrast, if you instead train for ‘some goal + honesty’, deceptive alignment is a real concern.
Timestep Dominance is indeed sensitive to unlikely conditionals, but in practice I expect the training regimen to involve just giving lower reward to the agent for paying costs to shift probability mass between shutdowns at different timesteps. Maybe the agent starts out by learning a heuristic to that effect: ‘Don’t pay costs to shift probability mass between shutdowns at different timesteps’. If and when the agent starts reflecting and replacing heuristics with cleaner principles, Timestep Dominance is the natural replacement (because it usually delivers the same verdicts as the heuristic, and because it follows from POST plus CCD). And Timestep Dominance (like the heuristic) keeps the agent shutdownable (at least in cases where the unlikely conditionals are favourable. I agree that it’s unclear exactly how often this will be the case).
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe. So it seems like there’s this insuperable barrier to ensuring that honesty generalizes far, even in the absence of deceptive alignment.
By contrast, it doesn’t seem like there’s any parallel barrier to getting POST and Timestep Dominance to generalize far. Suppose we train for POST, but then recognise that our training regimen might lead the agent to learn some other rule instead, and that this other rule will lead the AI to behave differently in some situations. In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider if “telling them should be obedient and then breeding for this” would also work.)
Do you think it’s natural to generalize to extremely unlikely conditionals that you’ve literally never been trained on (because they are sufficiently unlikely that they would never happen)?
I don’t think human selective breeding tells us much about what’s simple and natural for AIs. HSB seems very different from AI training. I’m reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It’s probably hard to get next-token predictors via HSB, but you can do it via AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
I was asking about HSB not because I think it is similar to the process about AIs but because if the answer differs, then it implies your making some narrower assumption about the inductive biases of AI training.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip maximization objective might generalize to maximize paperclips in some very different circumstance. But, I don’t necessarily buy this is how the paperclip-maximization-trained AI will generalize!
(I’m picking up this thread from 7 months ago, so I might be forgetting some important details.)
Sure, but this objection also seems to apply to POST/TD, but for “actually shutting the AI down because it acted catastrophically badly” vs “getting shutdown in cases where humans are in control”. It will depend on the naturalness of this sort of reasoning of course. If you think the AI reasons about these two things exactly identically, then it would be more likely work.
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
It seems like you’re assuming a particular sequencing here where you get a particular preference early and then this avoids you getting deceptive alignment later. But, you could also have that the AI first has the preference you wanted and then SGD makes it deceptively aligned later with different preferences and it merely pretends later. (If e.g., inductive biases favor deceptive alignment.)
Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I’m not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).
It seems like your proposal in the post (section 16) requires some things could be specific to the lab setting (perfect replayability for instance). (I’m also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me and I think I can counterexample it pretty well even in the regime where n is infinite. I’m also not sold by the claim that stocastically choosing generalizes how you want. I see the footnote, but I think my objection stands.
(I’m probably not going to justify this sorry.)