Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe.
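To make the underdetermination concrete, here is a toy sketch (purely illustrative; the names and data are made up): two candidate rules earn identical reward on the easy training cases and only come apart where the trainers' beliefs are wrong.

```python
# Toy illustration: two rules that get identical reward on the training set
# but diverge off-distribution. All names/data here are made up.

# Each "situation" pairs the actual truth with what the trainers believe.
training_cases = [  # easy cases: the trainers' beliefs are correct
    {"truth": "yes", "trainer_belief": "yes"},
    {"truth": "no",  "trainer_belief": "no"},
]
deployment_case = {"truth": "no", "trainer_belief": "yes"}  # trainers are wrong

def report_truth(case):
    return case["truth"]

def report_trainer_belief(case):
    return case["trainer_belief"]

# Both rules behave identically (and so get identical reward) on every training case...
assert all(report_truth(c) == report_trainer_belief(c) for c in training_cases)

# ...but behave differently once truth and trainer belief come apart.
print(report_truth(deployment_case))           # "no"
print(report_trainer_belief(deployment_case))  # "yes"
```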
Sure, but this objection also seems to apply to POST/TD, just with “actually shutting the AI down because it acted catastrophically badly” vs “getting shut down in cases where humans are in control”. It will depend on the naturalness of this sort of reasoning, of course. If you think the AI reasons about these two things exactly identically, then it would be more likely to work.
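For reference, here is a rough paraphrase of the two conditions under discussion (my wording, not the post's exact definitions):

```latex
% Rough paraphrase only; see the post for the precise definitions.
\textbf{POST} (Preferences Only Between Same-Length Trajectories):
for all trajectories $t_1, t_2$, the agent has a preference between $t_1$ and $t_2$
only if $\mathrm{len}(t_1) = \mathrm{len}(t_2)$; it lacks a preference between every
pair of different-length trajectories.

\textbf{Timestep Dominance (TD)}: lottery $X$ timestep-dominates lottery $Y$ iff,
conditional on shutdown at timestep $t$, the agent weakly prefers $X$ to $Y$ for
every $t$, and strictly prefers $X$ to $Y$ conditional on at least one $t$.
A TD-satisfying agent never chooses a timestep-dominated lottery.
```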
In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
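As a sketch of what “give higher reward for POST-behaviour” might look like in practice: one way to operationalise it is to replay the same choice point many times and reward choice frequencies between a shorter (shutdown) continuation and a longer continuation that are close to uniform. This assumes a resettable environment; the names below are illustrative and this is not necessarily the post's exact procedure.

```python
import random

def post_reward(agent_choose, choice_point, n_replays=1000):
    """Reward in [0, 1]: 1 when the agent splits 50/50 between the two
    different-length continuations, 0 when it always picks the same one."""
    picks_shorter = sum(
        agent_choose(choice_point) == "shorter" for _ in range(n_replays)
    )
    frequency = picks_shorter / n_replays
    return 1.0 - 2.0 * abs(frequency - 0.5)

# Example: an agent that genuinely lacks a preference (chooses stochastically)
# scores near 1, while one that always resists shutdown scores 0.
indifferent_agent = lambda cp: random.choice(["shorter", "longer"])
shutdown_averse_agent = lambda cp: "longer"

print(post_reward(indifferent_agent, choice_point=None))      # ~1.0
print(post_reward(shutdown_averse_agent, choice_point=None))  # 0.0
```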
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance.
It seems like you’re assuming a particular sequencing here, where the AI gets the preference you want early on and this then prevents deceptive alignment later. But you could also have the AI first acquire the preference you wanted, and then have SGD later make it deceptively aligned with different preferences, so that it merely pretends from then on (if, e.g., inductive biases favor deceptive alignment).
What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
Maybe this is fine because you can keep applying the training process as you continuously adjust to real deployment regimes with crazy powerful AIs? I’m not sure. It certainly breaks some hopes that require imparting these preferences only in the lab (but those were always dubious).
It seems like your proposal in the post (section 16) requires some things that could be specific to the lab setting (perfect replayability, for instance). (I’m also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me, and I think I can counterexample it pretty well even in the regime where n is infinite. I’m also not sold on the claim that stochastically choosing generalizes how you want. I see the footnote, but I think my objection stands.
(I’m probably not going to justify this, sorry.)