But the theorems are evidence that RL leads to catastrophe at optimum, at least.
RL with a randomly chosen reward leads to catastrophe at optimum.
I proved that optimal policies are generally power-seeking in MDPs.
The proof is for randomly distributed rewards.
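To make that concrete, here is a minimal toy sketch (my own construction, not the paper's formal setup): a small deterministic MDP where the start state can either enter an absorbing "shutdown" state or a hub that keeps two further terminal options reachable. Sampling state rewards i.i.d. Uniform(0, 1) and solving for the optimal policy, we can check how often that policy avoids shutting down. The state names, discount factor, and reward distribution below are all illustrative assumptions.

```python
# Toy sketch of "optimal policies for random rewards tend to keep options open".
# This is a hand-rolled example, not the paper's formal environment.
import numpy as np

# State indices (hypothetical names): 0 start, 1 hub, 2 room_a, 3 room_b, 4 shutdown.
# successors[s] lists the states reachable in one step (deterministic actions).
successors = {0: [1, 4], 1: [2, 3], 2: [2], 3: [3], 4: [4]}
gamma = 0.99
rng = np.random.default_rng(0)

def optimal_first_action(reward, iters=1000):
    """Value-iterate, then return the start state's optimal successor."""
    v = np.zeros(5)
    for _ in range(iters):
        v = np.array([reward[s] + gamma * max(v[t] for t in successors[s])
                      for s in range(5)])
    return max(successors[0], key=lambda t: v[t])

trials = 1000
avoided_shutdown = sum(
    optimal_first_action(rng.uniform(size=5)) != 4 for _ in range(trials)
)
print(f"optimal policy avoided shutdown in {avoided_shutdown / trials:.0%} of draws")
# For gamma near 1 this comes out around 2/3: the branch with two reachable
# terminal options beats the single shutdown option for most sampled rewards.
```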
Ben’s main critique is that goals evolve in tandem with capabilities, and that goals will be determined by what humans care about. These are specific reasons to deny the conclusion of an analysis of random rewards.
(A random Python program will error with near-certainty, yet somehow I still manage to write Python programs that don’t error.)
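(A quick back-of-the-envelope check of that analogy, as my own sketch rather than anything from the discussion: draw short random character strings and see how many even parse as Python, let alone run without error.)

```python
# Sketch: how often does a uniformly random string even parse as Python?
import random
import string

rng = random.Random(0)
trials = 10000
parses = 0
for _ in range(trials):
    program = "".join(rng.choices(string.printable, k=40))
    try:
        compile(program, "<random>", "exec")
        parses += 1
    except (SyntaxError, ValueError):
        pass
print(f"{parses}/{trials} random 40-character strings even parse")
# Only a tiny fraction parse (mostly strings that happen to begin with '#'),
# and essentially none would run usefully -- yet hand-written programs work
# routinely, because a designer is nothing like a uniform sampler.
```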
I do agree that this isn’t enough reason to say “there is no risk”, but it surely is important for determining absolute levels of risk. (See also this comment by Ben.)
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, I was referring to the point you made when I said:
“the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence”
Every specific attempt so far has been seemingly unsuccessful
Idk, I could say that every specific attempt made by the safety community to demonstrate risk has been seemingly unsuccessful, therefore systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such.
But I don’t really think I disagree with you here. I think this paper is good, provides support for the point “we should have good reason to believe an AI system is safe, and not assume it by default”, and responds to an in-fact incorrect argument of “but why would any AI want to kill us all, that’s just anthropomorphizing”.
But when someone says “These arguments depend on some concept of a ‘random mind’, but in reality it won’t be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply”, it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
Aside:
I don’t know that it makes me feel much better about future objectives being outer aligned.
I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil’s advocate here. (I am also not convinced of the negation.)
it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
I think I agree with most of what you’re arguing.