Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
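To make that concrete, here is a minimal sketch (my own hypothetical toy example, not the paper’s formalism): in a five-state MDP with three interchangeable “live” states and one absorbing shutdown state, drawing state rewards i.i.d. from Uniform[0, 1] and solving with value iteration makes “stay alive” the optimal choice for roughly three-quarters of the sampled reward functions, simply because the live region contains more options. Everything below (the MDP, the `optimal_action_at_start` helper, the constants) is made up for illustration.

```python
# A minimal sketch (hypothetical toy example, not the paper's formalism):
# estimate how often the optimal policy avoids an absorbing "shutdown" state
# when state rewards are drawn i.i.d. Uniform[0, 1].
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.9

# Deterministic transitions next_state[s, a]; reward is received for the state
# you land in. State 0 = start, states 1-3 = interchangeable "live" states,
# state 4 = shutdown (absorbing).
next_state = np.array([
    [1, 4, 4],   # start: action 0 enters the live region, actions 1-2 shut down
    [1, 2, 3],   # live states: move freely among themselves
    [1, 2, 3],
    [1, 2, 3],
    [4, 4, 4],   # shutdown: absorbing
])

def optimal_action_at_start(reward, n_iters=100):
    """Value iteration for state rewards `reward`; return the greedy action in the start state."""
    v = np.zeros(5)
    for _ in range(n_iters):
        v = (reward[next_state] + GAMMA * v[next_state]).max(axis=1)
    q0 = reward[next_state[0]] + GAMMA * v[next_state[0]]
    return int(q0.argmax())

n_samples = 5_000
n_stay_alive = sum(optimal_action_at_start(rng.uniform(size=5)) == 0
                   for _ in range(n_samples))
print(n_stay_alive / n_samples)  # roughly 0.75 in this toy
```

The point is only directional: with i.i.d. rewards, the state that keeps more options open is optimal for most draws, which is the flavor of effect the formal results are about.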
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, I was referring to the point you made when I said
“the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence”
“Every specific attempt so far has been seemingly unsuccessful”
Idk, I could say that every specific attempt made by the safety community to demonstrate risk has been seemingly unsuccessful, therefore systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such.
But I don’t really think I disagree with you here. I think this paper is good, provides support for the point “we should have good reason to believe an AI system is safe, and not assume it by default”, and responds to an in-fact incorrect argument of “but why would any AI want to kill us all, that’s just anthropomorphizing”.
But when someone says “These arguments depend on some concept of a ‘random mind’, but in reality it won’t be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply”, it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
Aside:
“I don’t know that it makes me feel much better about future objectives being outer aligned.”
I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil’s advocate here. (I am also not convinced of the negation.)
“it seems like a response of the form ‘we have support for IC, not just in random minds, but also for random reward functions’ has not responded to the critique and should not be expected to be convincing to that person.”
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
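For what it’s worth, the trivial escape hatches are easy to see even in the toy sketch above (again my own hypothetical example, not anything from the paper): you can make shutdown optimal for essentially every sampled reward function, but only by drawing the shutdown state’s reward from a distribution that dominates everything else, i.e. by directly rewarding shutting down. The `stays_alive` and `frac_alive` helpers and both reward distributions below are made up for illustration.

```python
# Same hypothetical 5-state toy MDP as above: shifting the shutdown state's
# reward distribution upward eliminates the option-preserving behavior, but
# only by making shutdown directly attractive.
import numpy as np

rng = np.random.default_rng(1)
GAMMA = 0.9
next_state = np.array([[1, 4, 4], [1, 2, 3], [1, 2, 3], [1, 2, 3], [4, 4, 4]])

def stays_alive(reward, n_iters=100):
    """True iff the optimal policy in the start state (0) avoids shutdown (state 4)."""
    v = np.zeros(5)
    for _ in range(n_iters):
        v = (reward[next_state] + GAMMA * v[next_state]).max(axis=1)
    return (reward[next_state[0]] + GAMMA * v[next_state[0]]).argmax() == 0

def frac_alive(sample_reward, n=2_000):
    """Fraction of sampled reward functions whose optimal policy avoids shutdown."""
    return sum(stays_alive(sample_reward()) for _ in range(n)) / n

def iid():
    return rng.uniform(size=5)                                   # baseline: i.i.d. rewards

def shifted():
    return np.append(rng.uniform(size=4), rng.uniform(2, 3))     # shutdown reward always dominates

print(frac_alive(iid), frac_alive(shifted))  # roughly 0.75 vs. 0.0
```

What we don’t have is a formal handle on whether there are non-degenerate distributions, of the kind a real training process would induce, that avoid the effect.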
I think I agree with most of what you’re arguing.