Every specific attempt so far has been seemingly unsuccessful
Idk, by the same logic I could say that every specific attempt made by the safety community to demonstrate risk has seemingly been unsuccessful, and therefore that systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such.
But I don’t really think I disagree with you here. I think this paper is good: it supports the point that “we should have good reason to believe an AI system is safe, and not assume it by default”, and it responds to the (in fact incorrect) argument “but why would any AI want to kill us all, that’s just anthropomorphizing”.
But when someone says “These arguments depend on some concept of a ‘random mind’, but in reality it won’t be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply”, it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not engaged with the critique and should not be expected to convince that person.
Aside:
I don’t know that it makes me feel much better about future objectives being outer aligned.
I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil’s advocate here. (I am also not convinced of the negation.)
it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not engaged with the critique and should not be expected to convince that person.
I agree that the paper should not be viewed as anything more than slight Bayesian evidence about the difficulty of real objective distributions. IIRC I was trying to reply to the point “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
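To make “exists generically” a bit more concrete, here is a toy sketch of the flavor of result (entirely my own construction, not the paper’s formal setup): in a tiny branching MDP where one action locks you into a single terminal state and the other keeps three states reachable, a reward function sampled uniformly at random usually makes the option-preserving action optimal. The MDP, the i.i.d. uniform rewards, and the numbers below are illustrative assumptions.

```python
# Toy illustration only (not the paper's formalism): sample reward functions
# uniformly at random and see how often the optimal policy keeps its options open.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10_000
option_preserving_wins = 0

for _ in range(n_trials):
    # One action jumps to a lone terminal state; the other keeps three
    # states reachable. Rewards are i.i.d. uniform on [0, 1].
    r_terminal = rng.uniform()
    r_open = rng.uniform(size=3)

    # Ignoring discounting subtleties, the optimal policy just compares the
    # best achievable reward down each branch.
    if r_open.max() > r_terminal:
        option_preserving_wins += 1

print(f"Option-preserving action optimal in {option_preserving_wins / n_trials:.1%} of sampled rewards")
# Expect ~75%: the max of three i.i.d. uniforms beats one uniform 3/4 of the time.
```

The ~75% here is just the “more reachable options usually contain the best reward” intuition; the paper’s formal results are much more careful than this, and, as above, say nothing about how hard IC is to avoid if you try.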
I think I agree with most of what you’re arguing.