I’m sympathetic to some of your arguments, but even if we accept that the current paradigm will lead us to an AI that is pretty similar to a human mind, I’m not super optimistic that a scaled-up, random almost-human is a great outcome, even in the best case. I simply disagree where you say this:
>For example, humans are not perfectly robust. I claim that for any human, no matter how moral, there exist adversarial sensory inputs that would cause them to act badly. Such inputs might involve extreme pain, starvation, exhaustion, etc. I don’t think the mere existence of such inputs means that all humans are unaligned.
Humans aren’t that aligned at the extreme, and the extreme matters when we’re talking about the smartest entity making every important decision about everything.
Also, your general arguments that the current paradigms are not that bad are reasonable, but again, I think our situation is a lot closer to all or nothing. If we get pretty far with RLHF or whatever and scale up the model until it’s extremely smart, and thus eventually making every decision of consequence, then unless you got the alignment nearly perfect, the chance that the remaining problematic parts screw us over seems uncomfortably high to me.