That said, I can imagine scenarios where involving AGIs too early in a value-reflective process goes worse than, say, humans simply engaging in moral reflection without an AGI. For instance, I consider utilitarianism a basically incorrect model of human ethics, yet it is possible we hardcode utility functions into an AGI, which could constrain any reflection we do with the AGI’s help. I don’t mean to debate the pros or cons of any specific moral philosophy; the point is that when we are deeply confused about some aspects of moral philosophy ourselves, it is difficult to ask an AI to resolve that confusion for us without hardcoding certain biases or assumptions into the AI. This problem may be harder than the minimal alignment problem of not killing most humans.
I also think this problem extends beyond moral philosophy: in general, there is a risk of hardcoding metaphysical, epistemic, or technical assumptions into the AI without even knowing which assumptions we are smuggling in. Biological humans might still make progress on these questions because we can’t simply erase the parts of ourselves that are confused (not without neurosurgery or uploading or something). But we can fail to transmit our confusion to the AI, and the AI might end up confidently believing something that is incorrect or not what we wanted it to believe.
In general, this is a crux for me. I place fairly significant probability on moral realism being false, but I conceptualize alignment as “how do we reliably make an AI that implements values at all, without deception?”