I thought a solved alignment problem would imply a constant process of updating the AI's values to match the most recent human values. So if something does not lead to the human's expected terminal goals (such as enjoyable emotions), the human can indicate that outcome to the AI, and the AI would adjust its own goals accordingly.
The idea that the AI should defer to the “most recent” human values is an instance of exactly the sort of trap I’m worried about. I suspect we could be led down an incremental path of small value changes in practically any direction, one that could terminate in our willing and eager self-extinction or in permanent wireheading. But how much tyranny should present humanity be allowed to exercise over the choices of future humanity?
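To make the “incremental path” worry concrete, here is a toy random-walk sketch (my own illustration, not anything proposed in the discussion above): even if every individual value change is tiny relative to the values of the moment just before it, the cumulative drift after many such changes is unbounded, so “always defer to the most recent values” places no limit on where we end up.

```python
import random

# Toy illustration: "values" are a point in a 2-D value space. At each step
# the AI defers to the most recent human values, and those values are only
# ever nudged a little relative to the previous step -- yet after many steps
# they can wander arbitrarily far from where they started.
random.seed(0)

values = [0.0, 0.0]      # hypothetical starting values
start = list(values)
MAX_STEP = 0.01          # each individual change looks negligible

for _ in range(100_000):
    values = [v + random.uniform(-MAX_STEP, MAX_STEP) for v in values]

drift = sum((v - s) ** 2 for v, s in zip(values, start)) ** 0.5
print(f"Total drift after many tiny, individually acceptable changes: {drift:.2f}")
```

The point of the sketch is only that a constraint on the size of each step is not a constraint on the destination; nothing in the loop prevents the endpoint from being somewhere present-day humans would find horrifying.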
I don’t think “none” is as wise an answer as it might sound at first. Answering “none” implies a kind of moral relativism that none of us actually holds, and it would make us merely the authors of a process that ultimately destroys everything we currently value.
But the answer of “complete control of the future by the present” also seems obviously wrong, because we will learn about entirely new things worth caring about that we can’t predict now, and sometimes it is natural for what we like to change.
More fundamentally, I think the assumption that “human terminal goals” exist presumes too much. Specifically, it presumes that our desires, in anticipation and in retrospect, are destined to fundamentally and predictably cohere. I would bet money that this isn’t the case.