I am also scared of futures where “alignment is solved” under the current prevailing usage of “human values.”
Humans want things that we won’t end up liking, and prefer things that we will regret getting relative to other options that we previously dispreferred. We are remarkably ignorant of what we will, in retrospect, end up having liked, even over short timescales. Over longer timescales, we learn to like new things that we couldn’t have predicted a priori, meaning that even our earnest and thoughtfully-considered best guess of our preferences in advance will predictably be a mismatch for what we would have preferred in retrospect.
And this is not some kind of bug; it is centrally important to what it is to be a person: “growing up” requires a constant process of learning that you don’t actually like certain things you used to like, and that you now suddenly like new things. This truth ranges over all arenas of existence, from learning to like black coffee to realizing you want to have children.
I am personally partial to the idea of something like Coherent Extrapolated Volition. But it seems suspicious that I’ve never seen anybody on LW sketch out how a decision theory ought to behave in situations where the agent’s utility function will have predictably changed by the time the outcome arrives, so that the “best choice” is actually a currently dispreferred one. (In other words, situations where the “best choice” in retrospect and in expectation do not match.) It seems dangerous to throw ourselves into a future where “best-in-retrospect” wins every time, because I can imagine many alterations to my utility function that I definitely wouldn’t want to accept in advance, but which would make me “happier” in the end. And it also seems awful to accept a process by which “best-in-expectation” wins every time, because I think a likely result is that we are frozen into whatever our current utility function looks like forever. And I do not see any principled and philosophically obvious method by which we ought to arbitrate between in-advance and in-retrospect preferences.
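To make the tension concrete, here is a minimal toy sketch (the options and utility numbers are invented for illustration, not drawn from any particular proposal) of how the “best-in-expectation” and “best-in-retrospect” choices come apart when the utility function predictably changes before the outcome arrives:

```python
# Toy illustration: the agent must choose now, but its utility function
# will have changed (predictably) by the time the outcome is experienced.
# All options and numbers are invented for illustration.

options = ["stay_home", "move_away"]

# Utility function the agent has at decision time.
u_now = {"stay_home": 10, "move_away": 2}

# Utility function the agent will (predictably) have when the outcome arrives.
u_later = {"stay_home": 3, "move_away": 9}

best_in_expectation = max(options, key=lambda o: u_now[o])
best_in_retrospect = max(options, key=lambda o: u_later[o])

print(best_in_expectation)  # "stay_home" -- preferred by the current self
print(best_in_retrospect)   # "move_away" -- preferred by the future self

# A rule that always defers to u_now freezes the current values in place;
# a rule that always defers to u_later endorses any change the changed self
# would approve of, wireheading included. Nothing in the formalism says how
# to arbitrate between the two.
```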
Another way of saying the above is that “wanting” and “liking” ought to cohere, but how they ought to cohere seems tricky to define without baking in some question-begging assumptions.
I thought a solved alignment problem would involve a constant process of updating the AI’s values to track the most recent human values. So if something does not lead to the human’s expected terminal goals (such as enjoyable emotions), the human can indicate that outcome to the AI and the AI would adjust its own goals accordingly.
The idea that the AI should defer to the “most recent” human values is an instance of the sort of trap I’m worried about. I suspect we could be led down an incremental path of small value changes in practically any direction, one which could terminate in our willing and eager self-extinction or in permanent wireheading. But how much tyranny should present humanity be allowed to exercise over the choices of future humanity?
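To make the drift worry concrete, here is a toy sketch (all quantities are invented for illustration; this is not a model of any actual update rule) of how a long series of individually small, individually acceptable value changes can carry an agent arbitrarily far from where it started:

```python
import random

# Toy illustration of value drift through individually-acceptable steps.
# Each single update is small enough that the agent's current values
# endorse it, yet over many updates the values wander without bound.
# All numbers are invented for illustration.

random.seed(0)

values = 0.0      # position of the agent's values on some abstract axis
max_step = 0.05   # the largest change the current self is willing to accept

for _ in range(100_000):
    proposed_change = random.uniform(-max_step, max_step)
    values += proposed_change  # each step is individually unobjectionable

print(f"net value change after many small, accepted steps: {values:.2f}")
# An undirected random walk like this drifts roughly with the square root of
# the number of steps; a process with even a slight directional pull (say, an
# optimizer nudging toward easy-to-satisfy values) drifts linearly instead.
```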
I don’t think “none” is as wise an answer as it might sound at first. To answer “none” implies a kind of moral relativism that none of us actually hold, and which would make us merely the authors of a process that ultimately destroys everything we currently value.
But also, the answer of “complete control of the future by the present” seems obviously wrong, because we will learn about entirely new things worth caring about that we can’t predict now, and sometimes it is natural to change what we like.
More fundamentally, I think the assumption that there exist “human terminal goals” presumes too much. Specifically, it’s an assumption that presumes that our desires, in anticipation and in retrospect, are destined to fundamentally and predictably cohere. I would bet money that this isn’t the case.
The implication that AI should do everything it could do, all at once, is unfortunate. The urgent objective of AI alignment is prevention of AI risk, for which a minimal solution is to take away all humans’ access to unrestricted compute, in a corrigible way that still allows its eventual desirable use. All other applications of AI could follow much later, through the corrigibility of this urgent application.
Yes, there is a broad class of wireheading solutions that we would want to avoid, and it is not clear how to specify a rule that distinguishes them from outcomes that we would want. When I was a small child I was certain that I would never want to move away from home. Then I grew up, changed my mind, and moved away from home. It is important that I was able to do something which a past version of myself would be horrified by. But this does not imply that there should be a general rule allowing all such changes. Understanding which changes to your utility function are good or bad is, as far as decision theory is concerned, undefined.