Rationality should be about winning, not avoiding the appearance of losing. In order to avoid the appearance of losing, we just have to look consistent in our choices. But in order to win, we have to find the right utility function and maximize that one. How likely is it that we’ll do that by prioritizing consistency with past choices above all else? (Which is clearly what this model is doing. It also seems to be the basic idea behind “unlosing agents”, but I’m less sure about that, so correct me if I’m wrong.) It seems to me that consistency ought to naturally fall out of doing the right things to win. Inconsistency is certainly a sign that something is wrong, but there is no point in aiming for consistency directly, as if it were a good thing in itself.
(Sorry for giving you another answer, but it seems useful to separate the points):
But in order to win, we have to find the right utility function and maximize that one.
The “right” utility function. Which we don’t currently know. And yet we can make decisions until the day we get it, and still make ourselves unexploitable in the meantime. The first AIs may well be in a similar situation.
A second, maybe more pertinent point: unlosing agents are also unexploitable (maybe I should have called them that to begin with). This is a very useful thing for any agent to be, especially one whose values have not yet gelled (such as, e.g., an FAI still in the process of estimating the CEV).
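To make “unexploitable without a settled utility function” concrete, here is a minimal sketch, with the caveat that the class, the cycle check and every name in it are illustrative assumptions rather than anything proposed in this exchange: an agent with no explicit utility function records every strict choice it makes and refuses any new choice that would complete a preference cycle, which is exactly the structure a money pump needs.

```python
# Toy sketch of an "unlosing"/unexploitable chooser: no explicit utility
# function, just a record of past strict choices and a refusal to ever
# close a preference cycle (the structure a money pump would exploit).

class UnlosingChooser:
    def __init__(self):
        # revealed[x] = set of options that x has been strictly chosen over
        self.revealed = {}

    def _beats(self, start, target):
        """True if `start` already beats `target` via a chain of past choices."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.revealed.get(node, ()))
        return False

    def choose(self, a, b, tentative_preference):
        """Pick between a and b using a tentative (not yet settled) preference,
        but never contradict the existing record in a way that closes a cycle."""
        preferred, other = (a, b) if tentative_preference(a, b) else (b, a)
        if self._beats(other, preferred):  # the record already says otherwise
            preferred, other = other, preferred
        self.revealed.setdefault(preferred, set()).add(other)
        return preferred
```

However its tentative preferences wander, an adversary offering pairwise trades can never walk an agent like this around a loop back to a strictly dispreferred position, even though it never consults a utility function.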
I don’t see how unlosing agents are compatible with CEV. Running the unlosing algorithm gives you one utility function at the end, and running CEV gives you another. They would be the same only by coincidence. If you start by giving control to the unlosing algorithm, why would it then hand over control to CEV or change its utility function to CEV’s output (or not remove whatever hardwired switchover mechanism you might put in)? Your third comment seems to make essentially the same point as your second (this) comment, and the same response seems to apply.
Before running CEV, we are going to have to make a lot of decisions about what constitutes a human utility function, how to extract it, how to assess strength of opinions, how to resolve conflicts between different preferences in the same person, how to aggregate it, and so on. So the ultimate CEV is path-dependent: it depends on the outcome of those choices.
Using an unlosing algorithm could be seen as starting on the path to CEV earlier, before humans have made all those decisions, and letting the algorithm make some of those decisions rather than making them ourselves. This could be useful if some of the components on the path to CEV are things where our meta-decision skills (for instructing the unlosing agent how to resolve these issues) are better than our object-level decision skills (for resolving the issues directly).
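The path-dependence point can be illustrated with a standard toy example from social choice (nothing below is specific to CEV; the voters, the options and the sequential-pairwise rule are made-up stand-ins for “decisions about how to aggregate”): when group preferences are cyclic, the order in which pairwise conflicts get resolved determines which option wins, so the choices made before aggregation shape the final answer.

```python
# Three voters with cyclic ("Condorcet cycle") preferences over options A, B, C.
preferences = [
    ["A", "B", "C"],  # voter 1: A > B > C
    ["B", "C", "A"],  # voter 2: B > C > A
    ["C", "A", "B"],  # voter 3: C > A > B
]

def pairwise_winner(x, y):
    """Return whichever of x and y a majority of voters ranks higher."""
    votes_for_x = sum(1 for p in preferences if p.index(x) < p.index(y))
    return x if votes_for_x > len(preferences) / 2 else y

def sequential_aggregate(agenda):
    """Resolve conflicts pairwise, in the order given by the agenda."""
    winner = agenda[0]
    for challenger in agenda[1:]:
        winner = pairwise_winner(winner, challenger)
    return winner

# The same preferences, aggregated in different orders, give different outcomes.
for agenda in (["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]):
    print(agenda, "->", sequential_aggregate(agenda))
# ['A', 'B', 'C'] -> C
# ['B', 'C', 'A'] -> A
# ['C', 'A', 'B'] -> B
```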
If you’re going to design an agent with an adaptable utility (e.g. a value loader), then something like an unlosing design seems indicated.
You say “unlosing design seems indicated” for value loading, but I don’t see how these two ideas would work together at all. Can you give some sort of proof-of-concept design? Also, as I mentioned before, value loading seems to be compatible with UDT. What advantage does a value-loading, unlosing agent have over a UDT-based value loader?
On a more philosophical note, we seem to have different approaches. It’s my impression that you want to construct an idealised perfect system and then find a way of applying it to the real world. I seem to be coming up with tools that would allow people to take approximate “practical” ideas that have no idealised versions, and apply them in ways that are less likely to cause problems.
Would you say that is a fair assessment?
The reason an unlosing agent might be interesting is that it doesn’t have to have its values specified as a collection of explicit utility functions. It could instead have some differently specified system that converges to explicit utility functions as it gets more morally relevant data. Then an unlosing procedure would keep it unexploitable during this process.
In practice, I think requiring a value-loading agent to be unlosing might be too strong a requirement, as it could lock in some early decisions. I see a “mainly unlosing” agent (say, an imperfect value-loading agent with some unlosing characteristics) as more interesting and potentially safer.
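As a rough gesture at what a “mainly unlosing” value loader could look like, here is a sketch in which everything (the candidate-utility representation, the override threshold, all the names) is an assumption of mine rather than a design from this thread: values start as a set of candidate utility functions, morally relevant data prunes the candidates, and a past choice is only reversed when the surviving candidates overwhelmingly disagree with it.

```python
# Toy sketch of a "mainly unlosing" value loader: candidate utility functions
# are pruned by incoming value data, and past choices are sticky but not
# permanently locked in.

class MainlyUnlosingValueLoader:
    def __init__(self, candidate_utilities, override_threshold=0.9):
        self.candidates = list(candidate_utilities)  # callables: option -> float
        self.past_choices = set()                    # (chosen, rejected) pairs
        self.override_threshold = override_threshold

    def load_value_data(self, better, worse):
        """Morally relevant data: drop candidates that deny `better` beats `worse`
        (unless that would eliminate every candidate)."""
        survivors = [u for u in self.candidates if u(better) > u(worse)]
        self.candidates = survivors or self.candidates

    def choose(self, a, b):
        # Fraction of surviving candidates that prefer a to b.
        support_a = sum(u(a) > u(b) for u in self.candidates) / len(self.candidates)
        preferred, rejected = (a, b) if support_a >= 0.5 else (b, a)
        confidence = max(support_a, 1 - support_a)
        # "Mainly unlosing": only reverse a recorded past choice when the
        # surviving candidates are confident enough to justify it.
        if (rejected, preferred) in self.past_choices and confidence < self.override_threshold:
            preferred, rejected = rejected, preferred
        self.past_choices.add((preferred, rejected))
        return preferred
```

Pushing the override threshold above 1 makes the agent strictly unlosing (it never reverses a recorded choice), which is where the lock-in worry comes from; a lower threshold trades a little exploitability for the ability to revise early mistakes.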