In those extreme and unprecedented situations, we could end up with revealed preferences pointing one way, stated preferences another, and regret and CEV pointing in different directions entirely. In that case, we might be tempted to ask “should we follow regret or stated preferences?” But that would be the wrong question to ask: our methods would no longer be correlated with each other, let alone with some fundamental measure of human values.
The last part of this doesn’t make sense to me. CEV is rather underdefined, but Paul’s indirect normativity (which you also cited in the OP as being in the same category as CEV) essentially consists of a group of virtual humans in some ideal virtual environment trying to determine the fundamental measure of human values as best they can. Why would you not expect the output of that to be correlated with some fundamental measure of human values? If their output isn’t correlated with that, how can we expect to do any better?
Indirect normativity has specific failure modes: e.g. siren worlds, social pressures going bad, or humans getting very twisted in that ideal environment in ways that we can’t yet predict. More to the point, these failure modes are ones that we can talk about from outside. We can say things like “these precautions should prevent the humans from getting too twisted, but we can’t fully guarantee it”.
That means that we can’t use indirect normativity as a definition of human values, as we already know how it could fail. A better understanding of what values are could result in being able to automate the checking of whether it has failed or not, which would mean that we could include that check in the definition.
More to the point, these failure modes are ones that we can talk about from outside
So can the idealized humans inside a definition of indirect normativity, which motivates them to develop some theory and then quarantine parts of the process to examine their behavior from outside the quarantined parts. If that is allowed, any failure mode that can be fixed by noticing a bug in a running system becomes anti-inductive: if you can anticipate it, it won’t be present.
Yes, that’s the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.
But some of these problems are issues that I specifically came up with. I don’t trust that idealised non-mes would necessarily have realised these problems even if put in that idealised situation. Or they might have come up with them too late, after they had already altered themselves.
I also don’t think that I’m particularly special, so other people can and will think up problems with the system that hadn’t occurred to me or anyone else.
This suggests that we’d need to include a huge number of different idealised humans in the scheme. That, in turn, increases the chance of the scheme failing due to social dynamics, unless we design it carefully ahead of time.
So I think it is highly valuable to get a lot of people thinking about the potential flaws and improvements for the system before implementing it.
That’s why I think that “punting to the wiser versions of ourselves” is useful, but not a sufficient answer. The better we can solve the key questions (“what are these ‘wiser’ versions?”, “how is the whole setup designed?”, “what questions exactly is it trying to answer?”), the better those wiser versions of ourselves will be at their tasks.
The better we can solve the key questions (“what are these ‘wiser’ versions?”, “how is the whole setup designed?”, “what questions exactly is it trying to answer?”), the better those wiser versions of ourselves will be at their tasks.
I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but on reading the post it felt like it was making the claim “Unless we fully specify a correct theory of human values, we are doomed”.
I think that I’d view something like Paul’s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that’s in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).
Yes, that’s the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.
It’s not clear what the relevant difference is between then and now, so the argument that it’s more important to solve a problem now is as suspect as the argument that the problem should be solved later.
How are we currently in a better position to influence the outcome? If we are, then the reason for being in a better position is a more important feature of the present situation than object-level solutions that we can produce.
We have a much clearer understanding of the pressures we are under now than of the pressures that simulated versions of ourselves would be under in the future. Also, we agree much more strongly with the values of our current selves than with the values of possible simulated future selves.
Consequently, we should try to solve the problems of value alignment early, and punt the technical problems to our future simulated selves.
How are we currently in a better position to influence the outcome?
It’s not particularly a question of influencing the outcome, but of reaching the right solution. It would be a tragedy if our future selves had great influence, but pernicious values.
I think that I’d view something like Paul’s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans
I currently agree with this view. But I’d add that a theory of human values is a direct way to solve some of the critical considerations.