Roko is adopting a special and unusual metamoral framework in regarding “Most agents do X!” as a compelling reason to change one’s utility function. Why might Roko find this appealing? Humans, for very understandable reasons of evolutionary psychology, have a universalizing instinct; we think that a valid argument should persuade anyone.
Perhaps this can be fixed; suppose we define Q := “moral(X) := ‘a supermajority of agents which accept Q consider X moral’”. Then agents accepting Q cannot agree to disagree, and Q-based arguments are capable of convincing any Q-implementing agent.
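To make the self-reference explicit, here is one way to write Q out (my notation, not Roko’s: A_Q is the set of agents that accept Q, and θ is an assumed supermajority threshold, which the definition above leaves open):

$$Q:\qquad \mathrm{moral}(X) \iff \bigl|\{\,a \in A_Q : a \text{ believes } \mathrm{moral}(X)\,\}\bigr| > \theta \, |A_Q|.$$

The circularity is plain in this form: moral appears on both sides, and A_Q is itself defined in terms of accepting this very equivalence.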
On the other hand, the universe could stably be in a state in which agents which accept Q mostly believe moral(torture); each of them then sees a supermajority affirming it, so they all continue to believe it. However, this is unsurprising: there is no way to force everyone to agree on what is “moral” (no universally compelling arguments), so why should Q-agents necessarily agree with us?
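As a sanity check on that claim, here is a self-contained toy simulation (my construction, not anything in Roko’s post; the 2/3 threshold and the synchronous “adopt the consensus” update rule are assumptions). A population that starts out unanimously believing moral(torture) never leaves that state:

```python
THRESHOLD = 2 / 3  # assumed supermajority threshold; Q as stated doesn't fix one

def supermajority_moral(x, beliefs):
    """True iff more than THRESHOLD of the agents' belief-sets contain x."""
    return sum(x in b for b in beliefs) > THRESHOLD * len(beliefs)

def update_round(beliefs, candidates):
    """One synchronous round: every agent adopts the current consensus set."""
    consensus = {x for x in candidates if supermajority_moral(x, beliefs)}
    return [set(consensus) for _ in beliefs]

beliefs = [{"torture"} for _ in range(100)]      # unanimous, grim starting point
for _ in range(10):                              # iterate the update dynamics
    beliefs = update_round(beliefs, {"torture", "charity"})
print(beliefs[0])                                # {'torture'}: a fixed point
```

Nothing in Q itself supplies a force that would push the population off such a fixed point; that is the “no universally compelling arguments” point in miniature.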
But what we are left with seems to be a strange loop through the meta-level, with the distinction that it loops through not only the agent’s own meta-level but also the agent’s beliefs about other Q-agents’ beliefs.
However, I’m stripping out the bit about making instrumental values terminal, because I can’t see the point of it (and of course it leads to the “drive a car!” problem). Instead we take Q as our only terminal value; the shared pool of things-that-look-like-terminal-values {X : Q asserts moral(X)} is in fact our first layer of instrumental values.
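A toy rendering of that layering (the labels and threshold are mine): the agent’s one terminal value is Q itself, held fixed, while the pool {X : Q asserts moral(X)} is recomputed from other Q-agents’ current beliefs, which is exactly what makes it instrumental rather than terminal.

```python
THRESHOLD = 2 / 3  # assumed supermajority threshold

TERMINAL_VALUES = {"accept Q"}  # fixed; valued for its own sake, never recomputed

def instrumental_layer_one(candidates, q_agent_beliefs):
    """{X : Q asserts moral(X)}: derived afresh from other Q-agents' beliefs."""
    n = len(q_agent_beliefs)
    return {x for x in candidates
            if sum(x in b for b in q_agent_beliefs) > THRESHOLD * n}
```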
Also, I’m not endorsing the above as a coherent or effective metaethics. I’m just wondering whether it’s possible that it could be coherent or effective. In particular, is it PA+1 or Self-PA? Does it exhibit the failure mode of the Type 2 Calculator? After all, the system as a whole is defined as outputting what it outputs, but individual members are defined as outputting what everyone else outputs and therefore, um, my head hurts.