The “act as if it doesn’t believe its messages will be read” behaviour is part of its value function, not its decision theory. So we are only requiring the value function to be stable under self-improvement.
Why is that? The value function tells you what is important, but the “act” part requires decision theory.
What I mean is that I haven’t wired the decision theory to something odd (which might be removed by self-improvement), just chosen a particular value system (which has a much higher chance of being preserved by self-improvement).
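To make the distinction concrete, here is a toy sketch (the world model, message names, and utility numbers are all invented for illustration, not part of the original design). In the first agent only the value function is unusual: it scores the counterfactual world where the message goes unread, while the decision rule stays ordinary expected-utility maximisation. In the second, the decision procedure itself is rewired with a false belief, which is exactly the kind of odd wiring self-improvement might remove.

```python
# Toy illustration only; every name and number here is made up.

CANDIDATE_MESSAGES = ["honest report", "manipulative report"]

def world_model(message, message_is_read):
    """Stand-in for the agent's model of outcomes, given the message
    and whether a human actually reads it."""
    accuracy = {"honest report": 1.0, "manipulative report": 0.3}[message]
    # Manipulation only pays off if someone actually reads the message.
    bonus = 0.5 if (message_is_read and message == "manipulative report") else 0.0
    return accuracy + bonus

def argmax_expected_utility(value_fn):
    """Ordinary decision theory: pick the action with the highest value."""
    return max(CANDIDATE_MESSAGES, key=value_fn)

# Value-function approach (what the comment describes): the decision rule is
# the standard one above; only the utility function is unusual, scoring the
# counterfactual world in which the message is never read.
def counterfactual_value(message):
    return world_model(message, message_is_read=False)

# Decision-theory approach (what the comment is disclaiming): the decision
# procedure is rewired with a false belief that messages are never read,
# a quirk a self-improving agent has an incentive to repair.
def distorted_value(message, believed_p_read=0.0):
    return (believed_p_read * world_model(message, True)
            + (1 - believed_p_read) * world_model(message, False))

print(argmax_expected_utility(counterfactual_value))  # -> honest report
print(argmax_expected_utility(distorted_value))       # -> honest report
```

Both agents output the same message here, but the modification lives in different places: a successor that preserves the first agent's value function and keeps a standard decision theory keeps the behaviour, whereas the second relies on an odd decision procedure that self-improvement might well strip out.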