Would I think for ten thousand years?
Some AI safety ideas delegate key decisions to our idealised selves. This is sometimes phrased as “allowing versions of yourself to think for ten thousand years”, or similar sentiments.
Occasionally, when I’ve objected to these ideas, it’s been pointed out that any attempt to construct a safe AI design will involve a lot of thinking, so there can’t be anything wrong with delegating that thinking to an algorithm, or to an algorithmic version of myself.
But there is a tension between “more thinking” in the sense of “solving specific problems” and “more thinking” in the sense of “changing your own values”.
An unrestricted “do whatever a copy of Stuart Armstrong would have done after he thought about morality for ten thousand years” seems to positively beg for value drift (worsened by the difficulty in defining what we mean by “a copy of Stuart Armstrong [...] thought [...] for ten thousand years”).
A narrower “have ten copies of Stuart think about these ten theorems for a subjective week each and give me a proof or counter-example” seems much safer.
In between those two extremes, how do we assess the degree of value drift and its potential importance to the question being asked? Ideally, we’d have a theory of human values to help distinguish the cases. Even without that, we can apply some common sense to issues like the length of thought, the nature of the problem, the bandwidth of the output, and so on.
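As a purely illustrative sketch of how those factors might combine (not a real metric: the names like `DelegationProposal` and `drift_risk_score`, the weights, and the numbers are all made up for this example), one could imagine scoring a delegation proposal roughly like this:

```python
import math
from dataclasses import dataclass


@dataclass
class DelegationProposal:
    """A proposed delegation of thinking to simulated copies (illustrative only)."""
    subjective_years: float       # length of thought for each copy
    problem_is_value_laden: bool  # "think about morality" vs "prove this theorem"
    output_bits: int              # bandwidth of the returned answer


def drift_risk_score(p: DelegationProposal) -> float:
    """Toy heuristic combining the common-sense factors into a rough risk score.

    Higher scores mean more opportunity for value drift to matter.
    The weights are arbitrary placeholders, not a theory of human values.
    """
    score = math.log10(1.0 + p.subjective_years)        # longer thought, more drift
    score *= 3.0 if p.problem_is_value_laden else 1.0   # value-laden questions amplify drift
    score *= math.log2(2.0 + p.output_bits) / 10.0      # wide output channels let drift leak back out
    return score


# The two extremes above:
narrow = DelegationProposal(subjective_years=7 / 365,
                            problem_is_value_laden=False,
                            output_bits=10_000)
open_ended = DelegationProposal(subjective_years=10_000,
                                problem_is_value_laden=True,
                                output_bits=10**9)

print(drift_risk_score(narrow))      # small
print(drift_risk_score(open_ended))  # orders of magnitude larger
```

The point of the sketch is only that the narrow, short, low-bandwidth, value-free proposal scores far lower than the open-ended one; any actual assessment would need the theory of human values we don’t yet have.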