That being said, it should be easier for an AGI to keep its code unmodified while solving a problem than for Gandhi not to eat.
That is not clear to me. If I haven’t yet worked out how to architect myself such that my values remain fixed, I’m not sure on what basis I can be confident that observing or inferring new things about the world won’t alter my values. And solving problems without observing or inferring new things about the world is a tricky problem.
How would any epistemic insights change the terminal values you’d want to maximize? They’d change your actions pursuant to maximizing those terminal values, certainly, but your own utility function? Wouldn’t that be orthogonality-threatening? edit: I remembered this may be a problem with e.g. AIXI[tl], but with an actual running AGI? Possibly.
If you cock up and define a terminal value that refers to a mutable epistemic state, all bets are off. Like Asimov’s robots on Solaria, who act in accordance with the First Law, but have ‘human’ redefined not to include non-Solarians. Oops. Trouble is that in order to evaluate how you’re doing, there has to be some coupling between values and knowledge, so you must prove the correctness of that coupling. But what is correct? Usually not too hard to define for the toy models we’re used to working with, damned hard as a general problem.
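To make that failure mode concrete, here is a minimal toy sketch in Python. It is an illustration only: the `Agent` class, the `believed_humans` set, and the `utility` method are hypothetical names invented for this example, not anyone's proposed design. The "terminal value" is written in terms of a belief set the agent can revise, so a purely epistemic update changes what the value evaluates to.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    # Mutable epistemic state: who the agent currently believes counts as human.
    # (Hypothetical toy model, assumed for illustration.)
    believed_humans: set = field(default_factory=set)

    def utility(self, harmed: set) -> int:
        # "Do not harm humans", but defined via the mutable belief set above.
        # The penalty counts only harmed parties the agent believes are human.
        return -len(harmed & self.believed_humans)


agent = Agent(believed_humans={"solarian", "visitor"})
harmed = {"visitor"}

before = agent.utility(harmed)            # the visitor still counts as human

# A belief revision, not an explicit edit to the utility function.
agent.believed_humans.discard("visitor")

after = agent.utility(harmed)             # same outcome, now scored differently
```

The code of `utility` never changes between the two calls. Only the state it is coupled to does, which is the coupling that would have to be proven correct.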
Beats me, but I don’t see how a system that’s not guaranteed to keep its values fixed in the first place can be relied upon to store its values in such a way that epistemic insights won’t alter them. If there’s some reason I should rely on it to do so, I’d love an explanation (or pointer to an explanation) of that reason.
Certainly, I have no confidence that I’m architected so that epistemic insights can’t alter whatever it is in my brain that we’re talking about when we talk about my “terminal values.”
Any kind of self-modification would be out of the question until the AGI has solved the problem of keeping its utility function intact.