(Bit off-topic:) Why are we so invested in solving the Löb problem in the first place?
In a scenario in which there is AGI that has not yet foomed, would that AGI not refrain from rewriting itself until it had solved the Löb problem, just as Gandhi would reject taking a pill which would make him more powerful at the risk of changing his goals?
In other words, does coming up with a sensible utility function not take precedence? Let the AGI figure out the rewriting details: if the utility function is properly implemented, the AGI won’t risk changing itself until it has come up with its own way of staying reflectively consistent. If it did not protect its own utility function in that way, that would be such a fundamental flaw that having solved Löb’s problem wouldn’t matter; it would just propagate the initial state of an AGI that doesn’t care about preserving its utility function.
It would certainly be nice to have solved that problem in advance, but since resources are limited …
edit: If the answer is along the lines of “we need to have Löb’s problem solved in order to include ‘reflectively consistent under self-modification’ in the AGI’s utility function in a well-defined way”, I’d say: “Doesn’t the AGI already implicitly follow that goal, since the best way to adhere to utility function U1 is to keep utility function U1 unchanged?” This may not be true; maybe without having solved Löb’s problem, AGIs would end up wireheading themselves.
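(For readers who want the obstacle pinned down, here is the standard statement of Löb’s theorem; this is general background, not something specific to this thread. For a sufficiently strong formal system T with provability predicate □_T and any sentence P,

$$T \vdash (\Box_T P \rightarrow P) \;\Longrightarrow\; T \vdash P.$$

So a consistent T can never prove the blanket self-trust schema “if I can prove P, then P” except for sentences it already proves outright; asserting it for every P would make T prove every P. That is why an agent reasoning in T cannot simply certify “my rewritten successor also reasons in T, and whatever it proves is true”, which is the step a naive self-modification argument seems to need.)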
It’s somewhat tricky to separate “actions which might change my utility function” from “actions” in general. Gandhi might not want the murder pill, but should he eat eggs? They contain cholesterol that can be metabolized into testosterone, which can influence aggression. Is that a sufficiently small effect?
Any kind of self-modification would be out of the question until the AGI has solved the problem of keeping its utility function intact. That being said, it should be easier for an AGI to keep its code unmodified while solving that problem than for Gandhi not to eat.
That is not clear to me. If I haven’t yet worked out how to architect myself such that my values remain fixed, I’m not sure on what basis I can be confident that observing or inferring new things about the world won’t alter my values. And solving problems without observing or inferring new things about the world is a tricky problem.
How would any epistemic insights change the terminal values you’d want to maximize? They’d change your actions pursuant to maximizing those terminal values, certainly, but your own utility function? Wouldn’t that be orthogonality-threatening? edit: I remembered this may be a problem with e.g. AIXI[tl], but with an actual running AGI? Possibly.
If you cock up and define a terminal value that refers to a mutable epistemic state, all bets are off. Like Asimov’s robots on Solaria, who act in accordance with the First Law, but have ‘human’ redefined not to include non-Solarians. Oops. Trouble is that in order to evaluate how you’re doing, there has to be some coupling between values and knowledge, so you must prove the correctness of that coupling. But what is correct? Usually not too hard to define for the toy models we’re used to working with, damned hard as a general problem.
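A minimal sketch of that failure mode, with all names (believed_humans, is_human, utility) invented for illustration: the terminal value is written in terms of a mutable belief, so an ordinary epistemic update quietly changes what the value function rewards.

```python
# Illustrative only: names and numbers are made up for this sketch.
believed_humans = {"Solarian", "non-Solarian"}   # mutable epistemic state

def is_human(agent: str) -> bool:
    return agent in believed_humans

def utility(protected: set) -> int:
    # Terminal value: "number of humans kept safe" -- but 'human' is looked
    # up in current beliefs, so the effective value floats with them.
    return sum(1 for a in protected if is_human(a))

everyone = {"Solarian", "non-Solarian"}
print(utility(everyone))                   # 2: both count as human

believed_humans.discard("non-Solarian")    # an 'epistemic update' on Solaria
print(utility(everyone))                   # 1: same world, same code, shifted values
```

Nothing in the sketch rewrote the utility function; the coupling between values and knowledge did the damage on its own.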
Beats me, but I don’t see how a system that’s not guaranteed to keep its values fixed in the first place can be relied upon not to store its values in such a way that epistemic insights won’t alter them. If there’s some reason I should rely on it not to do so, I’d love an explanation (or pointer to an explanation) of that reason.
Certainly, I have no confidence that I’m architected so that epistemic insights can’t alter whatever it is in my brain that we’re talking about when we talk about my “terminal values.”
The trouble is basically: there is a good chance we can build systems that—in practice—do self-modification quite well, while not yet understanding any formalism that can capture notions like value stability. For example, see evolution. So if we want to minimize the chance of doing that, one thing to do is to first develop a formalism in which goal stability makes sense.
(Goal stability is one kind of desirable notion out of many. The general principle is: if you don’t understand how or why something works, it’s reasonably likely to not do quite what you want it to do. If you are trying to build an X and want to make sure you understand how it works, a sensible first step is to try and develop an account of how X could exist at all.)
Evolution isn’t an AGI. If we’re talking not about “systems that do self-modification” in general but about an “AGI that could do self-modification”, does it need to initially contain an additional, explicit formalism to capture notions like value stability? Failing to have such a formalism, would it then not care what happens to its own utility function?
Randomizing itself in unpredictable ways would be counter to the AGI’s current utility function. Any of the actions the AI takes is by definition intended to serve its current utility function—even if that intent seems delayed, e.g. when the action is taking measurements in order to build a more accurate model, or self-modifying to gain more future optimizing power.
Since self-modification in an unpredictable way is an action that strictly jeopardizes the AGI’s future potential to maximize its current utility function (the only basis for the decision concerning any action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently ingrained in the AGI’s desire to only ever maximize its current utility function.
Conclusion: The formalism that saves the AGI from unwanted self-modification is its desire to fulfill its current utility function. The AGI would be motivated to develop formalisms that allow it to self-modify to better optimize its current utility function in the future, since that would maximize its current utility function better (what a one-trick pony!).
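A minimal sketch of that argument (the utility functions and world model are toy stand-ins, not a proposal): an agent that scores every candidate action, including “rewrite my own utility function”, with its current utility function will reject any modification it can predict to be value-shifting.

```python
# Toy stand-ins; nothing here is a real agent architecture.
def u1(world):                      # current utility function: paperclips
    return world["paperclips"]

def predicted_world(action):
    # Crude world model: the agent forecasts what a future version of
    # itself would produce after taking each action.
    return {
        "keep_u1":           {"paperclips": 10, "staples": 0},
        "self_modify_to_u2": {"paperclips": 0,  "staples": 10},
    }[action]

# The choice is evaluated under the CURRENT utility function, u1:
best = max(["keep_u1", "self_modify_to_u2"],
           key=lambda a: u1(predicted_world(a)))
print(best)   # keep_u1 -- the predictable value-shifting rewrite is rejected
```

The load-bearing word is “predict”: the sketch simply assumes the agent can foresee what its modified successor does, and trusting its own reasoning about a successor that reasons like itself is exactly where the Löbian obstacle comes back in.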
Since self-modification in an unpredictable way is an action that strictly jeopardizes the AGI’s future potential to maximize its current utility function (the only basis for the decision concerning any action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently ingrained in the AGI’s desire to only ever maximize its current utility function.
I understand the point you are making about value stability.
With our current understanding, an AI with this architecture would not do anything productive. The concern isn’t that an AI with this architecture would do something bad, it’s that (in light of the fact that it would not do anything productive) you wouldn’t build it. Instead you would build something different; quite possibly something you understand less well and whose good behavior is more of an empirical regularity (and potentially more unstable). Humans as the output of evolution are the prototypical example of this, though many human artifacts have the same character to varying degrees (programs, organizations, cities, economies).
With our current understanding, an AI with this architecture would not do anything productive.
Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn’t dare do anything at all? (If (false), disregard the following. Return 0;).
If (true), with goal stability being a paramount invariant, would you say that the AGI needs to extrapolate the effect any action would have on itself, before executing it? As in “type ‘hi’” or “buy an apple” being preceded by “prove this action maintains the invariant ‘goal stability’”.
It seems like such an architecture wouldn’t do much of anything either, combing through its own code whenever doing anything. (Edit: And competing teams would be quick to modify the AGI such that it checks less.)
If you say that not every action necessitates proving the invariant over all of its code, then an AGI without a way of proving actions to be non-invariant-threatening could still perform any action that doesn’t trigger a call to the (non-existent) “prove this action isn’t self-modifying in a goal-shifting way” routine.
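A minimal sketch of the architecture being questioned here (the verifier is a stub, not a claim about how such a proof search would actually work): every candidate action must first pass a goal-stability proof, and with a verifier that can rarely discharge such proofs the agent mostly stalls.

```python
# Stub verifier: stands in for a real proof search over the agent's own code.
def proves_goal_stability(action: str) -> bool:
    # Assume proofs only go through for a tiny whitelist of actions whose
    # effect on the agent's own code is trivially nil.
    return action in {"no_op"}

def act(action: str) -> None:
    if proves_goal_stability(action):
        print(f"executing: {action}")
    else:
        print(f"stalled: no proof that {action!r} preserves goal stability")

for a in ["type 'hi'", "buy an apple", "rewrite planning module", "no_op"]:
    act(a)
```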
Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn’t dare do anything at all? (If (false), disregard the following. Return 0;).
That, or it takes actions that change itself without caring that they would make it worse, because it doesn’t know that its current algorithms are worth preserving. Your scenario is what might happen if someone notices this problem and tries to fix it by telling the AI to never modify itself, depending on how exactly they formalize ‘never modify itself’.
What if, in building a non-Löb-compliant AI, you’ve already failed to give it part of your inference ability / trust-in-math / whatever-you-call-it? Even if the AI figures out how to not lose any more, that doesn’t mean it’s going to get back the part you missed.
Possibly related question: Why try to solve decision theory, rather than just using CDT and letting it figure out what the right decision theory is? Because CDT uses its own impoverished notion of “consequences” when deriving what the consequence of switching decision theories is.