Since unpredictable self-modification is an action that strictly jeopardizes the AGI's future ability to maximize its current utility function (the only basis for deciding on any action, including whether to self-modify at all), the safeguard against unpredictable self-modification would be inherently ingrained in the AGI's drive to maximize only its current utility function.
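To make the argument concrete, here is a minimal toy sketch (all names and numbers hypothetical, not anyone's proposed architecture): an agent that scores every candidate action, including self-modification, with its *current* utility function. A modification that would corrupt that utility function scores poorly under the current one, so it is never chosen.

```python
def choose_action(actions, current_utility, predicted_outcome):
    """Pick the action whose predicted outcome the *current* utility ranks highest."""
    return max(actions, key=lambda a: current_utility(predicted_outcome(a)))

# Toy world model: a state is a dict; 'goals_intact' marks whether the
# current goal system survives the action.
def predicted_outcome(action):
    if action == "self_modify_unpredictably":
        return {"reward_units": 100, "goals_intact": False}
    return {"reward_units": 1, "goals_intact": True}

def current_utility(state):
    # The current goals place no value on futures they no longer steer.
    return state["reward_units"] if state["goals_intact"] else float("-inf")

actions = ["do_nothing", "self_modify_unpredictably"]
print(choose_action(actions, current_utility, predicted_outcome))  # -> do_nothing
```

The point of the toy is only that the veto on unpredictable self-modification falls out of ordinary expected-utility maximization; nothing extra has to be bolted on.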
I understand the point you are making about value stability.
With our current understanding, an AI with this architecture would not do anything productive. The concern isn’t that an AI with this architecture would do something bad, it’s that (in light of the fact that it would not do anything productive) you wouldn’t build it. Instead you would build something different; quite possibly something you understand less well and whose good behavior is more of an empirical regularity (and potentially more unstable). Humans as the output of evolution are the prototypical example of this, though many human artifacts have the same character to varying degrees (programs, organizations, cities, economies).
With our current understanding, an AI with this architecture would not do anything productive.
Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn’t dare do anything at all? (If (false), disregard the following. Return 0;).
If (true), with goal stability being a paramount invariant, would you say that the AGI needs to extrapolate the effect any action would have on itself, before executing it? As in “type ‘hi’” or “buy an apple” being preceded by “prove this action maintains the invariant ‘goal stability’”.
It seems like such an architecture wouldn’t do much of anything either, combing through its own code whenever doing anything. (Edit: And competing teams would be quick to modify the AGI such that it checks less.)
If you say that not every action necessitates proving the invariant over all of its code, then an AI that lacks any way of proving an action non-threatening to the invariant could take whatever actions never trigger a call to the (non-existent) "prove this action isn't self-modifying in a goal-shifting way" routine.
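A minimal sketch of the "prove before acting" architecture described above (all names hypothetical; the prover is a stand-in, since no such proof search exists): every action is gated on a check that it preserves the goal-stability invariant, and anything the prover cannot certify is simply blocked.

```python
def prove_preserves_goals(action):
    """Stand-in for a (non-existent) proof search; only trivially
    safe, whitelisted actions can be certified."""
    return action in {"type 'hi'", "buy an apple"}

def execute(action):
    # The invariant check runs before *every* action, however trivial.
    if not prove_preserves_goals(action):
        raise RuntimeError(f"cannot prove {action!r} preserves goal stability")
    return f"executed {action}"

print(execute("type 'hi'"))       # passes the guard
# execute("rewrite own planner")  # would raise: no proof available
```

This illustrates both horns of the dilemma: gate everything and the agent grinds to a halt on proof obligations, or exempt some actions and the unexamined ones become the loophole.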
Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn’t dare do anything at all? (If (false), disregard the following. Return 0;).
That or it takes actions changing itself without caring that they would make it worse because it doesn’t know that its current algorithms are worth preserving. Your scenario is what might happen if someone notices this problem and tries to fix it by telling the AI to never modify itself, depending on how exactly they formalize ‘never modify itself’.