It seems like your post/comment history (going back to a post 12 months ago) is mostly about this possibility (paraphrased, slightly steelmanned):
1. A value-function maximizer considers that there might be an action unknown to it which would return an arbitrarily large value (and, in this space of possible actions, the probability does not decrease as the value increases).
2. Some such possible actions would immediately require some amount of resources to be performed.
3. The maximizer instrumentally converges, but never uses the resources it gains, unless doing so wouldn’t at all reduce its ability to perform such an action (were the possibility of such an action somehow discovered).
In a comment on one of your other posts you asked,
Could you help me understand how [benevolent AI] is possible?
I suspect you are thinking in a frame where ‘agents’ are a fundamental thing, and you have a specific idea of how a logical agent will behave, such that you think those above steps are inevitable. My suggestion is to keep in mind that ‘agent’ is an abstraction describing a large space of possible programs, whose inner details vary greatly.
As a simple, contrived example, meant to demonstrate easily, without first needing to introduce other premises, the possibility of an agent which doesn’t follow those three steps: there could be a line of code in an agent-program which sets to 0 the assigned EV of any outcome premised on a probability that is either <0.0001% or ‘unknown’.
(I consider this explanation sufficient to infer a lot of details—ideally you realize agents are made of code (or ‘not technically code, but still structure-under-physics’, as with brains) and that code can be anything specifiable.)
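To make that contrived example concrete, here is a minimal sketch (the routine, names, and threshold are purely illustrative assumptions, not a claim about how any real agent is implemented):

```python
# Minimal sketch (hypothetical agent internals): an expected-value routine in
# which one line zeroes out any outcome whose estimated probability is below a
# small threshold or simply unknown.

PROB_FLOOR = 1e-6  # 0.0001% expressed as a fraction

def expected_value(outcomes):
    """outcomes: list of (probability_or_None, value) pairs."""
    ev = 0.0
    for prob, value in outcomes:
        # The contrived line: improbable or unknown-probability outcomes
        # contribute nothing, so unboundedly large speculative payoffs
        # cannot dominate the sum.
        if prob is None or prob < PROB_FLOOR:
            continue
        ev += prob * value
    return ev

# Example: a mundane outcome vs. two speculative astronomically-valued ones.
print(expected_value([(0.9, 10.0), (None, 1e30), (1e-9, 1e30)]))  # -> 9.0
```

The point is only that such a cutoff is a specifiable piece of program structure; nothing about ‘being an agent’ rules it out.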
A better response might get into the details of why particular agent-structures might or might not do that by default, i.e. without such a contrived prevention (or contrived causer). I think this could be done with current knowledge, but I don’t feel able to write clearly about the required premises in this context.
(Also, to be clear, I think this is an interesting possibility worth thinking about. It seems like OP is unable to phrase it in a clear/simple form like (I hope) I have above. E.g., their first section under the header ‘Paperclip maximizer’ is a conclusion they derive from the believed inevitability of those three steps, but without context it at first reads like a basic lack of awareness of instrumental subgoals.)
Thank you for your comment! It is super rare for me to get such a reasonable reaction in this community, you are awesome 👍😁
there could be a line of code in an agent-program which sets to 0 the assigned EV of any outcome premised on a probability that is either <0.0001% or ‘unknown’.
I don’t think it is possible. Could you help me understand how this is possible? This conflicts with Recursive Self-Improvement, doesn’t it?
If it does get edited out[1] then it was just not a good example. The more general point is that for any physically-possible behavioral policy, there is a corresponding possible program which would exhibit that policy.
[1] And it could get edited out as written, at least because it’s slightly inefficient. I could instead have postulated it to be part of a traditional terminal value function, in which case I don’t think it does get edited out: editing a terminal value function is contrary to that function, provided the program is robust to wireheading in general.
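To make the more general point concrete, here is a toy sketch (entirely illustrative; the observation and action names are made up): any finite behavioral policy can be written down as an explicit observation-to-action table, and a trivial program that reads off that table exhibits exactly that policy.

```python
# Toy sketch (illustrative only): a behavioral policy as an explicit
# observation -> action table, plus a trivial program exhibiting it.
policy_table = {
    "low_battery": "recharge",
    "resources_available": "acquire_resources",
    "speculative_huge_payoff": "do_nothing",  # this policy simply never pursues it
}

def act(observation: str) -> str:
    # Look up the tabulated action; default to doing nothing.
    return policy_table.get(observation, "do_nothing")

print(act("speculative_huge_payoff"))  # -> "do_nothing"
```

Whether such a policy would arise by default is a separate question; the sketch only shows that some possible program exhibits it.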
OK, so using your vocabulary, I think that’s the point I want to make: alignment is a physically-impossible behavioral policy.
I elaborated a bit more there: https://www.lesswrong.com/posts/AdS3P7Afu8izj2knw/orthogonality-thesis-burden-of-proof?commentId=qoXw7Yz4xh6oPcP9i
What do you think?
Using different vocabulary doesn’t change anything (and if it seems like just vocabulary, you likely misunderstood). I had also seen that comment already.
Afaict, I have nothing more to say here.
It seems to me that you don’t hear me...
I claim that the utility function is irrelevant.
You claim that the utility function could ignore improbable outcomes.
I agree with your claim. But it seems to me that your claim is not directly related to my claim. Self-preservation is not part of the utility function (it comes from instrumental convergence). How can you affect it?