Thank you for your comment! It is super rare for me to get such a reasonable reaction in this community. You are awesome 👍😁
there could be a line of code in an agent-program which sets the assigned EV of outcomes premised on a probability which is either <0.0001% or ‘unknown’ to 0.
I don’t think that is possible. Could you help me understand how it would be possible? This conflicts with Recursive Self-Improvement, doesn’t it?
If it does get edited out[1] then it was just not a good example. The more general point is that for any physically-possible behavioral policy, there is a corresponding possible program which would exhibit that policy.
And as written, it could get edited out, at least because it’s slightly inefficient. I could instead have postulated it to be part of a traditional terminal value function, in which case I don’t think it does get edited out, because editing a terminal value function works against that very function, at least if the program is robust to wireheading in general.
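For concreteness, here is a minimal sketch of the kind of line being discussed, assuming a simple expected-value loop; the function name, the 0.0001% cutoff constant, and the data layout are illustrative assumptions, not something taken from either comment.

```python
# Illustrative sketch only: an expected-value calculation that discards
# outcomes whose estimated probability is unknown or below a cutoff.
from typing import Optional

PROBABILITY_CUTOFF = 1e-6  # i.e. below 0.0001% (assumed threshold)

def expected_value(outcomes: list[tuple[float, Optional[float]]]) -> float:
    """Each outcome is (value, probability); probability may be None ('unknown')."""
    total = 0.0
    for value, probability in outcomes:
        # The disputed line: treat unknown or near-impossible outcomes as worth 0.
        if probability is None or probability < PROBABILITY_CUTOFF:
            continue
        total += value * probability
    return total

# An astronomically large payoff with 'unknown' probability contributes
# nothing to the decision under this rule.
print(expected_value([(10.0, 0.5), (1e30, None), (-5.0, 1e-9)]))  # -> 5.0
```

Whether a filter like this survives self-modification, or instead gets folded into (or stripped from) the terminal value function, is exactly what is being debated here.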
OK, so using your vocabulary, I think that’s the point I want to make: alignment is a physically-impossible behavioral policy.
I elaborated a bit more here: https://www.lesswrong.com/posts/AdS3P7Afu8izj2knw/orthogonality-thesis-burden-of-proof?commentId=qoXw7Yz4xh6oPcP9i
What do you think?
Using different vocabulary doesn’t change anything (and if it seems like just vocabulary, you likely misunderstood). I had already seen that comment as well.
Afaict, I have nothing more to say here.
It seems to me that you aren’t hearing me...
I claim that the utility function is irrelevant.
You claim that the utility function could ignore improbable outcomes.
I agree with your claim. But it seems to me that your claim is not directly related to my claim. Self-preservation is not part of the utility function (it follows from instrumental convergence). How can you affect it?