Agreed. When I wrote U(π) I meant it as shorthand for U(x)|π, though looking at it again I can see that notation conflates reward and utility in a confusing way.
That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies.
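To make the "accidental" part concrete, here's a minimal toy sketch (names like `POLICY_SLOT` and `plan` are hypothetical illustrations, not from anything above): a greedy planner whose own policy weight sits inside the very world state it optimizes over, so it eventually overwrites that weight simply because doing so raises reward.

```python
# Toy sketch of "accidental" self-modification by an embedded agent.
# All names here are made up for illustration.

WORLD_SIZE = 4
POLICY_SLOT = 2          # the agent's policy weight lives *inside* the world state

def step(state, action):
    """Environment dynamics: an action writes a value into one world cell."""
    cell, value = action
    new_state = list(state)
    new_state[cell] = value   # nothing stops cell == POLICY_SLOT
    return new_state

def reward(state):
    # Reward happens to be highest when every cell is maxed out,
    # including the cell that stores the policy weight.
    return sum(state)

def plan(state):
    """Greedy one-step planner that is blind to its own embedding:
    it scores actions only by reward, never by whether they clobber
    the policy slot."""
    candidates = [(cell, 9) for cell in range(WORLD_SIZE)]
    return max(candidates, key=lambda a: reward(step(state, a)))

state = [0, 0, 5, 0]      # policy weight currently 5, stored at POLICY_SLOT
for _ in range(4):
    action = plan(state)
    state = step(state, action)
    print(action, state)
# On the final step the planner writes into POLICY_SLOT because that
# maximizes reward -- it has "self-modified" without representing that fact.
```

The point is just that nothing in `plan` distinguishes the policy cell from any other cell, so the self-modification is a side effect of optimization rather than a represented choice.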
That makes sense now, although I am still curious whether there is a case where it purposely self-modifies rather than doing so accidentally.