Do you feel your agenda will allow us to formalise the idea of “don’t hack the agent who provides your reward signal” in some way? Every attempt I’ve seen has either failed or been too restrictive.
The AI treats only the user’s timeline prior to the AI’s creation as specifying the loss function. Since it cannot change the past[1], manipulating the user after creation cannot alter which loss it is evaluated against (see the sketch below).
[1] Unless time travel is possible. I haven’t thought through the implications of time travel, but time travel seems sufficiently unlikely that handling that scenario is a “luxury”.
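A minimal sketch of one way this might be formalised, in my own notation (the symbols $t_0$, $h$, $f$, $\tau$, and $\pi$ are assumptions for illustration, not the author’s):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Sketch: the loss *function* is fixed by pre-creation data only.
Let $t_0$ be the AI's creation time, $h_{<t_0}$ the user's history
up to $t_0$, and $\tau(\pi)$ the post-creation trajectory induced
by policy $\pi$. Suppose the loss factors as
\[
  L(\pi) = f_{h_{<t_0}}\bigl(\tau(\pi)\bigr),
\]
where the loss function $f_{h_{<t_0}}$ is determined by
pre-creation data alone. No action taken at $t \ge t_0$ causally
affects $h_{<t_0}$ (absent time travel), so $\pi$ can change what
the loss evaluates, $\tau(\pi)$, but not which loss is applied,
$f_{h_{<t_0}}$: manipulating the user after $t_0$ cannot improve
$L$.
\end{document}
```

On this reading, “don’t hack the agent who provides your reward signal” falls out of the causal structure rather than being imposed as an extra constraint on the policy.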