Quick summary of a major takeaway from Reward is not the optimization target:
Stop thinking about whether the reward “represents what we want”, or focusing overmuch on whether agents will “optimize the reward function.” Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do those updates affect the AI’s internal computations and decision-making?
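As a toy illustration of what “the reward affects the AI via the gradient updates” looks like mechanistically (a minimal sketch; the linear policy, feature sizes, and learning rate are made up for this example, not taken from the post): in a REINFORCE-style update, the reward never appears as something the policy represents or pursues. It is just a scalar that rescales the gradient of the log-probability of the action actually taken, reinforcing whichever internal computations produced that action.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))  # tiny linear-softmax policy: 4 features -> 2 actions

def policy(theta, state):
    """Action probabilities from a linear-softmax policy."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(theta, state, action, reward, lr=0.1):
    """One REINFORCE step: grad of log pi(action | state), rescaled by the reward.

    The reward enters only as a scalar multiplier on the update; the policy
    itself never evaluates or queries the reward function.
    """
    probs = policy(theta, state)
    grad_log_pi = np.outer(state, np.eye(2)[action] - probs)  # d log pi(a|s) / d theta
    return theta + lr * reward * grad_log_pi

state = rng.normal(size=4)
action = rng.choice(2, p=policy(theta, state))
reward = 1.0  # supplied by the environment after the action
theta = reinforce_update(theta, state, action, reward)
```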
Are there different classes of learning systems that optimize for the reward in different ways?
Yes. Model-based approaches, model-free approaches (with or without a critic), AIXI: all of these should be analyzed in terms of their mechanistic details.
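A minimal sketch of that contrast (the function names and the toy two-state chain are illustrative, not drawn from any particular codebase): a model-free TD learner only touches reward as a number inside its update rule, while a model-based planner explicitly queries a reward model inside its own decision-time computation. The two classes interact with reward through mechanistically different pathways, which is why they should be analyzed separately.

```python
import numpy as np

def model_free_td_update(value, state, next_state, reward, lr=0.1, gamma=0.99):
    """Tabular TD(0): reward shows up only as a scalar inside the update target."""
    td_target = reward + gamma * value[next_state]
    value[state] += lr * (td_target - value[state])
    return value

def model_based_plan(state, transition_model, reward_model, actions, gamma=0.99, depth=2):
    """Finite-horizon lookahead: the planner explicitly evaluates the reward
    model for each candidate action as part of its own computation."""
    def rollout(s, d):
        if d == 0:
            return 0.0
        return max(reward_model(s, a) + gamma * rollout(transition_model(s, a), d - 1)
                   for a in actions)
    return max(actions,
               key=lambda a: reward_model(state, a)
               + gamma * rollout(transition_model(state, a), depth - 1))

# Toy two-state chain: only "go" from state 0 is rewarded.
transition = lambda s, a: 1 if a == "go" else s
reward_fn = lambda s, a: 1.0 if (s, a) == (0, "go") else 0.0

values = model_free_td_update(np.zeros(2), state=0, next_state=1, reward=1.0)
best_action = model_based_plan(0, transition, reward_fn, actions=["stay", "go"])
```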