I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
I’m not sure what’s unsatisfying about the characterization I gave. If we just redefine “optimizer” to mean an interpretation of the agent’s behavior (specifically, that it abstractly pursues goals), why is that an unsatisfying way of exhibiting the mesa != base issue?
ETA:
The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves.
I agree. The relevance is this: for future systems that might exhibit malign generalization, we would want some model of how goals figure in their architecture, because that could help us align them. But until we have such architectures, or at least models of how those future systems should behave, we should work abstractly.