I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers.
I think it’s fairly easy to point out the problem using an alternative definition. If we just change the definition of mesa optimizer to reflect that we’re using the intentional stance (in other words, we’re interpreting the neural network as having goals, whether it’s using an internal search or not), the mesa != base description falls right out, and all the normal risks about building mesa optimizers still apply.
I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
I’m not sure what’s unsatisfying about the characterization I gave. If we just redefine optimizer to mean an interpretation of the agent’s behavior (specifically, that it abstractly pursues goals), why is that an unsatisfying way of showing the mesa != base issue?
ETA:
The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves.
I agree. And the role this plays is that, for future systems that might experience malign generalization, we would want some model of how goals figure in their architecture, because this could help us align the system. But until we have such architectures, or until we have models for how those future systems should behave, we should work abstractly.