Existing models of agency from fields like reinforcement learning and game theory don’t seem up to the job, so trying to develop better ones might pay off.
One account of why our usual models of agency aren’t up to the job is the Embedded Agency sequence: the usual models assume agents are unchanging, indivisible entities that interact with their environments through predefined channels, whereas real-world agents are part of their environment. The sequence identifies four rough categories of problems that arise when we switch to trying to model embedded agents, explained in terms of Marcus Hutter’s model of the theoretically perfect reinforcement learning agent, AIXI.
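To make the “predefined channels” assumption concrete, here is a minimal toy sketch (illustrative class and function names, not anything from the sequence itself) of the dualistic agent/environment interface that AIXI and standard RL formalisms take for granted:

```python
class Environment:
    """A toy 'world': a counter that the agent nudges up or down."""
    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> tuple[int, float]:
        # The agent can only influence the world through `action`, and only
        # learn about it through the returned (observation, reward) pair.
        self.state += action
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward


class Agent:
    """A toy agent that lives *outside* the environment's state."""
    def act(self, observation: int, reward: float) -> int:
        return -1 if observation > 0 else 1


def run(steps: int = 10) -> None:
    env, agent = Environment(), Agent()
    observation, reward = env.step(0)
    for _ in range(steps):
        observation, reward = env.step(agent.act(observation, reward))


run()
```

An embedded agent breaks this picture: it is implemented inside the environment’s state, so the world can modify, copy, or inspect the agent, and there is no privileged boundary where the two only touch via `action` and `observation`.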
Reinforcement learning and game theory are mathematical formalisms/frameworks (and some associated algorithms, such as many specific RL learning algorithms), but they are not science in themselves. “Embedded Agency” basically says “Let’s do cognitive science rather than develop mathematical formalisms in isolation from science”.
Then there is a question of whether we want to predict something about any intelligent[1] system (an extremely general theory of cognition/agency), so that our predictions (or the ensuing process frameworks of alignment) are robust to paradigm shifts in ML/AI, perhaps even during the recursive self-improvement phase, or whether we want to prove something about intelligent systems engineered in a particular way.
For the first purpose, I know of three theories/frameworks that are probably not worse than any of the theories that you mentioned (the core objectives of the first and third are written out just after this list):
Free Energy Principle, Ramstead et al., 2023 (the latest and most up-to-date overview)
Thermodynamic ML, Boyd et al., 2022
Maximal Coding Rate Reduction Principle (MCR^2), Ma et al., 2022
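For concreteness, here are the core quantities of the first and third of these, written from their standard formulations (consult the cited papers for the exact conventions and symbol definitions). The FEP says self-organising systems minimise a variational free energy over beliefs $q(s)$ about hidden states $s$ given observations $o$:

$$F(q, o) \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right] \;=\; D_{\mathrm{KL}}\!\left(q(s)\,\|\,p(s \mid o)\right) - \ln p(o),$$

so minimising $F$ over beliefs approximates Bayesian inference, and minimising expected free energy over actions yields the behaviour the theory describes. MCR^2 instead says good representations $Z \in \mathbb{R}^{d \times n}$ (with membership matrices $\Pi_j$) maximise the coding rate reduction

$$\Delta R(Z, \Pi, \epsilon) \;=\; \frac{1}{2}\log\det\!\left(I + \frac{d}{n\epsilon^2} Z Z^{\top}\right) \;-\; \sum_{j} \frac{\operatorname{tr}(\Pi_j)}{2n}\log\det\!\left(I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^2} Z \Pi_j Z^{\top}\right).$$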
For the second purpose, there is nothing wrong with building an AI as an RL system (for example) and then basing a process framework of alignment (corrigibility, control, etc.) for it exactly on the RL formalism, because the system was specifically built to conform to it. From this perspective, RL, game theory, control theory, H-JEPA, Constitutional AI for LLMs, and many other theories, formalisms, architectures, and algorithms will be instrumental to developing a safe AI in practice. Cf. “For alignment, we should simultaneously use multiple theories of cognition and value”.
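As a toy illustration of what basing the alignment framework on the formalism the system was built to conform to might look like (the names, the shutdown setup, and the property below are entirely hypothetical, not taken from the linked post): if the system really is an RL policy over a known state/action space, a control property can at least be stated and checked directly in those terms.

```python
from typing import Callable

# Hypothetical toy formalisation: a state is (world_state, shutdown_requested),
# and SHUTDOWN is a distinguished action the policy is supposed to take
# whenever shutdown has been requested.
State = tuple[int, bool]
Action = str
SHUTDOWN: Action = "shutdown"


def is_shutdown_compliant(policy: Callable[[State], Action],
                          states: list[State]) -> bool:
    """The property, stated purely in the RL formalism: on every state where
    shutdown is requested, the policy selects the SHUTDOWN action."""
    return all(policy(s) == SHUTDOWN for s in states if s[1])


def toy_policy(state: State) -> Action:
    world, shutdown_requested = state
    return SHUTDOWN if shutdown_requested else "act"


print(is_shutdown_compliant(toy_policy, [(0, False), (3, True), (7, True)]))  # True
```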
[1] Often these theories extend to describe not just ‘intelligent’ systems, whatever that means, but any non-equilibrium, adaptive system (or postulate that any non-equilibrium system is intelligent, in some sense).