What’s the type signature of goals?
The type signature of goals is the overarching topic to which this post contributes. It can manifest in a lot of different ways in specific applications:
- What’s the type signature of human values?
- What types of structure should systems biologists or microscope AI researchers look for in supposedly-goal-oriented biological or ML systems?
- Will AI be “goal-oriented”, and what would be the type signature of its “goal”?
If we want to “align AI with human values”, build ML interpretability tools, etc., then that’s going to be pretty tough if we don’t even know what-kind-of-thing we’re looking for. When we don’t know what-kind-of-things we’re talking about, our analysis risks being completely confused—like trying to subtract 3 from “cookie”, or measure the angular momentum of life satisfaction.
The traditional go-to answer for these type signatures is “expected utility over world-states”: we have a “utility function” mapping world-states (inputs) to real numbers (outputs), and we average utility values over some distribution on world-states.
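To pin down that type signature concretely, here is a minimal sketch; the toy states, names, and numbers are mine, purely for illustration. The “utility function” sends each world-state to a single real number, and expected utility averages those numbers under a distribution over world-states.

```python
from typing import Callable, Dict

WorldState = str  # stand-in "input" type, purely for illustration
Utility = Callable[[WorldState], float]  # the "output" is a single real number

def expected_utility(u: Utility, dist: Dict[WorldState, float]) -> float:
    """Average utility values over a distribution on world-states."""
    return sum(p * u(s) for s, p in dist.items())

# Toy example: two world-states under a 70/30 distribution.
u: Utility = lambda s: 1.0 if s == "cake" else 0.0
print(expected_utility(u, {"cake": 0.7, "no cake": 0.3}))  # 0.7
```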
This go-to answer feels confused, in multiple ways, both in theory and in practice. On the theory side, this review outlines some problems. On the practice side, “Why Subagents?” mentions that markets are not expected utility maximizers, Kaj’s sequence talks about many of the ways in which humans seem to be made of subagents rather than one monolithic utility maximizer, and of course there are also other problems which aren’t about subagents.
What Part Of The Problem Do Subagents Address?
What are the inputs to human values, and what are the outputs? That’s a reasonable formulation of the type-signature question, in the context of human values.
I consider the “utility function on world-states” answer confused on both parts of the question—inputs and outputs. Subagents address half of that problem: the outputs half. “Why Subagents?” argues that the outputs should be, not one real number, but a set of real numbers, representing the utilities of subagents.
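One way to read that claim in type-signature terms, purely as my own sketch rather than notation from the post: the output becomes a tuple of reals, one per subagent, and “prefers A over B” becomes Pareto dominance over those tuples. That relation is only a partial order, so some pairs of world-states come out incomparable.

```python
from typing import Callable, Tuple

WorldState = str  # stand-in "input" type, as before
SubagentUtilities = Callable[[WorldState], Tuple[float, ...]]  # one real number per subagent

def prefers(us: SubagentUtilities, a: WorldState, b: WorldState) -> bool:
    """Pareto dominance: every subagent weakly prefers a to b, and at least one strictly does."""
    ua, ub = us(a), us(b)
    return all(x >= y for x, y in zip(ua, ub)) and any(x > y for x, y in zip(ua, ub))

# Toy two-subagent committee: neither state dominates the other, so the pair is incomparable.
us: SubagentUtilities = lambda s: (1.0, 0.0) if s == "mushroom" else (0.0, 1.0)
print(prefers(us, "mushroom", "pepperoni"))  # False
print(prefers(us, "pepperoni", "mushroom"))  # False
```

The incomparable pair is exactly where this picture diverges from a single monolithic utility maximizer.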
(The other half of the problem—inputs to values—I consider harder, and it’s the inputs part which is involved in the most “conceptually difficult” problems of alignment. My current best answer is that the inputs to human values are latent variables in a human’s world-model. This provides a clean and intuitive formalization of hairy conceptual/philosophical problems in alignment; see here for more on that.)
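In type-signature terms, that claim would look something like the sketch below; LatentState, WorldModel, and the rest are placeholder names of mine, not anything formalized in the linked post.

```python
from typing import Callable, Dict, Tuple

Observation = str                      # what the human actually observes
LatentState = Dict[str, float]         # hypothetical: inferred variables in the human's world-model
WorldModel = Callable[[Observation], LatentState]    # inference: observations -> latent variables
Values = Callable[[LatentState], Tuple[float, ...]]  # values score latents, not raw world-states
```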
Dangling Threads
There are still a lot of dangling threads. On the theory side, “Why Subagents?” only talks about deterministic preferences, which is dramatically easier than the probabilistic version we really care about. Ideally, we’d like a coherence theorem for that probabilistic setting. On the empirical side, we’d really like to know whether there are subagents embedded in e.g. trained neural networks or bacteria. These empirical investigations will eventually be the real test, but we probably need more work on the theory side before we’re ready for the empirical component. Subagents are only half of the type signature, and it’s a coherence theorem (or something analogous) which would tell us how to look for these structures embedded in real-world systems.