I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)
One of my (TurnTrout’s) reasons for alignment optimism is that I think:
We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
(Although this amount of information depends on how much interpretability and agent-internals theory we do now)
All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
It’s crucial to get early-training value shards of which a substantial fraction are “human-compatible values” (whatever that means)
For example, we want protect-human shards which reliably bid against plans where people get hurt and which steer deliberation away from such plan stubs, and we want these shards to be “reflectively endorsed” by the overall shard economy (i.e. the decision-making isn’t steering towards plans where the protect-human shards get removed).
If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training, where we can’t affect the ball game very much (e.g. due to alien abstractions, interpretability problems, and our inability to oversee the AI’s complicated plans).
Therefore it seems very important to understand what’s going on with “shard game theory” (or whatever those intuitions are pointing at) -- when, why, and how will early decision-influences be retained?
He was talking about viewing the addition of new hypotheses as adding traders to a market (in the sense of logical induction). Usually traders are viewed as hypotheses, but possibly you can also view them as having values, since a trader can basically be any computation. You’d then want a different market resolution mechanism than a deductive process revealing the truth or falsity of some proposition under some axioms: you want a way for traders to bid on actions.
I proposed a setup like:
Maybe you could have an “action” instead of a proposition, where the action comes out as 1 or 0 depending on a function of the market’s position on that action at a given time; this possibly leads to fixed points for every possible resolution.
For example, if all the traders hold a1 as YES, then a1 actually does come out as YES. And if, e.g., a trader T1 “wants” all the even-numbered actions and T2 wants all the multiple-of-10 actions (a10, a20, ...), they can “bargain” by bidding up each other’s actions whenever they have spare power, and thereby “value handshake.”
Over time, traders who do this should take up more and more market share relative to those who don’t exploit gains from trade.
The final trader coalition should depend strongly on the initial composition of market share. And it seems like some version of this should be able to model self-reflective value drift. You can think of action resolution and payout as a kind of reward event in which certain kinds of shards get reinforced: a trader that bid for an action which happens and leads to reward gets reinforced (supporting traders receive payouts), and the more you supported the action (bid for it), the more responsible your support was for the event, and so the larger the strengthening.
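Concretely, here is a minimal sketch of the kind of toy model I have in mind. Everything in it is my own illustration: the trader names, the payout numbers, and especially the resolution rule, which replaces the fixed-point idea above with a crude wealth-weighted majority threshold.

```python
# Toy "action market" sketch (my own illustration; names, the majority-threshold
# resolution rule, and the payout numbers are all made up). Traders bid on
# candidate actions a1, a2, ...; an action resolves YES when the wealth-weighted
# support for it exceeds half of total market wealth. When an action resolves,
# a fixed payout is split among its supporters in proportion to the support they
# supplied -- a crude stand-in for "reward reinforces the shards that bid for the
# action that actually happened."

def t1_wants(a): return a % 2 == 0    # T1 "wants" even-numbered actions
def t2_wants(a): return a % 10 == 0   # T2 "wants" multiple-of-10 actions
def t3_wants(a): return a % 7 == 0    # T3 "wants" multiple-of-7 actions, no cooperation

# T1 and T2 "value handshake": each also bids up the other's preferred actions.
def t1_bid(a): return 1.0 if (t1_wants(a) or t2_wants(a)) else 0.0
def t2_bid(a): return 1.0 if (t2_wants(a) or t1_wants(a)) else 0.0
def t3_bid(a): return 1.0 if t3_wants(a) else 0.0

bids = {"T1": t1_bid, "T2": t2_bid, "T3": t3_bid}
wealth = {"T1": 1.0, "T2": 1.0, "T3": 1.0}    # initial market share
PAYOUT = 1.0                                   # reward per resolved action

for a in range(1, 101):                        # actions a1 ... a100
    support = {name: wealth[name] * bid(a) for name, bid in bids.items()}
    backed = sum(support.values())
    if backed > 0.5 * sum(wealth.values()):    # action resolves YES (a "reward event")
        for name, s in support.items():
            # The more of the winning support you supplied, the larger your share
            # of the payout -- i.e. the stronger the reinforcement.
            wealth[name] += PAYOUT * s / backed

print({name: round(w, 2) for name, w in wealth.items()})
# The cooperating traders T1 and T2 steadily gain market share; the lone T3 gains
# almost nothing, since its solo bids never clear the resolution threshold.
```

A real version would need the genuinely reflexive resolution rule (the action coming out as a fixed point of the market’s own position on it) rather than this one-shot threshold, but even this crude dynamic shows the “cooperators accumulate market share” effect.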
Abram seemed to think that there might exist a nice result like: “Given a coalition of traders with values X, Y, Z which satisfies properties A, B, and C, the coalition will shape future training and trader-addition in a way which accords with the X/Y/Z values, up to [some reasonably tight trader-subjective regret bound].”
What this would tell us is when trader coalitions can bargain / value handshake / self-trust and navigate value drift properly. This seems super important for understanding what happens, long-term, as the AI’s initial value shards equilibrate into a reflectively stable utility function; even if we know how to get human-compatible values into a system, we also have to ensure they stay and keep influencing decision-making. And possibly this theorem would solve ethical reflection (e.g. the thing people do when they consider whether utilitarianism accords with their current intuitions).
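To give a rough sense of the shape such a result could take, here is one very loose guess at a formalization. This is entirely my own; the regret notion and every symbol below are assumptions, not anything Abram stated.

```latex
% Loose sketch only. \mathcal{C} is a coalition of traders whose values combine into
% a preference u_{\mathcal{C}} over actions; a_t is the action the market actually
% resolves at step t; a_t^{\mathcal{C}} is the action \mathcal{C} would most have
% preferred among those it had enough market share to enforce at step t.
\[
\text{If } \mathcal{C} \text{ satisfies properties } A, B, C, \text{ then } \quad
\sum_{t=1}^{T} \Big( u_{\mathcal{C}}\big(a_t^{\mathcal{C}}\big) - u_{\mathcal{C}}(a_t) \Big)
\;\le\; R_{\mathcal{C}}(T), \qquad R_{\mathcal{C}}(T) = o(T).
\]
```

That is, the coalition’s trader-subjective regret (how much better it could have done by its own lights) grows sublinearly, so its values keep steering action resolution, and hence future training and trader-addition, in the long run.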
Issues include:
Somehow this has to confront Rice’s theorem for adding new traders to a coalition. What strategies would be good?
I think “inspect arbitrary new traders in arbitrary situations” is not really how value drift works, but that may be contingent on internal capability jumps in SGD.
The key question isn’t whether we can predict those value drift events, but whether the coalition can.
E.g. the agent keeps training and is surprised to find that an update knocks out most of the human-compatible values.
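To gesture at why the Rice’s-theorem worry above bites, here is a small sketch. It is my own illustration; `make_suspicious_trader` and the toy step-bounded interpreter are made up for the example. Even the simple screening question “does this candidate trader ever bid against the protect-human action?” encodes the halting problem.

```python
# Sketch (my own illustration) of why vetting arbitrary new traders hits the
# halting problem. "Programs" are modeled as Python generators purely so the
# example runs; a real reduction would use a step-bounded universal interpreter.

def simulate_for_steps(prog, x, max_steps):
    """Advance the generator prog(x) at most max_steps times; report whether it
    finished ("halted") within that budget."""
    gen = prog(x)
    for _ in range(max_steps):
        try:
            next(gen)
        except StopIteration:
            return "halted"
    return "still running"

def make_suspicious_trader(prog, x):
    """Build a trader (market step -> bid on a hypothetical protect-human action)
    that looks harmless unless/until prog halts on x."""
    def trader(step):
        if simulate_for_steps(prog, x, max_steps=step) == "halted":
            return -1.0      # bid AGAINST protecting humans
        return 0.0           # otherwise look perfectly innocent
    return trader

def halts_quickly(x):
    yield                    # one step of "work", then halt

def loops_forever(x):
    while True:
        yield

bad = make_suspicious_trader(halts_quickly, 0)
subtle = make_suspicious_trader(loops_forever, 0)
print(bad(5), subtle(5))     # -1.0 0.0

# If some procedure is_safe(trader) correctly decided "never bids against the
# protect-human action" for every trader, then is_safe(make_suspicious_trader(p, x))
# would decide whether p halts on x -- so no such general procedure exists, and any
# realistic vetting of new traders has to be partial (restricted trader classes,
# runtime monitoring, influence caps) rather than full behavioral prediction.
```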
Knowing the right definitions might be contingent on understanding more shard theory (or whatever shard theory should be, for AI, if that’s not the right frame).
Possibly this is still underspecified and the modeling assumptions can’t properly capture what I want; maybe the properties I want are mutually exclusive. But it seems like that shouldn’t be true.
ETA: this doesn’t model the contextual activation of values, which is a centerpiece of shard theory.
One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can’t get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.
Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there are possibly changes which Goodhart all of the shards simultaneously. Indeed, I’d expect that to be a pretty strong default outcome.
Even on the view you advocate here (where some kind of perfection is required), “perfectly align part of the motivations” seems substantially easier than “perfectly align all of the AI’s optimization so it isn’t optimizing for anything you don’t want.”
> If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there are possibly changes which Goodhart all of the shards simultaneously. Indeed, I’d expect that to be a pretty strong default outcome.
I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.
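A minimal toy version of the veto argument and the Goodhart worry above (entirely my own construction; the feature names, the “shard” utilities, and the numbers are made up): with one basically-perfectly aligned shard holding veto power, a bad change is vetoed and true value is protected, whereas with two imperfect proxy shards the same change raises both proxies while tanking true value, i.e. it Goodharts all the shards at once.

```python
# Toy illustration (my own construction): outcomes are 2-D feature vectors, and a
# proposed change is accepted only if NO shard's utility decreases (per-shard veto,
# as in the Why Subagents? setup).

def accepted(change, shards, state):
    new_state = [s + d for s, d in zip(state, change)]
    return all(shard(new_state) >= shard(state) for shard in shards)

# Features: x[0] = "humans flourishing", x[1] = "paperclips produced".
def true_value(x): return x[0]

state = [1.0, 0.0]
change = [-0.5, +10.0]   # hurts humans a bit, makes lots of paperclips

# Case 1: one basically-perfectly aligned shard with veto power.
def aligned_shard(x): return x[0]
print(accepted(change, [aligned_shard], state))     # False: vetoed, true value protected

# Case 2: two imperfect proxy shards, each capturing a different aspect of human
# value but each also crediting paperclips a little (different imperfections).
def proxy_a(x): return x[0] + 0.2 * x[1]
def proxy_b(x): return 0.9 * x[0] + 0.3 * x[1]
print(accepted(change, [proxy_a, proxy_b], state))  # True: both proxies go up...

new_state = [s + d for s, d in zip(state, change)]
print(true_value(new_state) < true_value(state))    # ...True: while true value goes DOWN.
# The change Goodharts all the imperfect shards simultaneously, slipping past the veto.
```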