If you look at the economic theories (mostly based on game theory today) that try to explain why economies are organized the way they are, and where market inefficiencies come from, they all have a fundamental dependence on the assumption of different participants having different interests/values. In other words, if you removed that assumption from the theoretical models and replaced it with the opposite assumption, they would collapse in the sense that all or most of the inefficiencies (“transaction costs”) would go away...
...With existing human institutions, a big part of the problem has to be that every participant has an incentive to distort the credit assignment (i.e., cause more credit to be assigned to oneself). (This is what I conclude from economic theory and also fits with my experience and common sense.)
I’m going to jump in briefly to respond to one line of reasoning. John says the following, and I’d like to give two examples of it from my own life.
Now, the way economists usually model credit assignment is in terms of incentives, which theoretically aren’t necessary if all the agents share a goal. On the other hand, looking at how groups work in practice, I expect that the informational role of credit assignment is at least as load-bearing as (if not more load-bearing than) the incentive-alignment role.
For instance, a price mechanism doesn’t just align incentives; it provides information for efficient production decisions, so it still makes sense to use a price mechanism even if everyone shares a single goal. If the agents share a common goal, then in theory there doesn’t need to be a price mechanism, but a price mechanism sure is an efficient way to internally allocate resources in practice.
… and now that I’m thinking about it, there’s a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms (and credit allocation more generally), even though the phenomenon does not seem like it should require different goals.
Microcovid Tax
In my group house during the early pandemic, we often spent hours each week negotiating rules about what we could and couldn’t do. We could order take-out food if we put it in the oven for 20 mins, we could go for walks outside with friends if 6 feet apart, etc. This was very costly, and tired everyone out.
We later replaced it (thanks especially to Daniel Filan for this proposal) with a microcovid tax, where each person could do as they wished, then calculate the microcovids they gathered, and pay the house $1/microcovid (this was determined by calculating everyone’s cost/life, multiplying by expected loss of life if they got covid, dividing by 1 million, then summing over all housemates).
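To make the arithmetic concrete, here is a minimal sketch of that pricing rule. The names, dollar values of life, and expected-loss-if-infected numbers are made-up assumptions for illustration; the description above only says the rule worked out to $1/microcovid for the house.

```python
def price_per_microcovid(housemates):
    """Sum, over housemates, of (dollar value placed on their life * expected
    fraction of life lost if infected) / 1,000,000, i.e. the expected cost
    imposed on the house per one-in-a-million chance of catching covid."""
    return sum(
        h["value_of_life_usd"] * h["expected_life_lost_if_infected"] / 1_000_000
        for h in housemates
    )

# Illustrative numbers only, not the house's actual figures.
housemates = [
    {"name": "A", "value_of_life_usd": 10_000_000, "expected_life_lost_if_infected": 0.01},
    {"name": "B", "value_of_life_usd": 5_000_000, "expected_life_lost_if_infected": 0.02},
]

rate = price_per_microcovid(housemates)    # dollars per microcovid
activity_microcovids = 50                  # e.g. one outdoor hangout
print(f"${rate:.2f} per microcovid")                                 # $0.20 with the numbers above
print(f"tax for this activity: ${rate * activity_microcovids:.2f}")  # $10.00
```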
This massively reduced negotiation overhead and also removed the need for norm-enforcement mechanisms. If you made a mistake, we didn’t punish you or tell you off, we just charged you the microcovid tax.
This was a situation where everyone was trusted to be completely honest about their exposures. It nonetheless made it easier for everyone to make tradeoffs in everyone else’s interests.
Paying for Resources
Sometimes within the Lightcone team, when people wish to make bids on others’ resources, they negotiate a price. If some team members want another team member to e.g. stay overtime for a meeting, move the desk they work from, change the time frame in which they’re going to get something done, or otherwise bid for the use of the other teammate’s resources, it’s common enough for someone to state a price, and then the action only goes through if both parties agree to the trade.
I don’t think this is because we all have different goals. I think it’s primarily because it’s genuinely difficult to know (a) how valuable it is to the asker and (b) how costly it is to the askee.
On some occasions I’m bidding for something that seems to me to be clearly the right call, but when the person is asked how much they’d need to be paid in order to make it worth it, they give a much higher number, and it turns out there was a hidden cost I wasn’t modeling.
If a coordination point is sticking, reducing it to a financial trade helps speed it up, by turning the hidden information into a willingness-to-pay / willingness-to-be-paid integer.
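As a minimal sketch of the trade rule implicit here (splitting the surplus at the midpoint is my own assumption; the point is only that the trade happens exactly when willingness-to-pay covers willingness-to-be-paid):

```python
from typing import Optional

def negotiate(willingness_to_pay: float, willingness_to_be_paid: float) -> Optional[float]:
    """Return an agreed price if the asker's willingness-to-pay covers the askee's
    willingness-to-be-paid, otherwise None (no trade, the action doesn't happen).
    Any price in between would also work; the midpoint is just one convention."""
    if willingness_to_pay >= willingness_to_be_paid:
        return (willingness_to_pay + willingness_to_be_paid) / 2
    return None

# The asker values the overtime meeting at $40; the askee, pricing in a hidden
# cost (a missed dinner, say), asks for $60: no trade, the meeting is moved.
print(negotiate(40, 60))   # None
# If the askee only needs $25, the action goes through at $32.50.
print(negotiate(40, 25))   # 32.5
```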
In sum
Figuring out the costs of an action in someone else’s world is detailed and costly work, and price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change this fact.
I am unclear to what extent this is a crux for the whole issue, but it does seem to me that insofar as Wei Dai believes (these are my words) “agents bending the credit-assignment toward selfish goals is the primary reason that credit assignment is difficult, and HCH resolves it by having arbitrarily many copies of the same (self-aligned) individual”, this is false.
If a coordination point is sticking, reducing it to a financial trade helps speed it up, by turning the hidden information into a willingness-to-pay / willingness-to-be-paid integer.
I don’t disagree with this. I would add that if agents aren’t aligned, then that introduces an additional inefficiency into this pricing process, because each agent now has an incentive to distort the price to benefit themselves, and this (together with information asymmetry) means some mutually profitable trades will not occur.
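A toy sketch of the kind of loss described here, where the fixed 10% shading of each report is purely an illustrative assumption: a trade that is profitable at the true value and cost fails once both sides distort their reports to capture more of the surplus.

```python
def trade_at_true_values(buyer_value, seller_cost):
    """With honest, aligned agents the trade happens whenever value exceeds cost."""
    return buyer_value >= seller_cost

def trade_with_strategic_reports(buyer_value, seller_cost, shade=0.10):
    """Misaligned agents each distort their report to grab more surplus (modeled
    here as a fixed 10% shading); with information asymmetry, neither side can
    verify the other's true number, so some mutually profitable trades fail."""
    reported_bid = buyer_value * (1 - shade)   # buyer understates their value
    reported_ask = seller_cost * (1 + shade)   # seller overstates their cost
    return reported_bid >= reported_ask

buyer_value, seller_cost = 100.0, 95.0                         # a genuinely profitable trade
print(trade_at_true_values(buyer_value, seller_cost))          # True
print(trade_with_strategic_reports(buyer_value, seller_cost))  # False: the trade is lost
```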
Figuring out the costs of an action in someone else’s world is detailed and costly work, and price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change this fact.
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I’m not super confident about this (and I’m overall pretty skeptical about HCH for this and other reasons), but I just think that John was too confident in his position in the OP, or at least hasn’t explained his position enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor
Huh, my first thought was that the depth of the tree is measured in training epochs, while width is cheaper, since HCH is just one model and going much deeper amounts to running more training epochs. But how deep we effectively go depends on how robust the model is to the particular prompts that occur on that path in the tree, and there could be a way to decide whether to run a request explicitly, unwinding another level of the subtree as multiple instances of the model (deliberation/reflection), or to answer it immediately, with a single instance, relying on what’s already in the model (intuition/babble). This way, the effective depth of the tree available at the current epoch’s level of performance could extend further, so the effect of learning effort on performance would increase.
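Concretely, that expand-or-answer decision might look something like the sketch below; the ToyModel interface (robustness / decompose / answer), the threshold, and the depth budget are all hypothetical stand-ins rather than a concrete proposal.

```python
class ToyModel:
    """Stand-in for the trained model; all three methods are hypothetical stubs."""
    def robustness(self, prompt):
        # How robust/familiar the model is on this prompt (stub: short prompts are familiar).
        return 1.0 if len(prompt) < 40 else 0.5
    def decompose(self, prompt):
        # Break an unfamiliar question into sub-questions (stub: split it in half).
        return [prompt[: len(prompt) // 2], prompt[len(prompt) // 2 :]]
    def answer(self, prompt, context=None):
        return f"answer({prompt!r}, using {len(context or [])} sub-answers)"

def hch_answer(prompt, model, depth_budget, robustness_threshold=0.9):
    # Familiar prompts (robustness above the threshold) are answered immediately
    # by a single instance: intuition/babble.
    if depth_budget == 0 or model.robustness(prompt) >= robustness_threshold:
        return model.answer(prompt)
    # Less familiar but still comprehensible prompts unwind the subtree one more
    # level, consulting multiple instances of the same model: deliberation/reflection.
    subanswers = [hch_answer(q, model, depth_budget - 1, robustness_threshold)
                  for q in model.decompose(prompt)]
    return model.answer(prompt, context=subanswers)

print(hch_answer("a question long enough to warrant unwinding one level of the tree",
                 ToyModel(), depth_budget=2))
```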
This decision mirrors what happens at the goodhart boundary pretty well (there, you don’t allow incomprehensible/misleading prompts that are outside the boundary), but the decision here will be further from the boundary (very familiar prompts can be answered immediately, while less familiar but still comprehensible prompts motivate unwinding the subtree by another level, implicitly creating more training data to improve robustness on those prompts).
The intuitive answers that don’t require deliberation are close to the center of the concept of aligned behavior, while incomprehensible situations in the crash space are where the concept (in current understanding) fails to apply. So this is another reason to associate robustness with the goodhart boundary and to treat the boundary as a robustness threshold, since this gives centrally aligned behavior as occurring in situations where the model’s robustness is above another threshold.
(I have added the point I wanted to add to this conversation, and will tap out now.)