Oh that’s really interesting. I did a dive into theory of the firm research a couple years ago (mainly interested in applying it to alignment and subagent models) and came out with totally different takeaways. My takeaway was that the difficulty of credit assignment is a major limiting factor (and in particular this led to thinking about Incentive Design with Imperfect Credit Assignment, which in turn led to my current formulation of the Pointers Problem).
Now, the way economists usually model credit assignment is in terms of incentives, which theoretically aren’t necessary if all the agents share a goal. On the other hand, looking at how groups work in practice, I expect that the informational role of credit assignment is actually the load-bearing part at least as much as (if not more than) the incentive-alignment role.
For instance, a price mechanism doesn’t just align incentives, it provides information for efficient production decisions, such that it still makes sense to use a price mechanism even if everyone shares a single goal. If the agents share a common goal, then in theory there doesn’t need to be a price mechanism, but a price mechanism sure is an efficient way to internally allocate resources in practice.
… and now that I’m thinking about it, there’s a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms (and credit allocation more generally), even though the phenomenon does not seem like it should require different goals.
I’m still not getting a good picture of what your thinking is on this. Seems like the inferential gap is wider than you’re expecting? Can you go into more details, and maybe include an example?
Memetics example: in the vanilla HCH tree, some agent way down the tree ignores their original task and returns an answer which says “the top-level question asker urgently needs to know X!” followed by some argument. And that sort of argument, if it has high memetic fitness (independent of whether it’s correct), gets passed all the way back up the tree. The higher the memetic fitness, the further it propagates.
And if we have an exponentially large tree, with this sort of thing being generated a nontrivial fraction of the time, then there will be lots of these things generated. And there will be a selection process as more-memetically-fit messages get passed up, collide with each other, and people have to choose which ones to pass further up. What pops out at the top is, potentially, very-highly-optimized memes drawn from an exponentially large search space.
And of course this all applies even if the individual agents are all well-intentioned and trying their best. As with “unconscious economics”, it’s the selection pressures which dominate, not the individuals’ intentions.
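To make the selection pressure concrete, here's a toy simulation (an illustrative sketch of my own, with uniform random draws standing in for memetic fitness): each leaf emits a message with a random fitness score, and each node passes up only the fittest message it receives, so the root sees the maximum of exponentially many draws.

```python
import random

def propagate(depth, branching, draw_fitness):
    """Each leaf emits a message with a random memetic-fitness score;
    each internal node passes up only the fittest message it receives."""
    if depth == 0:
        return draw_fitness()
    return max(propagate(depth - 1, branching, draw_fitness)
               for _ in range(branching))

random.seed(0)
# The root sees the max of branching**depth draws, so selection strength
# grows with tree size regardless of any individual agent's intentions.
shallow = propagate(3, 5, random.random)   # best of 5**3  = 125 messages
deep = propagate(6, 5, random.random)      # best of 5**6 = 15625 messages
print(shallow, deep)
```

With uniform fitness the expected winner of n draws is n/(n+1), already crowding the extreme tail for the deep tree; swap in any heavy-tailed fitness distribution and the effect only sharpens.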
My takeaway was that the difficulty of credit assignment is a major limiting factor
With existing human institutions, a big part of the problem has to be that every participant has an incentive to distort the credit assignment (i.e., cause more credit to be assigned to oneself). (This is what I conclude from economic theory and also fits with my experience and common sense.) It may well be the case that even if you removed this issue, credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)? If you know of some theory/math/model that says that credit assignment would be a big problem with HCH, why not talk about that instead?
If you look at the economic theories (mostly based on game theory today) that try to explain why economies are organized the way they are, and where market inefficiencies come from, they all have a fundamental dependence on the assumption of different participants having different interests/values. In other words, if you removed that assumption from the theoretical models and replaced it with the opposite assumption, they would collapse in the sense that all or most of the inefficiencies (“transaction costs”) would go away...
...With existing human institutions, a big part of the problem has to be that every participant has an incentive to distort the credit assignment (i.e., cause more credit to be assigned to oneself). (This is what I conclude from economic theory and also fits with my experience and common sense.)
I’m going to jump in briefly to respond on one line of reasoning. John says the following, and I’d like to just give two examples from my own life of it.
Now, the way economists usually model credit assignment is in terms of incentives, which theoretically aren’t necessary if all the agents share a goal. On the other hand, looking at how groups work in practice, I expect that the informational role of credit assignment is actually the load-bearing part at least as much as (if not more than) the incentive-alignment role.
For instance, a price mechanism doesn’t just align incentives, it provides information for efficient production decisions, such that it still makes sense to use a price mechanism even if everyone shares a single goal. If the agents share a common goal, then in theory there doesn’t need to be a price mechanism, but a price mechanism sure is an efficient way to internally allocate resources in practice.
… and now that I’m thinking about it, there’s a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms (and credit allocation more generally), even though the phenomenon does not seem like it should require different goals.
Microcovid Tax
In my group house during the early pandemic, we often spent hours each week negotiating rules about what we could and couldn’t do. We could order take-out food if we put it in the oven for 20 mins, we could go for walks outside with friends if 6 feet apart, etc. This was very costly, and tired everyone out.
We later replaced it (thanks especially to Daniel Filan for this proposal) with a microcovid tax, where each person could do as they wished, then calculate the microcovids they gathered, and pay the house $1/microcovid (this was determined by calculating everyone’s cost/life, multiplying by expected loss of life if they got covid, dividing by 1 million, then summing over all housemates).
This massively reduced negotiation overhead and also removed the need for norm-enforcement mechanisms. If you made a mistake, we didn’t punish you or tell you off, we just charged you the microcovid tax.
This was a situation where everyone was trusted to be completely honest about their exposures. It nonetheless made it easier for everyone to make tradeoffs in everyone else’s interests.
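For concreteness, the rate computation described above can be sketched like this (the house composition and all dollar figures below are hypothetical, not the actual numbers we used):

```python
def microcovid_tax_rate(housemates):
    """Dollars owed per microcovid incurred, per the scheme above:
    for each housemate, (dollar value they place on their life)
    * (expected loss of life if they catch covid), divided by
    1,000,000 (one microcovid = a one-in-a-million chance of infection),
    then summed over the whole house."""
    return sum(cost_per_life * expected_loss / 1_000_000
               for cost_per_life, expected_loss in housemates)

# Hypothetical six-person house: each values a life at $20M and
# expects a 0.8% loss of life from a covid infection.
house = [(20_000_000, 0.008)] * 6
rate = microcovid_tax_rate(house)
print(rate)  # ~0.96, i.e. roughly $1/microcovid
```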
Paying for Resources
Sometimes within the Lightcone team, when people wish to make bids on others’ resources, people negotiate a price. If some team members want another team member to e.g. stay overtime for a meeting, move the desk they work from, change what time frame they’re going to get something done, or otherwise bid for a use of the other teammate’s resources, it’s common enough for someone to state a price, and then the action only goes through if both parties agree to a trade.
I don’t think this is because we all have different goals. I think it’s primarily because it’s genuinely difficult to know (a) how valuable it is to the asker and (b) how costly it is to the askee.
On some occasions I’m bidding for something that seems clearly to me to be the right call, but when the person is asked how much they’d need to be paid in order to make it worth it, they give a much higher number, and it turns out there was a hidden cost I was not modeling.
If a coordination point is sticking, reducing it to a financial trade helps speed it up, by turning the hidden information into a willingness-to-pay / willingness-to-be-paid integer.
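As a sketch, that reduction comes down to comparing two stated numbers (the function name and the split-the-difference convention here are my own illustration, not a house rule):

```python
def resolve_bid(willingness_to_pay, willingness_to_be_paid):
    """The action goes through only if the asker values it at least
    as much as it costs the askee, both stated in dollars."""
    if willingness_to_pay >= willingness_to_be_paid:
        # Any price in between leaves both sides weakly better off;
        # splitting the difference is one simple convention.
        return (willingness_to_pay + willingness_to_be_paid) / 2
    return None  # no agreeable price: a hidden cost outweighed the benefit

print(resolve_bid(100, 40))   # 70.0 -- trade happens
print(resolve_bid(100, 250))  # None -- the askee's hidden cost blocks it
```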
In sum
Figuring out the costs of an action in someone else’s world is detailed and costly work, and price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change this fact.
I am unclear to what extent this is a crux for the whole issue, but it does seem to me that insofar as Wei Dai believes (these are my words) “agents bending the credit-assignment toward selfish goals is the primary reason that credit assignment is difficult and HCH resolves it by having arbitrarily many copies of the same (self-aligned) individual”, this is false.
If a coordination point is sticking, reducing it to a financial trade helps speed it up, by turning the hidden information into a willingness-to-pay / willingness-to-be-paid integer.
I don’t disagree with this. I would add that if agents aren’t aligned, then that introduces an additional inefficiency into this pricing process, because each agent now has an incentive to distort the price to benefit themselves, and this (together with information asymmetry) means some mutually profitable trades will not occur.
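A minimal illustration of that inefficiency (the linear "shading" model below is an assumption for the sketch, not a standard result): when each side distorts its report to capture more of the surplus, some genuinely profitable trades stop clearing.

```python
def trade_clears(true_wtp, true_wta, shade=0.0):
    """Each side misreports to grab more surplus: the buyer understates
    value, the seller overstates cost, each by a fraction `shade`."""
    reported_wtp = true_wtp * (1 - shade)
    reported_wta = true_wta * (1 + shade)
    return reported_wtp >= reported_wta

# A genuinely profitable trade: the buyer values it at 100, it costs 90.
print(trade_clears(100, 90, shade=0.0))  # True  -- honest reports, trade clears
print(trade_clears(100, 90, shade=0.2))  # False -- strategic shading kills it
```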
Figuring out the costs of an action in someone else’s world is detailed and costly work, and price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change this fact.
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I’m not super confident about this (and I’m overall pretty skeptical about HCH for this and other reasons), but just think that John was too confident in his position in the OP or at least hasn’t explained his position enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor
Huh, my first thought was that the depth of the tree is measured in training epochs, while width is cheaper, since HCH is just one model and going much deeper amounts to running more training epochs. But how deep we effectively go depends on how robust the model is to particular prompts that occur on that path in the tree, and there could be a way to decide whether to run a request explicitly, unwinding another level of the subtree as multiple instances of the model (deliberation/reflection), or to answer it immediately, with a single instance, relying on what’s already in the model (intuition/babble). This way, the effective depth of the tree at the level of performance around the current epoch could extend more, so the effect of learning effort on performance would increase.
This decision mirrors what happens at the goodhart boundary pretty well (there, you don’t allow incomprehensible/misleading prompts that are outside the boundary), but the decision here will be further from the boundary (very familiar prompts can be answered immediately, while less familiar but still comprehensible prompts motivate unwinding the subtree by another level, implicitly creating more training data to improve robustness on those prompts).
The intuitive answers that don’t require deliberation are close to the center of the concept of aligned behavior, while incomprehensible situations in the crash space are where the concept (in current understanding) fails to apply. So it’s another reason to associate robustness with the goodhart boundary, to treat it as a robustness threshold, as this gives centrally aligned behavior as occurring in situations where the model has robustness above another threshold.
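One way to cash out the expand-or-answer decision described above, as a toy (the model class, thresholds, and length-based "robustness" proxy are all stand-ins of my own, not anything from the comment):

```python
COMPREHENSIBLE, ROBUST = 0.2, 0.8  # hypothetical thresholds

class ToyModel:
    """Stand-in model: 'robustness' is just prompt familiarity by length."""
    def robustness(self, prompt):
        return max(0.0, 1.0 - len(prompt) / 40)
    def respond(self, prompt):
        return f"intuition({prompt})"
    def decompose(self, prompt):
        half = len(prompt) // 2
        return [prompt[:half], prompt[half:]]
    def combine(self, prompt, subanswers):
        return f"combined({', '.join(subanswers)})"

def answer(model, prompt):
    r = model.robustness(prompt)
    if r < COMPREHENSIBLE:
        return "refuse"  # outside the goodhart boundary: don't engage
    if r >= ROBUST:
        return model.respond(prompt)  # familiar: single-instance answer
    # Less familiar but still comprehensible: unwind one more subtree level.
    subs = [answer(model, s) for s in model.decompose(prompt)]
    return model.combine(prompt, subs)

m = ToyModel()
print(answer(m, "hi"))       # short/familiar -> immediate intuition
print(answer(m, "x" * 20))   # comprehensible but unfamiliar -> deliberation
print(answer(m, "x" * 36))   # past the boundary -> refuse
```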
It may well be the case that even if you removed [incentive to distort the credit assignment], credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)?
Because there exist human institutions in which people generally seem basically aligned and not trying to game the credit assignment. For instance, most of the startups I’ve worked at were like this (size ~20 people), and I think the alignment research community is basically like this today (although I’ll be surprised if that lasts another 3 years). Probably lots of small-to-medium size orgs are like this, especially in the nonprofit space. It’s hard to get very big orgs/communities without letting in some credit monsters, but medium-size is still large enough to see coordination problems kick in (we had no shortage of them at ~20-person startups).
And, to be clear, I’m not saying these orgs have zero incentive to distort credit assignment. Humans do tend to do that sort of thing reflexively, to some extent. But to the extent that it’s reflexive, it would also apply to HCH and variants thereof. For instance, people in HCH would still reflexively tend to conceal evidence/arguments contradicting their answers. (And when someone does conceal contradictory evidence/arguments, that would presumably increase the memetic fitness of their claims, causing them to propagate further up the tree, so that also provides a selection channel.) Similarly, if the HCH implementation has access to empirical testing channels and the ability to exchange multiple messages, people would still reflexively tend to avoid/bury tests which they expect will actually falsify their answers, or try to blame incorrect answers on subquestions elsewhere in the tree when an unexpected experimental outcome occurs and someone tries to backpropagate to figure out where the prediction-failure came from. (And, again, those who shift blame successfully will presumably have more memetic fitness, etc.)
What if 90% or 99% of the work was not object level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including specialized to particular projects being implemented, including the projects that set this up, iterating as relevant wisdom/tuning and reference texts accumulate? This isn’t feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
Remember all that work still needs to be done by HCH itself. Mechanism/incentive design, surveillance/interpretability, and rationality training/tuning all seem about-as-difficult as the alignment problem itself, if not more so.
Copying people is a potential game changer in general, but HCH seems like a really terrible way to organize those copies.
In my view, the purpose of human/HCH distinction is that there are two models, that of a “human” and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and use the human’s pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.
The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s a prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy, for example one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies, don’t go into the Chinese room territory, and each episode starts as a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It’s only the bureaucracies that get to be exposed to Chinese room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.
Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.
So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies, whose outputs can be assessed/used by the more reliable but less trainable human (as more legible reference texts, and not large/opaque models) to improve bureaucracy (episode) designs, to gain leverage over bureaucracies that are more specialized and robust to weird prompts/objectives, by solving more principal-agent issues.
More work being allocated to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be those specialized in those principal-agent problem (alignment) prompts/objectives, and not in the object-level objective, even if it’s the object-level objective bureaucracy that’s being currently trained/tuned. Here, the principal-agent objective bureaucracies (alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (that bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s being currently done.
I’d be interested in your thoughts on [Humans-in-a-science-lab consulting HCH], for questions where we expect that suitable empirical experiments could be run on a significant proportion of subquestions. It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection.
Would you still expect this to go badly wrong (assume you get to pick the humans)? If so, would you expect sufficiently large civilizations to be crippled through memetic selection by default? If [yes, no], what do you see as the important differences?
… and now that I’m thinking about it, there’s a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms...
I don’t think it’s a gap in economic theory in general: pretty sure I’ve heard the [price mechanisms as distributed computation] idea from various Austrian-school economists without reliance on agents with different goals—only on “What should x cost in context y?” being a question whose answer depends on the entire system.
It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection.
Would you still expect this to go badly wrong (assume you get to pick the humans)? If [yes, no], what do you see as the important differences?
Ok, so, some background on my mental image. Before yesterday, I had never pictured HCH as a tree of John Wentworths (thank you Rohin for that). When I do picture John Wentworths, they mostly just… refuse to do the HCH thing. Like, they take one look at this setup and decide to (politely) mutiny or something. Maybe they’re willing to test it out, but they don’t expect it to work, and it’s likely that their output is something like the string “lol nope”. I think an entire society of John Wentworths would probably just not have bureaucracies at all; nobody would intentionally create them, and if they formed accidentally nobody would work for them or deal with them.
Now, there’s a whole space of things-like-HCH, and some of them look less like a simulated infinite bureaucracy and more like a simulated society. (The OP mostly wasn’t talking about things on the simulated-society end of the spectrum, because there will be another post on that.) And I think a bunch of John Wentworths in something like a simulated society would be fine—they’d form lots of small teams working in-person, have forums like LW for reasonably-high-bandwidth interteam communication, and have bounties on problems and secondary markets on people trying to get the bounties and independent contractors and all that jazz.
Anyway, back to your question. If those John Wentworths lacked the ability to run experiments, they would be relatively pessimistic about their own chances, and a huge portion of their work would be devoted to figuring out how to pump bits of information and stay grounded without a real-world experimental feedback channel. That’s not a deal-breaker; background knowledge of our world already provides far more bits of evidence than any experiment ever run, and we could still run experiments on the simulated-Johns. But I sure would be a lot more optimistic with an experimental channel.
I do not think memetic selection in particular would cripple those Johns, because that’s exactly the sort of thing they’d be on the lookout for. But I’m not confident of that. And I’d be a lot more pessimistic about the vast majority of other people. (I do expect that most people think a bureaucracy/society of themselves would work better than the bureaucracies/societies we have, and I expect that at least a majority and probably a large majority are wrong about that, because bureaucracies are generally made of median-ish people. So I am very suspicious of my inner simulator saying “well, if it was a bunch of copies of John Wentworth, they would know to avoid the failure modes which mess up real-world bureaucracies/societies”. Most people probably think that, and most people are probably wrong about it.)
I do think our current civilization is crippled by memetic selection to pretty significant extent. (I mean, that’s not the only way to frame it or the only piece, but it’s a correct frame for a large piece.)
I don’t think it’s a gap in economic theory in general: pretty sure I’ve heard the [price mechanisms as distributed computation] idea from various Austrian-school economists without reliance on agents with different goals—only on “What should x cost in context y?” being a question whose answer depends on the entire system.
Economists do talk about that sort of thing, but I don’t usually see it in their math. Of course we can get e.g. implied prices for any Pareto-optimal system, but I don’t know of math saying that systems will end up using those implied prices internally.
Interesting, thanks. This makes sense to me. I do think strong-HCH can support the ”...more like a simulated society...” stuff in some sense—which is to say that it can be supported so long as we can rely on individual Hs to robustly implement the necessary pointer passing (which, to be fair, we can’t).
To add to your “tree of John Wentworths”, it’s worth noting that H doesn’t need to be an individual human—so we could have our H be e.g. {John Wentworth, Eliezer Yudkowsky, Paul Christiano, Wei Dai}, or whatever team would make you more optimistic about lack of memetic disaster. (we also wouldn’t need to use the same H at every level)
Yeah, at some point we’re basically simulating the alignment community (or possibly several copies thereof interacting with each other). There will probably be another post on that topic soonish.
(I have added the point I wanted to add to this conversation, and will tap out now.)
Huh, my first thought was that the depth of the tree is measured in training epochs, while width is cheaper, since HCH is just one model and going much deeper amounts to running more training epochs. But how deep we effectively go depends on how robust the model is to particular prompts that occur on that path in the tree, and there could be a way to decide whether to run a request explicitly, unwinding another level of the subtree as multiple instances of the model (deliberation/reflection), or to answer it immediately, with a single instance, relying on what’s already in the model (intuition/babble). This way, the effective depth of the tree at the level of performance around the current epoch could extend more, so the effect of learning effort on performance would increase.
This decision mirrors what happens at the goodhart boundary pretty well (there, you don’t allow incomprehensible/misleading prompts that are outside the boundary), but the decision here will be further from the boundary (very familiar prompts can be answered immediately, while less familiar but still comprehensible prompts motivate unwinding the subtree by another level, implicitly creating more training data to improve robustness on those prompts).
The intuitive answers that don’t require deliberation are close to the center of the concept of aligned behavior, while incomprehensible situations in the crash space are where the concept (in current understanding) fails to apply. So it’s another reason to associate robustness with the goodhart boundary and to treat it as a robustness threshold, since this gives centrally aligned behavior as occurring in situations where the model’s robustness is above another, higher threshold.
Because there exist human institutions in which people generally seem basically aligned and not trying to game the credit assignment. For instance, most of the startups I’ve worked at were like this (size ~20 people), and I think the alignment research community is basically like this today (although I’ll be surprised if that lasts another 3 years). Probably lots of small-to-medium size orgs are like this, especially in the nonprofit space. It’s hard to get very big orgs/communities without letting in some credit monsters, but medium-size is still large enough to see coordination problems kick in (we had no shortage of them at ~20-person startups).
And, to be clear, I’m not saying these orgs have zero incentive to distort credit assignment. Humans do tend to do that sort of thing reflexively, to some extent. But to the extent that it’s reflexive, it would also apply to HCH and variants thereof. For instance, people in HCH would still reflexively tend to conceal evidence/arguments contradicting their answers. (And when someone does conceal contradictory evidence/arguments, that would presumably increase the memetic fitness of their claims, causing them to propagate further up the tree, so that also provides a selection channel.) Similarly, if the HCH implementation has access to empirical testing channels and the ability to exchange multiple messages, people would still reflexively tend to avoid/bury tests which they expect will actually falsify their answers, or try to blame incorrect answers on subquestions elsewhere in the tree when an unexpected experimental outcome occurs and someone tries to backpropagate to figure out where the prediction-failure came from. (And, again, those who shift blame successfully will presumably have more memetic fitness, etc.)
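As a toy illustration of that selection channel (my own construction, not anything from the literature): suppose each leaf of an HCH-like tree produces a claim with an independently drawn “memetic fitness” and an independent truth value, and each internal node passes up whichever child-claim is most memetically fit. The claim reaching the root is then strongly selected for fitness and not at all for truth:

```python
import random

def propagate(depth, branching, rng):
    """Return the (fitness, is_correct) claim surviving to the root of a
    tree where each node forwards its most memetically fit child-claim."""
    if depth == 0:
        # A leaf claim: fitness and correctness drawn independently.
        return rng.random(), rng.random() < 0.5
    children = [propagate(depth - 1, branching, rng) for _ in range(branching)]
    return max(children)  # selection on fitness, blind to correctness

rng = random.Random(0)
trials = [propagate(depth=5, branching=3, rng=rng) for _ in range(200)]
avg_fitness = sum(f for f, _ in trials) / len(trials)
frac_correct = sum(c for _, c in trials) / len(trials)
# avg_fitness gets pushed toward 1 (the max of 3**5 = 243 draws),
# while frac_correct stays near the 0.5 base rate.
```

The deeper the tree, the stronger the selection pressure on fitness, with no corresponding pressure toward correctness unless the individual nodes check for it.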
What if 90% or 99% of the work was not object level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including specialized to particular projects being implemented, including the projects that set this up, iterating as relevant wisdom/tuning and reference texts accumulate? This isn’t feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
Remember all that work still needs to be done by HCH itself. Mechanism/incentive design, surveillance/interpretability, and rationality training/tuning all seem about-as-difficult as the alignment problem itself, if not more so.
Copying people is a potential game changer in general, but HCH seems like a really terrible way to organize those copies.
In my view, the purpose of the human/HCH distinction is that there are two models, that of a “human” and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and use the human’s pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.
The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s a prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy, for example one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies, don’t go into Chinese room territory, and each episode starts as a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It’s only the bureaucracies that get to be exposed to Chinese room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.
Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.
So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies, whose outputs can be assessed/used by the more reliable but less trainable human (as more legible reference texts, and not large/opaque models) to improve bureaucracy (episode) designs, to gain leverage over bureaucracies that are more specialized and robust to weird prompts/objectives, by solving more principal-agent issues.
More work being allocated to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be those specialized in principal-agent problem (alignment) prompts/objectives rather than in the object-level objective, even if it’s the object-level objective bureaucracy that’s currently being trained/tuned. Here, the principal-agent objective bureaucracies (alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (that bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s currently being done.
I’d be interested in your thoughts on [Humans-in-a-science-lab consulting HCH], for questions where we expect that suitable empirical experiments could be run on a significant proportion of subquestions. It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection.
Would you still expect this to go badly wrong (assume you get to pick the humans)? If so, would you expect sufficiently large civilizations to be crippled through memetic selection by default? If [yes, no], what do you see as the important differences?
I don’t think it’s a gap in economic theory in general: pretty sure I’ve heard the [price mechanisms as distributed computation] idea from various Austrian-school economists without reliance on agents with different goals—only on “What should x cost in context y?” being a question whose answer depends on the entire system.
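The [price mechanisms as distributed computation] idea can be illustrated with a textbook tâtonnement sketch (standard Walrasian price adjustment; the specific demand and supply curves are made up for illustration). No participant needs a different goal or a view of the whole system: each only reports its own demand or supply at the posted price, and the iterated local adjustment computes the market-clearing price.

```python
def tatonnement(demand, supply, p0=1.0, step=0.1, tol=1e-8, max_iter=10_000):
    """Adjust price in the direction of excess demand until the market
    clears. Agents only answer "how much would you buy/sell at price p?";
    the clearing price emerges from the adjustment process as a whole."""
    p = p0
    for _ in range(max_iter):
        excess = demand(p) - supply(p)
        if abs(excess) < tol:
            break
        p += step * excess  # raise price when demand exceeds supply
    return p

# Made-up linear curves: demand falls and supply rises with price.
demand = lambda p: 10 - 2 * p
supply = lambda p: 1 + p
p_star = tatonnement(demand, supply)
# The clearing price solves 10 - 2p = 1 + p, i.e. p = 3.
```

Nothing in this computation requires the participants to have conflicting goals; the price is doing informational work, not (only) incentive work.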
Ok, so, some background on my mental image. Before yesterday, I had never pictured HCH as a tree of John Wentworths (thank you Rohin for that). When I do picture John Wentworths, they mostly just… refuse to do the HCH thing. Like, they take one look at this setup and decide to (politely) mutiny or something. Maybe they’re willing to test it out, but they don’t expect it to work, and it’s likely that their output is something like the string “lol nope”. I think an entire society of John Wentworths would probably just not have bureaucracies at all; nobody would intentionally create them, and if they formed accidentally nobody would work for them or deal with them.
Now, there’s a whole space of things-like-HCH, and some of them look less like a simulated infinite bureaucracy and more like a simulated society. (The OP mostly wasn’t talking about things on the simulated-society end of the spectrum, because there will be another post on that.) And I think a bunch of John Wentworths in something like a simulated society would be fine—they’d form lots of small teams working in-person, have forums like LW for reasonably-high-bandwidth interteam communication, and have bounties on problems and secondary markets on people trying to get the bounties and independent contractors and all that jazz.
Anyway, back to your question. If those John Wentworths lacked the ability to run experiments, they would be relatively pessimistic about their own chances, and a huge portion of their work would be devoted to figuring out how to pump bits of information and stay grounded without a real-world experimental feedback channel. That’s not a deal-breaker; background knowledge of our world already provides far more bits of evidence than any experiment ever run, and we could still run experiments on the simulated-Johns. But I sure would be a lot more optimistic with an experimental channel.
I do not think memetic selection in particular would cripple those Johns, because that’s exactly the sort of thing they’d be on the lookout for. But I’m not confident of that. And I’d be a lot more pessimistic about the vast majority of other people. (I do expect that most people think a bureaucracy/society of themselves would work better than the bureaucracies/societies we have, and I expect that at least a majority and probably a large majority are wrong about that, because bureaucracies are generally made of median-ish people. So I am very suspicious of my inner simulator saying “well, if it was a bunch of copies of John Wentworth, they would know to avoid the failure modes which mess up real-world bureaucracies/societies”. Most people probably think that, and most people are probably wrong about it.)
I do think our current civilization is crippled by memetic selection to pretty significant extent. (I mean, that’s not the only way to frame it or the only piece, but it’s a correct frame for a large piece.)
Economists do talk about that sort of thing, but I don’t usually see it in their math. Of course we can get e.g. implied prices for any pareto-optimal system, but I don’t know of math saying that systems will end up using those implied prices internally.
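For reference, the “implied prices” here are the usual shadow prices: any interior, smooth Pareto optimum can be written as maximizing a weighted sum of utilities subject to resource constraints, and the Lagrange multipliers on those constraints act as prices. A standard sketch:

```latex
\max_{x_1,\dots,x_n} \; \sum_i \lambda_i\, u_i(x_i)
\quad\text{s.t.}\quad \sum_i x_i \le \omega
\qquad\Longrightarrow\qquad
\lambda_i \nabla u_i(x_i^{\ast}) = p \quad \forall i
```

Every agent’s marginal utilities line up with the single multiplier vector $p$, which is exactly the implied price vector. But, as noted above, this only says the prices *exist* at the optimum; it says nothing about the system computing or using them internally.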
Interesting, thanks. This makes sense to me.
I do think strong-HCH can support the “...more like a simulated society...” stuff in some sense—which is to say that it can be supported so long as we can rely on individual Hs to robustly implement the necessary pointer passing (which, to be fair, we can’t).
To add to your “tree of John Wentworths”, it’s worth noting that H doesn’t need to be an individual human—so we could have our H be e.g. {John Wentworth, Eliezer Yudkowsky, Paul Christiano, Wei Dai}, or whatever team would make you more optimistic about lack of memetic disaster. (we also wouldn’t need to use the same H at every level)
Yeah, at some point we’re basically simulating the alignment community (or possibly several copies thereof interacting with each other). There will probably be another post on that topic soonish.