Rant on Problem Factorization for Alignment
This post is the second in what is likely to become a series of uncharitable rants about alignment proposals (previously: Godzilla Strategies). In general, these posts are intended to convey my underlying intuitions. They are not intended to convey my all-things-considered, reflectively-endorsed opinions. In particular, my all-things-considered reflectively-endorsed opinions are usually more kind. But I think it is valuable to make the underlying, not-particularly-kind intuitions publicly-visible, so people can debate underlying generators directly. I apologize in advance to all the people I insult in the process.
With that in mind, let’s talk about problem factorization (a.k.a. task decomposition).
HCH
It all started with HCH, a.k.a. The Infinite Bureaucracy.
The idea of The Infinite Bureaucracy is that a human (or, in practice, human-mimicking AI) is given a problem. They only have a small amount of time to think about it and research it, but they can delegate subproblems to their underlings. The underlings likewise each have only a small amount of time, but can further delegate to their underlings, and so on down the infinite tree. So long as the humans near the top of the tree can “factorize the problem” into small, manageable pieces, the underlings should be able to get it done. (In practice, this would be implemented by training a question-answerer AI which can pass subquestions to copies of itself.)
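In code-shaped terms, the recursion looks something like the sketch below. This is a minimal illustration only; the `human_*` helpers are hypothetical stand-ins for a time-limited human (or human-mimicking model), not anything from the actual HCH writeups.

```python
# Minimal sketch of the HCH recursion. The three human_* functions are placeholder
# stand-ins for a time-limited human (or a model trained to mimic one).

def human_decomposes(question: str) -> list[str]:
    # Stand-in: a real human would spend their few minutes splitting the problem.
    return [f"first half of: {question}", f"second half of: {question}"]

def human_answers_directly(question: str) -> str:
    # Stand-in: at the leaves, the human answers with no further delegation.
    return f"short-budget guess at: {question}"

def human_combines(question: str, subanswers: list[str]) -> str:
    # Stand-in: assemble the delegated answers into a final answer.
    return f"answer to {question!r} assembled from {len(subanswers)} subanswers"

def hch(question: str, depth: int) -> str:
    """Each node gets a small budget; anything it can't handle is delegated one level down."""
    if depth == 0:
        return human_answers_directly(question)
    subanswers = [hch(q, depth - 1) for q in human_decomposes(question)]
    return human_combines(question, subanswers)

print(hch("design a provably safe AI", depth=3))
```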
At this point the ghost of George Orwell chimes in, not to say anything in particular, but just to scream. The ghost has a point: how on earth does an infinite bureaucracy seem like anything besides a terrible idea?
“Well,” says a proponent of the Infinite Bureaucracy, “unlike in a real bureaucracy, all the humans in the infinite bureaucracy are actually just trying to help you, rather than e.g. engaging in departmental politics.” So, ok, apparently this person has not met a lot of real-life bureaucrats. The large majority are decent people who are honestly trying to help. It is true that departmental politics are a big issue in bureaucracies, but those selection pressures apply regardless of the people’s intentions. And also, man, it sure does seem like Coordination is a Scarce Resource and Interfaces are a Scarce Resource, and scarcity of those sorts of things sure would make bureaucracies incompetent in basically the ways bureaucracies are incompetent in practice.
Debate and Other Successors
So, ok, maybe The Infinite Bureaucracy is not the right human institution to mimic. What institution can use humans to produce accurate and sensible answers to questions, robustly and reliably? Oh, I know! How about the Extremely Long Jury Trial? Y’know, because juries are, in practice, known for their extremely high reliability in producing accurate and sensible judgements!
“Well,” says the imaginary proponent, “unlike in a real Jury Trial, in the Extremely Long Jury Trial, the lawyers are both superintelligent and the arguments are so long that no human could ever possibly check them all the way through; the lawyers instead read each other’s arguments and then try to point the Jury at the particular places where the holes are in the opponent’s argument without going through the whole thing end-to-end.”
I rest my case.
Anyway, HCH and debate have since been followed by various other successors, which improve on their predecessors mostly by adding more boxes and arrows and loops and sometimes even multiple colors of arrows to the diagram describing the setup. Presumably the strategy is to make it complicated enough that it no longer obviously corresponds to some strategy which already fails in practice, and then we can bury our heads in the sand and pretend that We Just Don’t Know whether it will work and therefore maybe it will work.
(Reminder: in general I don’t reflectively endorse everything in this post; it’s accurately conveying my underlying intuitions, not my all-things-considered judgement. That last bit in particular was probably overly harsh.)
The Ought Experiment
I have a hypothesis about problem factorization research. My guess is that, to kids fresh out of the ivory tower with minimal work experience at actual companies, it seems totally plausible that humans can factorize problems well. After all, we manufacture all sorts of things on production lines, right? Ask someone who’s worked in a non-academia cognitive job for a while (like e.g. a tech company), at a company with more than a dozen people, and they’ll be like “lolwut obviously humans don’t factorize problems well, have you ever seen an actual company?”. I’d love to test this theory, please give feedback in the comments about your own work experience and thoughts on problem factorization.
Anyway, for someone either totally ignorant of the giant onslaught of evidence provided by day-to-day economic reality, or trying to ignore the giant onslaught of evidence in order to avoid their hopes being crushed, it apparently seems like We Just Don’t Know whether humans can factorize cognitive problems well. Sort of like We Just Don’t Know whether a covid test works until after the FDA finishes its trials, even after the test has been approved in the EU. Ok, that’s a little too harsh even for this post.
So Ought went out and tested it experimentally. (Which, sarcasm aside, was a great thing to do.)
The experiment setup: a group of people are given a Project Euler problem. The first person receives the problem, has five minutes to work on it, and records their thoughts in a google doc. The doc is then passed to the next person, who works on it for five minutes recording their thoughts in the doc, and so on down the line. (Note: I’m not sure it was 5 minutes exactly, but something like that.) As long as the humans are able to factor the problem into 5-minute-size chunks without too much overhead, they should be able to efficiently solve it this way.
So what actually happened?
The story I got from a participant is: it sucked. The google doc was mostly useless, you’d spend five minutes just trying to catch up and summarize, people constantly repeated work, and progress was mostly not cumulative. Then, eventually, one person would just ignore the google doc and manage to solve the whole problem in five minutes. (This was, supposedly, usually the same person.) So, in short, the humans utterly failed to factor the problems well, exactly as one would (very strongly) expect from seeing real-world companies in action.
This story basically matches the official write-up of the results.
So Ought said “Oops” and moved on to greener pastures... lol no, last I heard Ought is still trying to figure out whether better interface design and some ML integration can make problem factorization work. Which, to their credit, would be insanely valuable if they could do it.
That said, I originally heard about HCH and the then-upcoming Ought experiment from Paul Christiano in the summer of 2019. It was immediately very obvious to me that HCH was hopeless (for basically the reasons discussed here); at the time I asked Paul “So when the Ought experiments inevitably fail completely, what’s the fallback plan?”. And he basically said “back to more foundational research”. And to Paul’s credit, three years and an Ought experiment later, he’s now basically moved on to more foundational research.
Sandwiching
About a year ago, Cotra proposed a different class of problem factorization experiments: “sandwiching”. We start with some ML model which has lots of knowledge from many different fields, like GPT-n. We also have a human who has a domain-specific problem to solve (like e.g. a coding problem, or a translation to another language) but lacks the relevant domain knowledge (e.g. coding skills, or language fluency). The problem, roughly speaking, is to get the ML model and the human to work as a team, and produce an outcome at-least-as-good as a human expert in the domain. In other words, we want to factorize the “expert knowledge” and the “having a use-case” parts of the problem.
(The actual sandwiching experiment proposal adds some pieces which I claim aren’t particularly relevant to the point here.)
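For concreteness, the basic comparison has roughly the shape sketched below. This is a hedged sketch with invented names and a made-up scoring function, and it leaves out the extra pieces in the actual proposal.

```python
# Rough sketch of a sandwiching comparison: does a non-expert + model team match an
# unassisted domain expert? All names and the scoring are invented for illustration.

from typing import Callable

def sandwich_trial(
    task: str,
    nonexpert_with_model: Callable[[str], str],  # non-expert human steering the model
    expert_alone: Callable[[str], str],          # domain-expert human, no model
    score: Callable[[str], float],               # some task-specific quality measure
) -> bool:
    """Return True if the non-expert + model team does at least as well as the expert."""
    team_answer = nonexpert_with_model(task)
    expert_answer = expert_alone(task)
    return score(team_answer) >= score(expert_answer)

# Usage would look something like:
# sandwich_trial("translate this contract clause", team_pipeline, hired_expert, rubric_score)
```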
I love this as an experiment idea. It really nicely captures the core kind of factorization needed for factorization-based alignment to work. But Cotra makes one claim I don’t buy: that We Just Don’t Know how such experiments will turn out, or how hard sandwiching will be for cognitive problems in general. I claim that the results are very predictable, because things very much like this already happen all the time in practice.
For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn’t know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.
In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the “sandwich problem” would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don’t think we have a good solution in practice; I’d expect the expert business-owner to usually come up with a much better contract.
This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding) working with an engineer (who knows lots about coding but doesn’t understand what the designer wants), versus a product designer who’s also a fluent coder and familiar with the code base. I’ve experienced this one first-hand; the expert product designer is way better. Or, consider a well-intentioned mortgage salesman, who wants to get their customer the best mortgage for them, and the customer who understands the specifics of their own life but knows nothing about mortgages. Will they end up with as good a mortgage as a customer who has expertise in mortgages themselves? Probably not. (I’ve seen this one first-hand too.)
There’s tons of real-life sandwiching problems, and tons of economic incentive to solve them, yet we do not have good general-purpose solutions.
The Next Generation
Back in 2019, I heard Paul’s HCH proposal, heard about the Ought experiment, and concluded that this bad idea was already on track to self-correct via experimental feedback. Those are the best kind of bad ideas. I wrote up some of the relevant underlying principles (Coordination as a Scarce Resource and Interfaces as a Scarce Resource), but mostly waited for the problem to solve itself. And I think that mostly worked… for Paul.
But meanwhile, over the past year or so, the field has seen a massive influx of bright-eyed new alignment researchers fresh out of college/grad school, with minimal work experience in industry. And of course most of them don’t read through most of the enormous, undistilled, and very poorly indexed corpus of failed attempts from the past ten years. (And it probably doesn’t help that a plurality come through the AGI Safety Fundamentals course, which last time I checked had a whole section on problem factorization but, to my knowledge, didn’t even mention the Ought experiment or the massive pile of close real-world economic analogues. It does include two papers which got ok results by picking easy-to-decompose tasks and hard-coding the decompositions.) So we have a perfect recipe for people who will see problem factorization and think “oh, hey, that could maybe work!”.
If we’re lucky, hopefully some of the onslaught of bright-eyed new researchers will attempt their own experiments (like e.g. sandwiching) and manage to self-correct, but at this point new researchers are pouring in faster than any experiments are likely to proceed, so probably the number of people pursuing this particular dead end will go up over time.
Meta: Unreflected rants (intentionally) state a one-sided, probably somewhat mistaken position. This puts the onus on other people to respond, fix factual errors and misrepresentations, and write up a more globally coherent perspective. Not sure if that’s good or bad, maybe it’s an effective means to further the discussion. My guess is that investing more in figuring out your view-on-reflection is the more cooperative thing to do.
I endorse this criticism, though I think the upsides outweigh the downsides in this case. (Specifically, the relevant upsides are (1) being able to directly discuss generators of beliefs, and (2) just directly writing up my intuitions is far less time-intensive than a view-on-reflection, to the point where I actually do it rather than never getting around to it.)
This post seems to rely too much on transferring intuitions about existing human institutions to the new (e.g. HCH) setting, where there are two big differences that could invalidate those intuitions:
1. Real humans all have different interests/values, even if most of them on a conscious level are trying to help.
2. Real humans are very costly and institutions have to economize on them. (Is coordination still a scarce resource if we can cheaply add more “coordinators”?)
In this post, you don’t explain in any detail why you think the intuitions should nevertheless transfer. I read some of the linked posts that might explain this, and couldn’t find an explanation in them either. They seem to talk about problems in human institutions, and don’t mention why the same issues might exist in new constructs such as HCH despite the differences that I mention. For example you link “those selection pressures apply regardless of the people’s intentions” to Unconscious Economics but it’s just not obvious to me how that post applies in the case of HCH.
The main reason it would transfer to HCH (and ~all other problem-factorization-based proposals I’ve seen) is that the individual units in those proposals are generally human-mimickers of some kind (similar to e.g. GPT). Indeed, the original point of HCH is to be able to solve problems beyond what an individual human can solve while training on human mimicry, in order to get the outer alignment benefits of human mimicry.
E.g. for unconscious economics in particular, the selection effects mostly apply to memetics in the HCH tree. And in versions of HCH which allow repeated calls to the same human (as Paul’s later version of the proposal does IIRC), unconscious economics applies in the more traditional ways as well.
The two differences you mention seem not-particularly-central to real-world institutional problems. In order to expect that existing problems wouldn’t transfer, based on those two differences, we’d need some argument that those two differences address the primary bottlenecks to better performance in existing institutions. (1) seems mostly-irrelevant-in-practice to me; do you want to give an example or two of where it would be relevant? (2) has obvious relevance, but in practice I think most institutions do not have so many coordinators that it’s eating up a plurality of the budget, which is what I’d expect to see if there weren’t rapidly decreasing marginal returns on additional coordinators. (Though I could give a counterargument to that: there’s a story in which managers, who both handle most coordination in practice and make hiring decisions, tend to make themselves a bottleneck by under-hiring coordinators, since coordinators would compete with the managers for influence.) Also it is true that particularly good coordinators are extremely expensive, so I do still put some weight on (2).
I’m still not getting a good picture of what your thinking is on this. Seems like the inferential gap is wider than you’re expecting? Can you go into more details, and maybe include an example?
My intuition around (1) being important mostly comes from studying things like industrial organization and theory of the firm. If you look at the economic theories (mostly based on game theory today) that try to explain why economies are organized the way they are, and where market inefficiencies come from, they all have a fundamental dependence on the assumption of different participants having different interests/values. In other words, if you removed that assumption from the theoretical models and replaced it with the opposite assumption, they would collapse in the sense that all or most of the inefficiencies (“transaction costs”) would go away and it would become very puzzling why, for example, there are large hierarchical firms instead of everyone being independent contractors who just buy and sell their labor/services on the open market, or why monopolies are bad (i.e., cause “deadweight loss” in the economy).
I still have some uncertainty that maybe these ivory tower theories/economists are wrong, and you’re actually right about (1) not being that important, but I’d need some more explanations/arguments in that direction for it to be more than a small doubt at this point.
Oh that’s really interesting. I did a dive into theory of the firm research a couple years ago (mainly interested in applying it to alignment and subagent models) and came out with totally different takeaways. My takeaway was that the difficulty of credit assignment is a major limiting factor (and in particular this led to thinking about Incentive Design with Imperfect Credit Assignment, which in turn led to my current formulation of the Pointers Problem).
Now, the way economists usually model credit assignment is in terms of incentives, which theoretically aren’t necessary if all the agents share a goal. On the other hand, looking at how groups work in practice, I expect that the informational role of credit assignment is actually the load-bearing part at least as much as (if not more than) the incentive-alignment role.
For instance, a price mechanism doesn’t just align incentives, it provides information for efficient production decisions, such that it still makes sense to use a price mechanism even if everyone shares a single goal. If the agents share a common goal, then in theory there doesn’t need to be a price mechanism, but a price mechanism sure is an efficient way to internally allocate resources in practice.
… and now that I’m thinking about it, there’s a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms (and credit allocation more generally), even though the phenomenon does not seem like it should require different goals.
Memetics example: in the vanilla HCH tree, some agent way down the tree ignores their original task and returns an answer which says “the top-level question asker urgently needs to know X!” followed by some argument. And that sort of argument, if it has high memetic fitness (independent of whether it’s correct), gets passed all the way back up the tree. The higher the memetic fitness, the further it propagates.
And if we have an exponentially large tree, with this sort of thing being generated a nontrivial fraction of the time, then there will be lots of these things generated. And there will be a selection process as more-memetically-fit messages get passed up, collide with each other, and people have to choose which ones to pass further up. What pops out at the top is, potentially, very-highly-optimized memes drawn from an exponentially large search space.
And of course this all applies even if the individual agents are all well-intentioned and trying their best. As with “unconscious economics”, it’s the selection pressures which dominate, not the individuals’ intentions.
With existing human institutions, a big part of the problem has to be that every participant has an incentive to distort the credit assignment (i.e., cause more credit to be assigned to oneself). (This is what I conclude from economic theory and also fits with my experience and common sense.) It may well be the case that even if you removed this issue, credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)? If you know of some theory/math/model that says that credit assignment would be a big problem with HCH, why not talk about that instead?
Wei Dai says:
I’m going to jump in briefly to respond on one line of reasoning. John says the following, and I’d like to just give two examples from my own life of it.
Microcovid Tax
In my group house during the early pandemic, we often spent hours each week negotiating rules about what we could and couldn’t do. We could order take-out food if we put it in the oven for 20 mins, we could go for walks outside with friends if 6 feet apart, etc. This was very costly, and tired everyone out.
We later replaced it (thanks especially to Daniel Filan for this proposal) with a microcovid tax, where each person could do as they wished, then calculate the microcovids they gathered, and pay the house $1/microcovid (this was determined by calculating everyone’s cost/life, multiplying by expected loss of life if they got covid, dividing by 1 million, then summing over all housemates).
This massively reduced negotiation overhead and also removed the need for norm-enforcement mechanisms. If you made a mistake, we didn’t punish you or tell you off, we just charged you the microcovid tax.
This was a situation where everyone was trusted to be completely honest about their exposures. It nonetheless made it easier for everyone to make tradeoffs in everyone else’s interests.
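For concreteness, the price calculation described above has roughly the shape sketched below; the numbers are made up for illustration, not the house’s actual figures.

```python
# Sketch of the house's $/microcovid price, per the description above. Numbers are
# made up for illustration; the real house used its members' actual figures.

housemates = {
    # name: (dollar value placed on own life, expected fraction of life lost if infected)
    "alice":  (10_000_000, 0.025),
    "bob":    (10_000_000, 0.025),
    "carol":  (10_000_000, 0.025),
    "daniel": (10_000_000, 0.025),
}

# A microcovid is a one-in-a-million chance of catching covid, so each housemate's
# expected cost per microcovid is value_of_life * expected_loss / 1_000_000.
price_per_microcovid = sum(
    value_of_life * expected_loss / 1_000_000
    for value_of_life, expected_loss in housemates.values()
)

exposure = 150  # microcovids gathered from, say, one risky activity that week
print(f"${price_per_microcovid:.2f}/microcovid; tax owed: ${price_per_microcovid * exposure:.2f}")
```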
Paying for Resources
Sometimes within the Lightcone team, when people wish to make bids on others’ resources, people negotiate a price. If some team members want another team member to e.g. stay overtime for a meeting, move the desk they work from, change what time frame they’re going to get something done, or otherwise bid for a use of the other teammate’s resources, it’s common enough for someone to state a price, and then the action only goes through if both parties agree to a trade.
I don’t think this is because we all have different goals. I think it’s primarily because it’s genuinely difficult to know (a) how valuable it is to the asker and (b) how costly it is to the askee.
On some occasions I’m bidding for something that seems clearly to me to be the right call, but when the person is asked how much they’d need to be paid in order to make it worth it, they give a much higher number, and it turns out there was a hidden cost I was not modeling.
If a coordination point is sticking, reducing it to a financial trade helps speed it up, by turning the hidden information into a willingness-to-pay / willingness-to-be-paid integer.
In sum
Figuring out the costs of an action in someone else’s world is detailed and costly work, and price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change this fact.
I am unclear to what extent this is a crux for the whole issue, but it does seem to me that insofar as Wei Dai believes (these are my words) “agents bending the credit-assignment toward selfish goals is the primary reason that credit assignment is difficult and HCH resolves it by having arbitrarily many copies of the same (self-aligned) individual”, this is false.
I don’t disagree with this. I would add that if agents aren’t aligned, then that introduces an additional inefficiency into this pricing process, because each agent now has an incentive to distort the price to benefit themselves, and this (together with information asymmetry) means some mutually profitable trades will not occur.
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I’m not super confident about this (and I’m overall pretty skeptical about HCH for this and other reasons), but just think that John was too confident in his position in the OP or at least hasn’t explained his position enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?
(I have added the point I wanted to add to this conversation, and will tap out now.)
Huh, my first thought was that the depth of the tree is measured in training epochs, while width is cheaper, since HCH is just one model and going much deeper amounts to running more training epochs. But how deep we effectively go depends on how robust the model is to particular prompts that occur on that path in the tree, and there could be a way to decide whether to run a request explicitly, unwinding another level of the subtree as multiple instances of the model (deliberation/reflection), or to answer it immediately, with a single instance, relying on what’s already in the model (intuition/babble). This way, the effective depth of the tree at the level of performance around the current epoch could extend more, so the effect of learning effort on performance would increase.
This decision mirrors what happens at the goodhart boundary pretty well (there, you don’t allow incomprehensible/misleading prompts that are outside the boundary), but the decision here will be further from the boundary (very familiar prompts can be answered immediately, while less familiar but still comprehensible prompts motivate unwinding the subtree by another level, implicitly creating more training data to improve robustness on those prompts).
The intuitive answers that don’t require deliberation are close to the center of the concept of aligned behavior, while incomprehensible situations in the crash space are where the concept (in current understanding) fails to apply. So it’s another reason to associate robustness with the goodhart boundary, to treat it as a robustness threshold, as this gives centrally aligned behavior as occurring for situations where the model has robustness above another threshold.
Because there exist human institutions in which people generally seem basically aligned and not trying to game the credit assignment. For instance, most of the startups I’ve worked at were like this (size ~20 people), and I think the alignment research community is basically like this today (although I’ll be surprised if that lasts another 3 years). Probably lots of small-to-medium size orgs are like this, especially in the nonprofit space. It’s hard to get very big orgs/communities without letting in some credit monsters, but medium-size is still large enough to see coordination problems kick in (we had no shortage of them at ~20-person startups).
And, to be clear, I’m not saying these orgs have zero incentive to distort credit assignment. Humans do tend to do that sort of thing reflexively, to some extent. But to the extent that it’s reflexive, it would also apply to HCH and variants thereof. For instance, people in HCH would still reflexively tend to conceal evidence/arguments contradicting their answers. (And when someone does conceal contradictory evidence/arguments, that would presumably increase the memetic fitness of their claims, causing them to propagate further up the tree, so that also provides a selection channel.) Similarly, if the HCH implementation has access to empirical testing channels and the ability to exchange multiple messages, people would still reflexively tend to avoid/bury tests which they expect will actually falsify their answers, or try to blame incorrect answers on subquestions elsewhere in the tree when an unexpected experimental outcome occurs and someone tries to backpropagate to figure out where the prediction-failure came from. (And, again, those who shift blame successfully will presumably have more memetic fitness, etc.)
What if 90% or 99% of the work was not object level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including specialized to particular projects being implemented, including the projects that set this up, iterating as relevant wisdom/tuning and reference texts accumulate? This isn’t feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
Remember all that work still needs to be done by HCH itself. Mechanism/incentive design, surveillance/interpretability, and rationality training/tuning all seem about-as-difficult as the alignment problem itself, if not more so.
Copying people is a potential game changer in general, but HCH seems like a really terrible way to organize those copies.
In my view, the purpose of human/HCH distinction is that there are two models, that of a “human” and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and use the human’s pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.
The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s a prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy, for example one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies, don’t go into Chinese room territory, and each episode starts as a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It’s only the bureaucracies that get to be exposed to Chinese room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.
Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.
So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies, whose outputs can be assessed/used by the more reliable but less trainable human (as more legible reference texts, and not large/opaque models) to improve bureaucracy (episode) designs, to gain leverage over bureaucracies that are more specialized and robust to weird prompts/objectives, by solving more principal-agent issues.
More work being allocated to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be those specialized in those principal-agent problem (alignment) prompts/objectives, and not in the object-level objective, even if it’s the object-level objective bureaucracy that’s being currently trained/tuned. Here, the principal-agent objective bureaucracies (alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (that bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s being currently done.
I’d be interested in your thoughts on [Humans-in-a-science-lab consulting HCH], for questions where we expect that suitable empirical experiments could be run on a significant proportion of subquestions. It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection.
Would you still expect this to go badly wrong (assume you get to pick the humans)? If so, would you expect sufficiently large civilizations to be crippled through memetic selection by default? If [yes, no], what do you see as the important differences?
I don’t think it’s a gap in economic theory in general: pretty sure I’ve heard the [price mechanisms as distributed computation] idea from various Austrian-school economists without reliance on agents with different goals—only on “What should x cost in context y?” being a question whose answer depends on the entire system.
Ok, so, some background on my mental image. Before yesterday, I had never pictured HCH as a tree of John Wentworths (thank you Rohin for that). When I do picture John Wentworths, they mostly just… refuse to do the HCH thing. Like, they take one look at this setup and decide to (politely) mutiny or something. Maybe they’re willing to test it out, but they don’t expect it to work, and it’s likely that their output is something like the string “lol nope”. I think an entire society of John Wentworths would probably just not have bureaucracies at all; nobody would intentionally create them, and if they formed accidentally nobody would work for them or deal with them.
Now, there’s a whole space of things-like-HCH, and some of them look less like a simulated infinite bureaucracy and more like a simulated society. (The OP mostly wasn’t talking about things on the simulated-society end of the spectrum, because there will be another post on that.) And I think a bunch of John Wentworths in something like a simulated society would be fine—they’d form lots of small teams working in-person, have forums like LW for reasonably-high-bandwidth interteam communication, and have bounties on problems and secondary markets on people trying to get the bounties and independent contractors and all that jazz.
Anyway, back to your question. If those John Wentworths lacked the ability to run experiments, they would be relatively pessimistic about their own chances, and a huge portion of their work would be devoted to figuring out how to pump bits of information and stay grounded without a real-world experimental feedback channel. That’s not a deal-breaker; background knowledge of our world already provides far more bits of evidence than any experiment ever run, and we could still run experiments on the simulated-Johns. But I sure would be a lot more optimistic with an experimental channel.
I do not think memetic selection in particular would cripple those Johns, because that’s exactly the sort of thing they’d be on the lookout for. But I’m not confident of that. And I’d be a lot more pessimistic about the vast majority of other people. (I do expect that most people think a bureaucracy/society of themselves would work better than the bureaucracies/societies we have, and I expect that at least a majority and probably a large majority are wrong about that, because bureaucracies are generally made of median-ish people. So I am very suspicious of my inner simulator saying “well, if it was a bunch of copies of John Wentworth, they would know to avoid the failure modes which mess up real-world bureaucracies/societies”. Most people probably think that, and most people are probably wrong about it.)
I do think our current civilization is crippled by memetic selection to pretty significant extent. (I mean, that’s not the only way to frame it or the only piece, but it’s a correct frame for a large piece.)
Economists do talk about that sort of thing, but I don’t usually see it in their math. Of course we can get e.g. implied prices for any pareto-optimal system, but I don’t know of math saying that systems will end up using those implied prices internally.
Interesting, thanks. This makes sense to me.
I do think strong-HCH can support the ”...more like a simulated society...” stuff in some sense—which is to say that it can be supported so long as we can rely on individual Hs to robustly implement the necessary pointer passing (which, to be fair, we can’t).
To add to your “tree of John Wentworths”, it’s worth noting that H doesn’t need to be an individual human—so we could have our H be e.g. {John Wentworth, Eliezer Yudkowsky, Paul Christiano, Wei Dai}, or whatever team would make you more optimistic about lack of memetic disaster. (we also wouldn’t need to use the same H at every level)
Yeah, at some point we’re basically simulating the alignment community (or possibly several copies thereof interacting with each other). There will probably be another post on that topic soonish.
Like Wei Dai, I think there’s a bunch of pretty big disanalogies with real-world examples that make me more hopeful than you:
1. Typical humans in typical bureaucracies do not seem at all aligned with the goals that the bureaucracy is meant to pursue.
2. Since you reuse one AI model for each element of the bureaucracy, doing prework to establish sophisticated coordinated protocols for the bureaucracy takes a constant amount of effort, whereas in human bureaucracies it would scale linearly with the number of people. As a result, with the same budget you can establish a much more sophisticated protocol with AI than with humans.
   - Looking at the results of the relay experiment, it intuitively feels like people could have done significantly better by designing a better protocol in advance and coordinating on it, though I wouldn’t say that with high confidence.
3. After a mere 100 iterations of iterated distillation and amplification where each agent can ask 2 subquestions, you are approximating a bureaucracy of 2^100 agents, which is wildly larger than any human bureaucracy and has qualitatively different strategies available to it. Probably it will be a relatively bad approximation, but the exponential scaling with linear iterations still seems pretty majorly different from human bureaucracies.
I think these disanalogies are driving most of the disagreement, rather than things like “not knowing about real-world evidence” or even “failing to anticipate results in simple cases we can test today”. For example, for the relay experiment you mention, at least I personally (and probably others) did in fact anticipate these results in advance. Here’s a copy of this comment of mine (as a Facebook comment it probably isn’t public, sorry), written before anyone had actually played a relay game (bold added now, to emphasize where it agrees with what actually happened):
(I think I would have been significantly more optimistic if each individual person had, say, 30 minutes of time, even if they were working on a relatively harder problem. I didn’t find any past quotes to that effect though. In any case that’s how I feel about it now.)
One question is why Ought did these experiments if they didn’t expect success? I don’t know what they expected but I do remember that their approach was very focused on testing the hardest cases (I believe in order to find the most shaky places for Factored Cognition, though my memory is shaky there), so I’m guessing they also thought a negative outcome was pretty plausible.
Why would this be any different for simulated humans or for human-mimicry based AI (which is what ~all of the problem-factorization-based alignment strategies I’ve seen are based on)?
This one I buy. Though if it’s going to be the key load-bearing piece which makes e.g. something HCH-like work better than the corresponding existing institutions, then it really ought to play a more central role in proposals, and testing it on humans now should be a high priority. (Some of Ought’s work roughly fits that, so kudos to them, but I don’t know of anyone else doing that sort of thing.)
Empirically it does not seem like bureaucracies’ problems get better as they get bigger. It seems like they get worse. And like, sure, maybe there’s a phase change if you go to really exponentially bigger sizes, but “maybe there’s a phase change and it scales totally differently than we’re used to and this happens to be a good thing rather than a bad thing” is the sort of argument you could make about anything, we really need some other reason to think that hypothesis is worth distinguishing at all.
Kudos for correct prediction!
Continuing in the spirit of expressing my highly uncharitable intuitions, my intuitive reaction to this is “hmm Rohin’s inner simulator seems to be working fine, maybe he’s just not actually applying it to picture what would happen in an actual bureaucracy when making changes corresponding to the proposed disanalogies”. On reflection I think there’s a strong chance you have tried picturing that, but I’m not confident, so I mention it just in case you haven’t. (In particular disanalogy 3 seems like one which is unlikely to work in our favor when actually picturing it, and my inner sim is also moderately skeptical about disanalogy 2.)
One more disanalogy:
4. the rest of the world pays attention to large or powerful real-world bureaucracies and force rules on them that small teams / individuals can ignore (e.g. Secret Congress, Copenhagen interpretation of ethics, startups being able to do illegal stuff), but this presumably won’t apply to alignment approaches.
One other thing I should have mentioned is that I do think the “unconscious economics” point is relevant and could end up being a major problem for problem factorization, but I don’t think we have great real-world evidence suggesting that unconscious economics by itself is enough to make teams of agents not be worthwhile.
Re disanalogy 1: I’m not entirely sure I understand what your objection is here but I’ll try responding anyway.
I’m imagining that the base agent is an AI system that is pursuing a desired task with roughly human-level competence, not something that acts the way a whole-brain emulation in a realistic environment would act. This base agent can be trained by imitation learning where you have the AI system mimic human demonstrations of the task, or by reinforcement learning on a reward model trained off of human preferences, but (we hope) is just trying to do the task and doesn’t have all the other human wants and desires. (Yes, this leaves a question of how you get that in the first place; personally I think that this distillation is the “hard part”, but that seems separate from the bureaucracy point.)
Even if you did get a bureaucracy made out of agents with human desires, it still seems like you get a lot of benefit from the fact that the agents are identical to each other, and so have less politics.
Re disanalogy 3: I agree that you have to think that a small / medium / large bureaucracy of Alices-with-15-minutes will at least slightly outperform an individual / small / medium bureaucracy of Alices-with-15-minutes before this disanalogy is actually a reason for optimism. I think that ends up coming from disanalogies 1, 2 and 4, plus some difference in opinion about real-world bureaucracies, e.g. I feel pretty good about small real-world teams beating individuals.
I mostly mention this disanalogy as a reason not to update too hard on intuitions like Can HCH epistemically dominate Ramanujan? and this SlateStarCodex post.
Yeah I have. Personally my inner sim feels pretty great about the combination of disanalogy 1 and disanalogy 2 -- it feels like a coalition of Rohins would do so much better than an individual Rohin, as long as the Rohins had time to get familiar with a protocol and evolve it to suit their needs. (Picturing some giant number of Rohins a la disanalogy 3 is a lot harder to do but when I try it mostly feels like it probably goes fine.)
I think a lot of alignment tax-imposing interventions (like requiring local work to be transparent for process-based feedback) could be analogous?
Hmm, maybe? There are a few ways this could go:
1. We give feedback to the model on its reasoning, and that feedback is bad in the same way that “the rest of the world pays attention and forces dumb rules on them” is bad.
2. “Keep your reasoning transparent” is itself a dumb rule that we force upon the AI system and that leads to terrible bureaucracy problems.
I’m unsure about (2) and mostly disagree with (1) (and I think you were mostly saying (2)).
Disagreement with (1): Seems like the disanalogy relies pretty hard on the rest of the world not paying much attention when they force bureaucracies to follow dumb rules, whereas we will presumably pay a lot of attention to how we give process-based feedback.
I was mostly thinking of the unconscious economics stuff.
I should have asked for a mental picture sooner, this is very useful to know. Thanks.
If I imagine a bunch of Johns, I think that they basically do fine, though mainly because they just don’t end up using very many Johns. I do think a small team of Johns would do way better than I do.
Yes I too have a rant along those lines from a post a while back, here it is:
I think we could play an endless and uninteresting game of “find a real-world example for / against factorization.”
To me, the more interesting discussion is around building better systems for updating on alignment research progress:
What would it look like for this research community to effectively update on results and progress?
What can we borrow from other academic disciplines? E.g. what would “preregistration” look like?
What are the ways more structure and standardization would be limiting / taking us further from truth?
What does the “institutional memory” system look like?
How do we coordinate the work of different alignment researchers and groups to maximize information value?
The problem with not using existing real-world examples as a primary evidence source is that we have far more bits-of-evidence from the existing real world, at far lower cost, than from any other source. Any method which doesn’t heavily leverage those bits necessarily makes progress at a pace orders of magnitude slower.
Also, in order for factorization to be viable for aligning AI, we need the large majority of real-world cognitive problems to be factorizable. So if we can find an endless stream of real-world examples of cognitive problems which humans are bad at factoring, then this class of approaches is already dead in the water.
What does “well” mean here? Like what would change your mind about this?
I have the opposite intuition from you: it’s clearly obvious that groups of people can accomplish things that individuals cannot; while there are inefficiencies from bureaucracy, those inefficiencies are regularly outweighed by the benefit having more people provides; and that benefit frequently comes from factorization (i.e. different parts of the company working on different bits of the same thing).
As one example: YCombinator companies have roughly linear correlation between exit value and number of employees, and basically all companies with $100MM+ exits have >100 employees. My impression is that there are very few companies with even $1MM revenue/employee (though I don’t have a data set easily available).
Two key points here.
First: a group of 100 people can of course get more done over a month than an individual, by expending 100 times as many man-hours as the individual. (In fact, simple argument: anything an individual can do in a month a group of 100 can also do in a month by just having one group member do the thing independently. In practice this doesn’t always work because people get really stupid in groups and might not think to have one person do the thing independently, but I think the argument is still plenty strong.) The question is whether the group can get as much done without any individual person doing a very large chunk of the work; each person should only need to do a small/simple task. That’s the point of problem factorization.
Second: the relevant question is not whether there exist factorizable problems; they clearly exist. (Assembly lines are proof of existence.) The question is whether there do not exist unfactorizable problems—more precisely, whether alignment can be solved without running into a single subproblem which humans cannot factor without missing some crucial consideration.
For more info on the sort of things which drive my intuition here, see Coordination as a Scarce Resource. If I suddenly found out that none of the examples in that post actually happened, or that they were all extremely unusual, then I’d mainly be very confused, but that would be the sort of thing which would potentially end in changing my mind about this.
I don’t think this is especially relevant, but I disagree with this picture on two counts. First, I think valuation tends to cause hiring, not vice versa—for instance, in google the very large majority of employees do not work on search, and the non-search employees account for a tiny fraction of the company’s income (at least as of last time I checked, which was admittedly a while ago). Second, Instagram: IIRC the company had 13 employees when it was acquired by Facebook for $1B. I would guess that there are plenty of very small $100M companies, we just don’t hear about them as often because few people have friends who work at them and they don’t need to publicize to raise capital.
Thanks! The point about existence proofs is helpful.
After thinking about this more, I’m just kind of confused about the prompt: Aren’t big companies by definition working on problems that can be factored? Because if they weren’t, why would they hire additional people?
Very late reply, reading this for the 2022 year in review.
So there are at least two different models which both yield this observation.
The first is that there are few people who can reliably create $1MM / year of value for their company, and so companies that want to increase their revenue have no choice but to hire more people in order to increase their profits.
The second is that it is entirely possible for a small team of people to generate a money fountain which generates billions of dollars in net revenue. However, once you have such a money fountain, you can get even more money out of it by hiring more people, comparative advantage style (e.g. people to handle mandatory but low-required-skill jobs to give the money-fountain-builders more time to do their thing). At equilibrium, companies will hire employees until the marginal increase in profit is equal to the marginal cost of the employee.
My crackpot quantitative model is that the speed with which a team can create value in a single domain scales with approximately the square root of the number of people on the team (i.e. a team of 100 will create 10x as much value as a single person). Low sample size but this has been the case in the handful of (mostly programming) projects I’ve been a part of as the number of people on the team fluctuates, at least for n between 1 and 100 on each project (including a project that started with 1, then grew to ~60, then dropped back down to 5).
Sure, I think everyone agrees that marginal returns to labor diminish with the number of employees. John’s claim though was that returns are non-positive, and that seems empirically false.
I don’t think “sandwiching” is best understood as a problem factorization experiment, though this is indeed one possible approach to improve performance in the sandwiching setting.
I prefer to define sandwiching as:
I think of sandwiching as the obvious way to assess a certain class of safety/alignment techniques rather than as a particularly opinionated approach.
I think the discussion here or possibly here presents a better perspective on sandwiching.
meta:
This seems to be almost exclusively based on the proxies of humans and human institutions. Reasons why this does not necessarily generalize to advanced AIs are often visible when looking from a perspective of other proxies, eg. programs or insects.
Sandwiching:
So far, progress in ML has often led to this pattern:
1. ML models sort of suck, maybe help a bit sometimes. Humans are clearly better (“humans better”).
2. ML models get overall comparable to humans, but have different strengths and weaknesses; human+AI teams beat both best AIs alone, or best humans alone (“age of cyborgs”)
3. Human inputs just get in the way of superior AI suggestions (“age of AIs”).
(Chess, go, creating nice images, and poetry seem to be at different stages of this sequence.)
This seems to lead to a different intuition than the lawyer-owner case.
Also: the designer-engineer and lawyer-owner problems both seem related to the communication bottleneck between two human brains.
If anyone has questions for Ought specifically, we’re happy to answer them as part of our AMA on Tuesday.
One successful example of factorization working is our immune system. The immune system does its job of defending the body without needing intelligence. In fact, every member of the immune system is blind, naked, and unintelligent. Your body has no knowledge of how many bacteria, viruses, or cancer cells are in it, what their doubling time is, or how many cells are infected. So the problem has to be factored for the immune system to do anything at all, and indeed it is.
The factorization basically comes down to different cells for different jobs, plus a few systems not connected to cells.
There are tens of different classes of cells, which can be divided into a few subclasses, plus 5 major classes of antibodies, and they all have different properties as well.
Now, what does this tell us about factorizing a problem, and are there any lessons for factorizing problems more generally?
The biggest reason factorization works for the immune system is that the body can make billions of cells per day. One of bureaucracy’s biggest problems is that we can’t simply copy the skillsets of people’s brains to lead bureaucracies, so we have to hire people, and even without Goodhart’s law this introduces interface problems between people. That leads to the next problem the immune system solves: coordination. The most taut constraint on companies is the rarity of talented people.
Each cell of the immune system has the same source code, so there are effectively no coordination problems at all: every cell has the same attitude, dedication, and abilities. That also partially solves the interface problem, since everyone has shared understanding and shared ontologies. Unfortunately, even if this solves the intra-organization problem, collaborating with others is still an unsolved problem. Again, this is something we can’t do with humans. The best-case scenario is hiring relatively competent people with different source codes, abilities, and ontologies, and then dealing with the interface problems that come from those differing source codes, beliefs, ontologies, and competencies. They also can’t fully trust each other, even if they’re relatively aligned, so you need constant communication, which scales fairly poorly with team size.
So should we be optimistic about HCH/Debate/Factored Cognition? Yes! One of the most massive advantages AGI will have over regular people early on is that, being digital, it can be copied very easily; the copies can cooperate and trust each other fully since they share the same ways of reasoning, and a single-mindedness towards goals mostly alleviates the most severe problems of trust. They will also share similar ontologies. So it’s easy to underestimate just how much AGI can dissolve problems like coordination and interface issues.
EDIT: I suspect a large part of the reason your intuition recoils against the HCH/Debate/Factored Cognition solutions is scope neglect. Our intuitions don’t work well with extremely big or small numbers, and a commenter once claimed that 100 distillation steps could produce 2^100 agents. To put it lightly, that is a bureaucracy of roughly 10^30 humans, with a correspondingly near-infinite budget, single-mindedly trying to answer questions. To put it another way, that’s more humans than have ever lived by a factor of roughly 10^19. And with perfect coordination, trust, single-mindedness towards goals, and a smooth interface due to shared ontologies (all because it is digital), such a bureaucracy could plausibly solve every problem in the entire universe. That matters because I suspect a large part of the problem for alignment is that, at the end of the day, capabilities groups have much more money and many more researchers available to them than safety groups, and capabilities researchers are, unsurprisingly, winning the race.
Our intuitions fail us here, so they aren’t a reliable guide to estimating how a very large HCH tree works.
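As a quick sanity check on those numbers, here is a back-of-the-envelope sketch (my own, assuming the commonly cited estimate that roughly 1.17×10^11 humans have ever been born):

```python
# Back-of-the-envelope check of the scope-neglect numbers above.
agents = 2 ** 100                  # agents after 100 distillation steps
humans_ever_born = 1.17e11         # commonly cited demographic estimate (~117 billion)

print(f"2^100 agents ~ {agents:.2e}")                                  # ~1.27e+30
print(f"ratio to humans ever born ~ {agents / humans_ever_born:.1e}")  # ~1.1e+19
```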
Yeah bio analogies! I love this comment.
One can trivially get (some variants of) HCH to do things by e.g. having each human manually act like a transistor, arranging them like a computer, and then passing in a program for an AGI to the HCH-computer. This, of course, is entirely useless for alignment.
There’s a whole class of problem factorization proposals which fall prey to that sort of issue. I didn’t talk about it in the post, but it sounds like it probably (?) applies to the sort of thing you’re picturing.
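To make that concrete, here is a minimal sketch of the “humans as transistors” construction (my own illustration, using NAND gates rather than transistors; each “human” only ever answers one fixed NAND question, and those answers can be wired into arbitrary circuits):

```python
# A toy version of the "humans as logic gates" HCH-computer: each node's
# entire job is to answer a single fixed NAND question, and composing those
# answers yields arbitrary boolean circuits (hence, in the limit, arbitrary
# computation). None of this contributes anything to alignment.

def human_as_nand(a: bool, b: bool) -> bool:
    """One HCH 'human' whose only task is answering: 'is NOT (a AND b) true?'"""
    return not (a and b)

def xor(a: bool, b: bool) -> bool:
    """XOR built purely out of NAND-answering 'humans'."""
    n1 = human_as_nand(a, b)
    return human_as_nand(human_as_nand(a, n1), human_as_nand(b, n1))

# Sanity check: the composed tree really computes XOR.
assert [xor(a, b) for a, b in [(False, False), (False, True), (True, False), (True, True)]] \
    == [False, True, True, False]
```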
It’s not useless, but it’s definitely risky, and the requirements for safety would mean distillation has to be very cheap. And here we come to the question: “How doomed by default are we if AGI is created?” If the chance of doom is low, I agree it’s not a risk worth taking. If high, then you’ll probably have to do it. The more doomed by default you think creating AGI is, the more risk you should take, especially with short timelines. So MIRI would probably want to do this, given their atypically high levels of doominess, but most other organizations probably won’t, since they expect fairly low risk from AGI.
I do think humans can work together. I think they can work together on projects large enough that no single one of them could complete it. I don’t think anyone is disputing this. Somewhat more speculatively, I expect that copies of a sufficiently generally skilled proto-AGI could be swapped into many of the roles of a large team project.
I think the closer this proto-AGI is to being fully as general and competent in all areas that humans can be competent, the more roles you will be able to swap it in for.
The question then is about how much you can simplify and incrementalize the teamwork process before it stops working. And which of these roles you can then swap a today-level proto-AGI into. There’s definitely some overhead in splitting up tasks, so the smaller the pieces the task is broken into, the more inefficiency you should expect to accrue.
Well, Ought gathered some data about a point on the spectrum of role division that is definitely too simple/unstructured/incremental to be functional. I don’t think they need to find the exact minimum point on this spectrum for the research agenda to proceed. So my personal recommendation would be to aim at something which seems more likely to be on the functional side of the spectrum.
Here’s my guess at a plan which covers the necessary structural aspects (a rough code sketch follows the outline):
Planning stage
A ‘broad-vision solution suggester’, followed by a ‘specific task and task-order planner’: define the inputs and outputs of the planned tasks, specify what constitutes success for each part, and order tasks based on dependencies.
Building stage
For each task: attempt to complete it, then declare success or failure. If success, describe the specific operating parameters of the result (which might exceed the minimum specifications).
If failure is declared, return to the planning stage with the goal of planning a way around the encountered block, then return to the building stage to implement the new plan.
Integration stage
Integration specialist: does it all work together as planned? If not, return to planning with the goal of creating new tasks that make these pieces work together, then proceed to the building stage to implement the new plan.
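Here is the rough code sketch of how that loop might be wired up (my own illustration; `suggest_plan`, `attempt_task`, and `check_integration` are hypothetical stand-ins for the planner, builder, and integration-specialist roles):

```python
# Rough sketch of the plan -> build -> integrate loop outlined above.
# suggest_plan, attempt_task, and check_integration are hypothetical
# stand-ins for the agent roles, not a real API.

def run_project(goal, suggest_plan, attempt_task, check_integration,
                max_rounds: int = 10):
    plan = suggest_plan(goal, [])                    # planning stage
    for _ in range(max_rounds):
        blockers = []
        for task in plan:                            # building stage
            ok, result = attempt_task(task)          # declare success or failure
            if not ok:
                blockers.append((task, result))      # record the encountered block
                break
        if blockers:
            plan = suggest_plan(goal, blockers)      # plan a way around the block
            continue
        if check_integration(plan):                  # integration stage
            return plan                              # it all works together as planned
        plan = suggest_plan(goal, ["integration failure"])
    raise RuntimeError("exceeded max planning/building rounds")
```

The point is just that the control flow is a loop through planning, building, and integration, with failures routed back to planning rather than silently dropped.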
The agents at the top of most theoretical infinite bureaucracies should be thought of as already superhumanly capable and aligned, not weak language models, because IDA works by iteratively retraining models on the output of the bureaucracy, so agents at higher levels of the theoretical infinite bureaucracy are stronger (coming from later amplification/distillation epochs) than those at lower levels. It doesn’t matter if an infinite bureaucracy instantiated for a certain agent fails to solve important problems, as long as the next epoch does better.
For HCH specifically, this is normally intended to apply to the HCHs, not to humans in it, but then the abstraction of humans being actual humans (exact imitations) leaks, and we start expecting something other than actual humans there. If this is allowed, if something less capable/aligned than humans can appear in HCH, then by the same token these agents should improve with IDA epochs (perhaps not of HCH, but of other bureaucracies) and those “humans” at the top of an infinite HCH should be much better than the starting point, assuming the epochs improve things.
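A minimal sketch of the epoch structure being described (my own illustration; `amplify` and `distill` are hypothetical stand-ins, not a real API):

```python
# Rough sketch of IDA's epoch structure: each epoch amplifies the current
# model by running it inside a bureaucracy of copies, then distills the
# bureaucracy's behavior back into a (hopefully stronger) model. Agents at
# higher levels of the "infinite bureaucracy" correspond to later epochs.

def ida(model, amplify, distill, epochs: int):
    history = [model]
    for _ in range(epochs):
        bureaucracy = amplify(model)   # e.g. an HCH-style tree built from copies of `model`
        model = distill(bureaucracy)   # retrain a new model on the bureaucracy's outputs
        history.append(model)
    return model, history              # later entries stand in for higher bureaucracy levels
```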
On the arbitrarily big bureaucracy: the real reason it works is that, by assumption, we can always add more agents, and can thus simulate arbitrary computations (the bureaucracy is Turing-complete). Once that assumption is removed, the next question is: is distillation cheap?
If it is, such that I can distill hundreds or thousands of layers, it’s ludicrously easy to solve the alignment problem, even with pessimistic views on AI bureaucracies/debate.
If I can distill 25-100 layers, it’s still likely possible to solve the alignment problem, though at the lower end I’ll probably disagree with John Wentworth on how optimistic to be about bureaucracies/debate for solving alignment.
Below 20-25 layers, John Wentworth’s intuition will probably disagree with mine on how useful AI bureaucracies/debate are for solving alignment. Specifically, he’d almost certainly think that such a bureaucracy couldn’t work at all compared to independent researchers. I view AI and human bureaucracies as sufficiently disanalogous that the problems of human bureaucracies aren’t likely to carry over. My take is that with just 20 distillation layers you’d have a fair chance of solving the whole problem, and that only 10 layers are necessary to contribute usefully to AI alignment.
Since I recently wrote an article endorsing Factorization as an alignment approach, I feel like I should respond here.
Everyone who proposes Factorization agrees there is a tradeoff between factorization and efficiency. The question is, how bad is that tradeoff?
Factorization is not a solution to the problem of general intelligence. However, there are a lot of problems that we should reasonably expect can be factorized.
Each human having 5 minutes with a Google doc does not seem like a good way to factorize problems.
John seems wrongly pessimistic about the “Extremely Long Jury Trial”. We know from math that “you prove something, I check your work” is an extremely powerful framework. I would expect this to be true in real life as well.
Huh, really? I think my impression from talking to Paul over the years was that it sort of was. [Like, if your picture of the human brain is that it’s a bunch of neurons factorizing the problem of being a human, this sort of has to work.]
This seems true, but not central. It’s more like generalized prompt engineering, bureaucracies amplify certain aspects of behavior to generate data for models better at (or more robustly bound by) those aspects. So this is useful even if the building blocks are already AGIs, except in how deceptive alignment could make that ineffective. The central use is to amplify alignment properties of behavior with appropriate bureaucracies, retraining the models with their output.
If applied to capability at solving problems, this is a step towards AGI (and marks the approach as competitive). My impression is that Paul believes this application feasible to a greater extent than most other people, and by extension expects other bureaucracies to do sensible things at a lower capability of an agent. But this is more plausible than a single infinite bureaucracy working immediately with a weak agent, because all it needs is to improve things at each IDA cycle, making agents in the bureaucracy a little bit more capable and aligned, even if they are still a long way from solving difficult problems.
These examples conflate “what the human who provided the task to the AI+human combined system wants” with “what the human who is working together with the AI wants” in a way that I think is confusing and sort of misses the point of sandwiching. In sandwiching, “what the human wants” is implicit in the choice of task, but the “what the human wants” part isn’t really what is being delegated or factored off to the human who is working together with the AI; what THAT human wants doesn’t enter into it at all. Using Cotra’s initial example to belabor the point: if someone figured out a way to get some non-medically-trained humans to work together with a mediocre medical-advice-giving AI in such a way that the output of the combined human+AI team is actually good medical advice, it doesn’t matter whether those non-medically-trained humans actually care that the result is good medical advice; they might not even individually know what the purpose of the system is, and just be focused on whatever their piece of the task is—say, verifying the correctness of individual steps of a chain of reasoning generated by the system, or checking that each step logically follows from the previous, or whatever. Of course this might be really time intensive, but if you can improve even slightly on the performance of the original mediocre system, then hopefully you can train a new AI system to match the performance of the original AI+human system by imitation learning, and bootstrap from there.
The point, as I understand it, is that if we can get human+AI systems to progress from “mediocre” to “excellent” (in other words, to remain aligned with the designer’s goal) -- despite the fact that the only feedback involved is from humans who wouldn’t even be mediocre at achieving the designer’s goal if they were asked to do it themselves—and if we can do it in a way that generalizes across all kinds of tasks, then that would be really promising. To me, it seems hard enough that we definitely shouldn’t take a few failed attempts as evidence that it can’t be done, but not so hard as to seem obviously impossible.
I at least partially buy this, but it seems pretty easy to update the human analogies to match what you’re saying. Rather than analogizing to e.g. a product designer + software engineer, we’d analogize to the tech company CEO trying to build some kind of product assembly line which can reliably produce good apps without any of the employees knowing what the product is supposed to be. Which still seems like something for which there’s already immense economic pressure, and we still generally can’t do it well for most cognitive problems (although we can do it well for most manufacturing problems).
Thanks, I agree that’s a better analogy. Though of course, the employees (participants in a sandwiching project) don’t have to be unaware of the CEO’s (sandwiching project overseer’s) goal; I was only highlighting that they need not necessarily be aware of it, in order to make it clear that the goals of the human helpers/judges aren’t especially relevant to what sandwiching, debate, etc. is really about. But of course, if it turns out that having the human helpers know what the ultimate goal is helps, then they’re absolutely allowed to be in on it...
Perhaps this is a bit glib, but arguably some of the most profitable companies in the mobile game space have essentially built product assembly lines to churn out fairly derivative games that are nevertheless unique enough to do well on the charts, and they absolutely do it by factoring the project of “making a game” into different bits that are done by different people (programmers, artists, voice actors, etc.), some of whom might not have any particular need to know what the product will look like as a whole to play their part.
However, I don’t want to press too hard on this game example, as you may or may not consider this ‘cognitive work’ and as it has other disanalogies with what we are actually talking about here. And to a certain degree I share your intuition that factoring certain kinds of tasks is probably very hard: if it weren’t, we might expect to see a lot more non-manufacturing companies whose main employee base consists of assembly lines (or hierarchies of assembly lines, or whatever) requiring workers with general intelligence but few specialized rare skills, which I think is the broader point you’re making in this comment. I think that’s right, although I also think there are reasons for this that go beyond just the difficulty of task factorization, and which don’t all apply in the HCH etc. case, as some other commenters have pointed out.
2 years and 2 days later, in your opinion, has what you predicted in your conclusion happened?
(I’m just a curious bystander; I have no idea if there are any camps regarding this issue, but if so, I’m not a member of any of them.)
The most recent thing I’ve seen on the topic is this post from yesterday on debate, which found that debate does basically nothing. In fairness there have also been some nominally-positive studies (which the linked post also mentions), though IMO their setup is more artificial and their effect sizes are not very compelling anyway.
My qualitative impression is that HCH/debate/etc have dropped somewhat in relative excitement as alignment strategies over the past year or so, more so than I expected. People have noticed the unimpressive results to some extent, and also other topics (e.g. mechinterp, SAEs) have gained a lot of excitement. That said, I do still get the impression that there’s a steady stream of newcomers getting interested in it.
Do you think factorization/debate would work for math? That is, do you think that to determine the truth of an arbitrarily complex mathematical argument with high probability, you or I could listen to a (well-structured) debate of superintelligences?
I ask because I’m not sure whether you think factorization just doesn’t work, or alignment-relevant propositions are much harder to factor than math arguments. (My intuition is that factorization works for math; I have little intuition for the difficulty of factoring alignment-relevant propositions but I suspect some are fully factorable.)
I’d be a lot more optimistic about it for math than for anything touching the real world.
Also, there are lots of real-world places where factorization is known to work well. Basically any competitive market with lots of interchangeable products corresponds to a good factorization of some production problem. Production lines, similarly, are good factorizations. The issue is that we can’t factor problems in general, i.e. there are still lots of problems we can’t factor well, and using factorization as our main alignment strategy requires fairly general factorizability (since we have to factor all the sub-problems of alignment recursively, which is a whole lot of subproblems, and it only takes one non-human-factorable subproblem to mess it all up).
Shameless plug of my uncharitable criticism that I believe has a similar vibe: ELK shaving