My AI Model Delta Compared To Christiano
Preamble: Delta vs Crux
This section is redundant if you already read My AI Model Delta Compared To Yudkowsky.
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution—in other words, we might have very different beliefs about lots of stuff in the world.
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
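To make that concrete, here is a minimal toy sketch (my own illustration, not from the post; the model and numbers are invented): two copies of the same program that differ only in one parameter, plus a do()-style override that collapses the difference.

```python
def run_model(growth_rate, years=10):
    """Shared program: many downstream variables all depend on one parameter."""
    population = 100.0
    beliefs = {}
    for year in range(years):
        population *= growth_rate
        beliefs[f"population_year_{year}"] = round(population, 1)
    beliefs["ever_exceeds_1000"] = population > 1000
    return beliefs

my_model = lambda: run_model(growth_rate=5.0)    # I think the parameter is 5
your_model = lambda: run_model(growth_rate=0.3)  # you think it's 0.3

# Identical code, one differing parameter, yet lots of downstream "beliefs" differ:
assert my_model()["ever_exceeds_1000"] != your_model()["ever_exceeds_1000"]

# The delta as a do()-style intervention: pin that one parameter, and the two
# models now agree about everything downstream.
do_my_model = lambda: run_model(growth_rate=0.3)
assert do_my_model() == your_model()
```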
This post is about my current best guesses at the delta between my AI models and Paul Christiano’s AI models. When I apply the delta outlined here to my models, and propagate the implications, my models mostly look like Paul’s as far as I can tell. That said, note that this is not an attempt to pass Paul’s Intellectual Turing Test; I’ll still be using my own usual frames.
My AI Model Delta Compared To Christiano
Best guess: Paul thinks that verifying solutions to problems is generally “easy” in some sense. He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.
What do my models look like if I propagate that delta? Well, it implies that delegation is fundamentally viable in some deep, general sense.
That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for—the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better. But because the badness is nonobvious/nonsalient, it doesn’t influence my decision-to-buy, and therefore companies producing the good are incentivized not to spend the effort to make it better. It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad. (For a more game-theoretic angle, see When Hindsight Isn’t 20⁄20.)
On (my model of) Paul’s worldview, that sort of thing is rare; at most it’s the exception to the rule. On my worldview, it’s the norm for most goods most of the time. See e.g. the whole air conditioner episode for us debating the badness of single-hose portable air conditioners specifically, along with a large sidebar on the badness of portable air conditioner energy ratings.
How does the ease-of-verification delta propagate to AI?
Well, most obviously, Paul expects AI to go well mostly via humanity delegating alignment work to AI. On my models, the delegator’s incompetence is a major bottleneck to delegation going well in practice, and that will extend to delegation of alignment to AI: humans won’t get what we want by delegating because we don’t even understand what we want or know what to pay attention to. The outsourced alignment work ends up bad in nonobvious/nonsalient (but ultimately important) ways for the same reasons as most goods in my house. But if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.
Then we can go even more extreme: HCH, aka “the infinite bureaucracy”, a model Paul developed a few years ago. In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve. On my models, HCH adds recursion to the universal pernicious difficulties of delegation, and my main response is to run away screaming. But on Paul’s models, delegation is fundamentally viable, so why not delegate recursively?
(Also note that HCH is a simplified model of a large bureaucracy, and I expect my views and Paul’s differ in much the same way when thinking about large organizations in general. I mostly agree with Zvi’s models of large organizations, which can be lossily-but-accurately summarized as “don’t”. Paul, I would guess, expects that large organizations are mostly reasonably efficient and reasonably aligned with their stakeholders/customers, as opposed to universally deeply dysfunctional.)
Propagating further out: under my models, the difficulty of verification accounts for most of the generalized market inefficiency in our world. (I see this as one way of framing Inadequate Equilibria.) So if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit. That, in turn, has a huge effect on timelines. Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research). That’s the sort of thing which leads to a relatively discontinuous takeoff. Paul, on the other hand, expects a relatively smooth takeoff—which makes sense, in a world where there’s not a lot of low-hanging fruit in the software/algorithms because it’s easy for users to notice when the libraries they’re using are trash.
That accounts for most of the known-to-me places where my models differ from Paul’s. I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch. (I do still put significantly-nonzero probability on successful outsourcing of most alignment work to AI, but it’s not the sort of thing I expect to usually work.)
I was surprised to read the delta propagating to so many different parts of your worldviews (organizations, goods, markets, etc), and that makes me think that it’d be relatively easier to ask questions today that have quite different answers under your worldviews. The air conditioner one seems like one, but it seems like we could have many more, and some that are even easier than that. Plausibly you know of some because you’re quite confident in your position; if so, I’d be interested to hear about them[1].
At a meta level, I find it pretty funny that so many smart people seem to disagree on the question of whether questions usually have easily verifiable answers.
I realize that part of your position is that this is just really hard to actually verify, but as in the example of objects in your room it feels like there should be examples where this is feasible with moderate amounts of effort. Of course, a lack of consensus on whether something is actually bad if you dive in further could also be evidence for hardness of verification, even if it’d be less clean.
Yeah, I think this is very testable, it’s just very costly to test—partly because it requires doing deep dives on a lot of different stuff, and partly because it’s the sort of model which makes weak claims about lots of things rather than very precise claims about a few things.
And at a twice-meta level, that’s strong evidence for questions not generically having verifiable answers (though not for them generically not having those answers).
(That’s what I meant, though I can see how I didn’t make that very clear.)
So on the Ω-meta-level you need to correct weakly in the other direction again.
To some extent, this is all already in Jozdien’s comment, but:
It seems that the closest thing to AIs debating alignment (or providing hopefully verifiable solutions) that we can observe is human debate about alignment (and perhaps also related questions about the future). Presumably John and Paul have similar views about the empirical difficulty of reaching agreement in the human debate about alignment, given that they both observe this debate a lot. (Perhaps they disagree about what people’s level of (in)ability to reach agreement / verify arguments implies for the probability of getting alignment right. Let’s ignore that possibility...) So I would have thought that even w.r.t. this fairly closely related debate, the disagreement is mostly about what happens as we move from human to superhuman-AI discussants. In particular, I would expect Paul to concede that the current level of disagreement in the alignment community is problematic and to argue that this will improve (enough) if we have superhuman debaters. If even this closely related form of debate/delegation/verification process isn’t taken to be very informative (by at least one of Paul and John), then it’s hard to imagine that much more distant delegation processes (such as those behind making computer monitors) are very informative to their disagreement.
I think it depends on which domain you’re delegating in. E.g. physical objects, especially complex systems like an AC unit, are plausibly much harder to validate than a mathematical proof.
In that vein, I wonder if requiring the AI to construct a validation proof would be feasible for alignment delegation? In that case, I’d expect us to find more use and safety from [ETA: delegation of] theoretical work than empirical.
That seems a lot like Davidad’s alignment research agenda.
First, I want to flag that I really appreciate how you’re making these deltas clear and (fairly) simple.
I like this, though I feel like there’s probably a great deal more clarity/precision to be had here (as is often the case).
I’m not sure what “bad” means exactly. Do you basically mean, “if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved?” If so, I assume we’d all agree that this is true for some amount R, the key question is what that amount is.
I also would flag that you draw attention to the issue with air conditioners. But for the issue of personal items, I’d argue that when I learn more about popular items, most of what I learn are positive things I didn’t realize. Like with Chesterton’s fence—when I get many well-reviewed or popular items, my impression is generally that there were many clever ideas or truths behind those items that I don’t at all have time to understand, let alone invent myself. A related example is cultural knowledge—a la The Secret of Our Success.
When I try out software problems, my first few attempts don’t go well for reasons I didn’t predict. The very fact that “it works in tests, and it didn’t require doing anything crazy” is a significant update.
Sure, with enough resources R, one could very likely make significant improvements to any item in question—but as a purchaser, I only have resources r << R to make my decisions. My goal is to buy items to make my life better, it’s fine that there are potential other gains to be had by huge R values.
> “verification is easier than generation”
I feel like this isn’t very well formalized. I think I agree with this comment on that post. I feel like you’re saying, “It’s easier to generate a simple thing than verify all possible things”, but Paul and co are saying more like, “It’s easier to verify/evaluate a thing of complexity C than generate a thing of complexity C, in many important conditions”, or, “There are ways of delegating many tasks where the evaluation work required would be less than that of doing the work yourself, in order to get a result of a certain level of quality.”
I think that Paul’s take (as I understand it) seems like a fundamental aspect about the working human world. Humans generally get huge returns from not inventing the wheel all the time, and deferring to others a great deal. This is much of what makes civilization possible. It’s not perfect, but it’s much better than what individual humans could do by themselves.
> Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research)
I appreciate the precise prediction, but don’t see how it exactly follows. This seems more like a question of “how much better will early AIs be compared to current humans”, than one deeply about verification/generation. Also, I’d flag that in many worlds, I’d expect that pre-AGI AIs could do a lot of this code improvement—or they already have—so it’s not clear exactly how much work the “autonomously” is doing here.
---
I feel like there are probably several wins to be had by formalizing these concepts better. They seem fairly cruxy/high-delta in the debates on this topic.
I would naively approach some of this with some simple expected value/accuracy lens. There are many assistants (including AIs) that I’d expect would improve the expected accuracy on key decisions, like knowing which AI systems to trust. In theory, it’s possible to show a bunch of situations where delegation would be EV-positive.
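For concreteness, here is a toy version of that expected-value lens (all probabilities and payoffs below are made up purely for illustration, not claims about actual AI systems):

```python
def ev_of_delegating(p_misaligned, payoff_if_honest, loss_if_deceived,
                     p_catch_bad_output):
    """Expected value of acting on a delegate's output, given some chance the
    delegate is misaligned and some chance we verify/catch a bad output."""
    p_act_on_bad = p_misaligned * (1 - p_catch_bad_output)
    p_act_on_good = 1 - p_misaligned  # assume an honest output is always usable
    return p_act_on_good * payoff_if_honest - p_act_on_bad * loss_if_deceived

# Easy-to-verify task: delegation looks clearly EV-positive under these numbers.
print(ev_of_delegating(p_misaligned=0.3, payoff_if_honest=10,
                       loss_if_deceived=100, p_catch_bad_output=0.99))  # ≈ 6.7

# Hard-to-verify task: the same delegate becomes EV-negative.
print(ev_of_delegating(p_misaligned=0.3, payoff_if_honest=10,
                       loss_if_deceived=100, p_catch_bad_output=0.5))   # ≈ -8.0
```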
That said, a separate observer could of course claim that one using the process above would be so wrong as to be committing self-harm. Like, “I think that when you would try to use delegation, your estimates of impact are predictably wrong in ways that would lead to you losing.” But this seems like mainly a question about “are humans going to be predictably overconfident in a certain domain, as seen by other specific humans”.
Thinking about this more, it seems like there are some key background assumptions that I’m missing.
Some assumptions that I often hear get presented on this topic are things like:
1. “A misaligned AI will explicitly try to give us hard-to-find vulnerabilities, so verifying arbitrary statements from these AIs will be incredibly hard.”
2. “We need to generally have incredibly high assurances to build powerful systems that don’t kill us”.
My obvious counter-arguments would be:
1. Sure, but smart overseers would have a reasonable prior that such agents might be misaligned, and would also give these agents tasks that are particularly easy to verify. Any action actually taken by a smart overseer, using information provided by another agent with some known probability M of being misaligned, should be EV-positive. With some creativity, there are likely many ways of structuring things (using systems unlikely to be misaligned, using more verifiable questions) such that many resulting actions will be heavily EV-positive.
2. Again, see my argument in (1). Also, we can build these systems gradually, and with a lot of help from people/AIs that won’t require such high assurances. (This is similar to the HCH / oversight arguments.)
I think an interesting version of this is “if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved (even when factoring in additional cost) that the production team probably already knew about”.
I expect you still believe P != NP?
Yes, though I would guess my probability on P = NP is relatively high compared to most people reading this. I’m around 10-15% on P = NP.
Notably relevant:
Do you expect A.G.I. to be solving problems outside of NP? If not, it seems the relevant follow-up question is really out of the problems that are in NP, how many are in P?
Actually, my intuition is that deep learning systems cap out around P/poly, which probably strictly contains NP, meaning (P/poly) \ NP may be hard to verify, so I think I agree with you.
Most real-world problems are outside of NP. Let’s go through some examples...
Suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values). Can I easily write down a boolean circuit (possibly with some inputs from data on fridges) which is satisfiable if-and-only-if this fridge in particular is in fact the best option for me according to my own long-term values? No, I have no idea how to write such a boolean circuit at all. Heck, even if my boolean circuit could internally use a quantum-level simulation of me, I’d still have no idea how to do it, because neither my stated values nor my revealed preferences are identical to my own long-term values. So that problem is decidedly not in NP.
(Variant of that problem: suppose an AI hands me a purported mathematical proof that this fridge in particular is the best option for me according to my own long-term values. Can I verify the proof’s correctness? Again, no, I have no idea how to do that, I don’t understand my own values well enough to distinguish a proof which makes correct assumptions about my values from one which makes incorrect assumptions.)
A quite different example from Hindsight Isn’t 20⁄20: suppose our company has 100 workers, all working to produce a product. In order for the product to work, all 100 workers have to do their part correctly; if even just one of them messes up, then the whole product fails. And it’s an expensive one-shot sort of project; we don’t get to do end-to-end tests a billion times. I have been assigned to build the backup yellow connector widget, and I do my best. The product launches. It fails. Did I do my job correctly? No idea, even in hindsight; isolating which parts failed would itself be a large and expensive project. Forget writing down a boolean circuit in advance which is satisfiable if-and-only-if I did my job correctly; I can’t even write down a boolean circuit in hindsight which is satisfiable if-and-only-if I did my job correctly. I simply don’t have enough information to know.
Another kind of example: I read a paper which claims that FoxO mediates the inflammatory response during cranial vault remodelling surgery. Can I easily write down a boolean circuit (possibly with some inputs from the paper) which is satisfiable if-and-only-if the paper’s result is basically correct? Sure, it could do some quick checks (look for p-hacking or incompetently made-up data, for example), but from the one paper I just don’t have enough information to reliably tell whether the result is basically correct.
Another kind of example: suppose I’m building an app, and I outsource one part of it. The contractor sends me back a big chunk of C code. Can I verify that (a) the C code does what I want, and (b) the C code has no security holes? In principle, formal verification tools advertise both of those. In practice, expressing what I want in a formal verification language is as-much-or-more-work as writing the code would be (assuming that I understand what I want well enough to formally express it at all, which I often don’t). And even then, I’d expect someone who’s actually good at software security to be very suspicious of the assumptions made by the formal verifier.
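To illustrate that last point with a toy example (my own, not a real contract): even for a trivial task like sorting, fully stating “what I want” as an executable spec is already comparable in size to just doing the work, and realistic requirements (“no security holes”, “does what I actually want”) are far harder to state than the code is to write.

```python
from collections import Counter

def my_sort(xs):
    # The outsourced work itself: one line.
    return sorted(xs)

def meets_spec(inp, out):
    # The "spec", written as an executable checker: the output must be
    # (a) in non-decreasing order and (b) a permutation of the input.
    # Already about as much code as the implementation, and this is the
    # easy case where we know exactly what we want.
    in_order = all(a <= b for a, b in zip(out, out[1:]))
    same_multiset = Counter(inp) == Counter(out)
    return in_order and same_multiset

assert meets_spec([3, 1, 2], my_sort([3, 1, 2]))
```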
I think the issues here are more conceptual than algorithmic.
The conceptual vagueness certainly doesn’t help, but in general generation can be easier than validation because when generating you can stay within a subset of the domain that you understand well, whereas when verifying you may have to deal with all sorts of crazy inputs.
Attempted rephrasing: you control how you generate things, but not how others do, so verifying their generations can expose you to stuff you don’t know how to handle.
Example:
“Writing code yourself is often easier than validating someone else’s code”
I think a more nuanced take is that there is a subset of generated outputs that are hard to verify. This subset is split into two camps: one where you are unsure of the output’s correctness (and thus can reject/ask for an explanation). This isn’t too risky. The other camp is ones where you are sure but in reality overlook something. That’s the risky one.
However at least my priors tell me that the latter is rare with a good reviewer. In a code review, if something is too hard to parse, a good reviewer will ask for an explanation or simplification. But bugs still slip by so it’s imperfect.
The next question is whether the bugs that slip by in the output will be catastrophic. I don’t think it dooms the generation + verification pipeline if the system is designed to be error tolerant.
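To put rough toy numbers on that intuition (all probabilities below are invented for illustration, not anyone’s actual estimates):

```python
p_bug = 0.10             # a generated output contains a serious flaw
p_review_miss = 0.20     # the reviewer misses it (the "sure but wrong" case)
p_safeguard_miss = 0.30  # each extra layer (tests, monitoring, ...) also misses it

for extra_layers in (0, 1, 2):
    p_catastrophe = p_bug * p_review_miss * p_safeguard_miss ** extra_layers
    print(extra_layers, round(p_catastrophe, 4))
# 0 layers: 0.02, 1 layer: 0.006, 2 layers: 0.0018 per output -- assuming,
# heroically, that the layers fail independently, which is exactly the
# assumption an adversarial generator would try to break.
```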
I’d like to try another analogy, which makes some potential problems for verifying output in alignment more legible.
Imagine you’re a customer and ask a programmer to make you an app. You don’t really know what you want, so you give some vague design criteria. You ask the programmer how the app works, and they tell you, and after a lot of back and forth discussion, you verify this isn’t what you want. Do you know how to ask for what you want, now? Maybe, maybe not.
Perhaps the design space you’re thinking of is small, perhaps you were confused in some simple way that the discussion resolved, perhaps the programmer worked with you earnestly to develop the design you’re really looking for, and pointed out all sorts of unknown unknowns. Perhaps.
I think we could wind up in this position. The position of a non-expert verifying an experts’ output, with some confused and vague ideas about what we want from the experts. We won’t know the good questions to ask the expert, and will have to rely on the expert to help us. If ELK is easy, then that’s not a big issue. If it isn’t, then that seems like a big issue.
I feel like a lot of the difficulty here is a punning of the word “problem.”
In complexity theory, when we talk about “problems”, we generally refer to a formal mathematical question that can be posed as a computational task. Maybe in these kinds of discussions we should start calling these problems_C (for “complexity”). There are plenty of problems_C that are (almost definitely) not in NP, like #SAT (“count the number of satisfying assignments of this Boolean formula”), and it’s generally believed that verification is hard for these problems. A problem_C like #SAT that is (believed to be) in #P but not NP will often have a short easy-to-understand algorithm that will be very slow (“try every assignment and count up the ones that satisfy the formula”).
On the other hand, “suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)” is a very different sort of beast. I agree it’s not in NP in that I can’t easily verify a solution, but the issue is that it’s not a problem_C, rather than it being a problem_C that’s (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task using exponential amounts of compute; for “choose a refrigerator”, I can’t describe any computational process that will solve it at all. If I could (for instance, if I could write down an evaluation function f : fridge → R (where f was computable in P)), then the problem would be not only in NP but in P (evaluate each fridge, pick the best one).
So it’s not wrong to say that “choose a refrigerator” is not (known to be) in NP, but it’s important to foreground that that’s because the task isn’t written as a problem_C, rather than because it needs a lot of compute. So discussions about complexity classes and relative ease of generation and verification seem not especially relevant.
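A quick sketch of that distinction in code (my own toy example): #SAT has a short, obvious, exponentially slow algorithm, while “pick the best fridge for my long-term values” can’t even be brute-forced, because there is no evaluation function to hand to the brute-forcer.

```python
from itertools import product

def count_sat(formula, n_vars):
    """#SAT by brute force: formula maps an assignment (tuple of bools) to bool.
    Runs in O(2^n): easy to describe, hopeless to run at scale."""
    return sum(1 for assignment in product([False, True], repeat=n_vars)
               if formula(assignment))

# (x1 OR x2) AND (NOT x1 OR x3)
f = lambda a: (a[0] or a[1]) and ((not a[0]) or a[2])
print(count_sat(f, 3))  # 4 satisfying assignments

def best_fridge(fridges, value_function):
    """If we *could* write down value_function (computable in P), the task would
    be easy: evaluate each fridge, pick the best one. The hard part is that
    nobody knows how to write value_function for 'my long-term values'."""
    return max(fridges, key=value_function)
```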
I don’t think I’m saying anything non-obvious, but I also think I’m seeing a lot of discussions that don’t seem to fully internalize this?
PCP-type theorems show that the class of problems with probabilistically checkable, polynomial-time verifications contains NEXP, so, in some sense, there is a very large class of problems that can be “easily” verified.
I think the whole “verification is easier than generation because of computational complexity theory” line of reasoning is misguided. The problem is not whether we have enough computing power to verify solution, it is that we have no idea how to verify solution.
Curated!
This post is strong in the rationalist virtue of simplicity. There is a large body of quite different research and strategic analysis of the AI x-risk situation between Wentworth and Christiano, and yet this post claims (I think fairly accurately) that much of it can be well captured in one key worldview-difference. The post does a good job of showing how this difference appears in many situations/cases (e.g. the air conditioning unit, large bureaucracies, outsourcing alignment, etc).
I encourage someone who takes the opposing side of this position from John (e.g. someone at the Alignment Research Center) to provide a response, as to whether they think this characterization is accurate (and if yes, why they disagree).
I don’t think this characterization is accurate at all, but don’t think I can explain the disagreement well enough for it to be productive.
I had interpreted your initial comment to mean “this post doesn’t accurately characterize Paul’s views” (as opposed to “John is confused/wrong about the object level of ‘is verification easier than generation’ in a way that is relevant for modeling AI outcomes”)
I think your comment elsethread was mostly commenting on the object level. I’m currently unsure if your line “I don’t think this characterization is accurate at all” was about the object level, or about whether this post successfully articulates a difference in Paul’s views vs John’s.
I think both that:
this is not a good characterization of Paul’s views
verification is typically easier than generation and this fact is important for the overall picture for AI risk
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense of the argument John claims to be making in the post:
the motte: there exist hard to verify properties
the bailey: all/most important properties are hard to verify
I don’t think I am trying to claim that bailey at all. For purposes of AI risk, if there is even just one single property of a given system which is both (a) necessary for us to not die to that system, and (b) hard to verify, then difficulty of verification is a blocking issue for outsourcing alignment of that system.
Standard candidates for such properties include:
Strategic deception
Whether the system builds a child AI
Whether the system’s notion of “human” or “dead” or [...] generalizes in a similar way to our notions
… actually, on reflection, there is one version of the bailey which I might endorse: because easy-to-verify properties are generally outsourceable, whenever some important property is hard to verify, achieving that hard-to-verify property is the main bottleneck to solving the problem.
I don’t think one actually needs to make that argument in order for the parent comment to go through, but on reflection it is sometimes load-bearing for my models.
For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.
I’m hearing you say “If there’s lots of types of ways to do strategic deception, and we can easily verify the presence (or lack) of a wide variety of them, this probably give us a good shot of selecting against all strategically deceptive AIs in our selection process”.
And I’m hearing John’s position as “At a sufficient power level, if a single one of them gets through your training process you’re screwed. And some of the types will be very hard to verify the presence of.”
And then I’m left with an open question as to whether the former is sufficient to prevent the latter, on which my model of Mark is optimistic (i.e. gives it >30% chance of working) and John is pessimistic (i.e. gives it <5% chance of working).
If you’re committed to producing a powerful AI then the thing that matters is the probability there exists something you can’t find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there’s a decent chance you just won’t get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.
It seems plausible to both of us that you can use some straightforward selection against straightforward deception and end up succeeding, up to a certain power level, and that marginal research on how to do this improves your odds. But:
I think there’s a power level where it definitely doesn’t work, for the sort of ontological reasons alluded to here whereby[1] useful cognition for achieving an AI’s goals will optimize against you understanding it even without it needing to be tagged as deceptive or for the AI to have any self-awareness of this property.
I also think it’s always a terrifying bet to make due to the adversarialness, whereby you may get a great deal of evidence consistent with it all going quite rosy right up until it dramatically fails (e.g. FTX was an insanely good investment according to financial investors and Effective Altruists right up until it was the worst investment they’d ever made and these people were not stupid).
These reasons give me a sense of naivety to betting on “trying to straightforwardly select against deceptiveness” that “but a lot of the time it’s easier for me to verify the deceptive behavior than for the AI to generate it!” doesn’t fully grapple with, even while it’s hard to point to the exact step whereby I imagine such AI developers getting tricked.
...however my sense from the first half of your comment (“I think our current understanding is sufficiently paltry that the chance of this working is pretty low”) is that we’re broadly in agreement about the odds of betting on this (even though I kind of expect you would articulate why quite differently to how I did).
You then write:
Certainly being able to show that an AI is behaving deceptively in a way that is hard to train out will in some worlds be useful for pausing AI capabilities progress, though I think this is not a great set of worlds to be betting on ending up in — I think it more likely than not that an AI company would willingly deploy many such AIs.
Be that as it may, it currently reads to me like your interest in this line of research is resting on some belief in a political will to pause in the face of clearly deceptive behavior that I am less confident of, and that’s a different crux than the likelihood of success of the naive select-against-deception strategy (and the likely returns of marginal research on this track).
Which implies that the relative ease of verification/generation is not much delta between your perspective and mine on this issue (and evidence against it being the primary delta between John’s and Paul’s writ large).
(The following is my own phrasing, not the linked post’s.)
(I didn’t want to press it since your first comment sounded like you were kinda busy, but I am interested in hearing more details about this)
I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He, for example, doesn’t suck at hiring because he thinks it’s in fact a hard problem to verify if someone is good at their job.
I liked Rohin’s comment elsewhere on this general thread.
I’m happy to answer more specific questions, although I would generally feel more comfortable answering questions about my views than about Paul’s.
As someone who has used this insight that verification is easier than generation before, I heartily support this point:
One of my worked examples of this being important is that this was part of my argument on why AI alignment generalizes further than AI capabilities, where in this context it’s much easier and more reliable to give feedback on whether a situation was good for my values, than to actually act on the situation itself. Indeed, it’s so much easier that social reformers tend to fall into the trap of thinking that just because you can verify something is right or wrong means you can just create new right social norms just as easily, when the latter problem is much harder than the former problem.
This link is where I got this quote:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
I also agree that these 2 theses are worth distinguishing:
I think I am confused by the idea that one of verification or generation reliably wins out. The balance seems to vary between different problems, or it even seems like they nest.
When I am coding something, if my colleague walks over and starts implementing the next step, I am pretty lost, and even after they tell me what they did I probably would’ve rather done it myself, as it’s one step of a larger plan for building something and I’d do it differently from them. If they implement the whole thing, then I can review their pull request and get a fairly good sense of what they did and typically approve it much faster than I can build it. If it’s a single feature in a larger project, I often can’t tell if that was the right feature to build without knowing the full project plan, and even then I’d rather run the project myself if I wanted to be confident it would succeed (rather than follow someone else’s designs). After the project is completed and given a few months to air I can tend to see how the users use the feature, and whether it paid off. But on the higher level I don’t know if this is the right way for the company to go in terms of product direction, and to know that it was a good choice I’d rather be the one making the decision myself. And so on. (On the highest level I do not know my values and wouldn’t hand over the full control of the future to any AI because I don’t trust that I could tell good from bad, I think I’d mostly be confused about what it did.)
Yeah, I admit a lot of the crux comes down to whether your case is more the exception or the rule, and I admit that I think that your situation is more unusual compared to the case where you can locally verify something without having to execute the global plan.
I tend to agree far more with Paul Christiano than with John Wentworth on the delta of
But to address what it would mean for alignment to generalize more than capabilities, this would essentially mean it’s easier to get an AI to value what you value without the failure modes of deceptive/pseudo/suboptimality alignment than it is to get an AI that actually executes on your values through capabilities in the real world.
I admit that I both know a lot more about what exactly I value, and I also trust AIs to generalize more from values data than you do, for several reasons.
You admitting it does not make me believe it any more than you simply claiming it! :-) (Perhaps you were supposed to write “I admit that I think I both know...”)
I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn’t believe you that it was what you valued). But I’m pretty sure you typically experience many of the same confusions as me when wondering if a big decision you made was good or bad (e.g. moving countries, changing jobs, choosing who your friends are, etc), confusions about what I even want in my day-to-day life (should I be investing more in work? in my personal relationships? in writing essays? etc), confusions about big ethical questions (how close to having a utility function am I? if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?), and more. I can imagine that you have a better sense than I do (perhaps you’re more in touch with who you are than I) but I don’t believe you’ll have fundamentally answered all the open problems in ethics and agency.
Okay, I think I’ve found the crux here:
I don’t value getting maximum diamonds and paperclips, but I think you’ve correctly identified my crux here: I think values and value formation are both simpler, in the sense that they require a lot less of a prior and a lot more can be learned from data, and less fragile than a lot of LWers believe; and this doesn’t just apply to my own values, which could broadly be said to be quite socially liberal and economically centrist.
I think this for several reasons:
I think a lot of people are making an error when they estimate how complicated their values are in the sense relevant for AI alignment, because they add both the complexity of the generative process/algorithms/priors for values and the complexity of the data for value learning, and I think most of the complexity of my own values as well as other people’s values is in very large part (like 90-99%+) the data, and not encoded priors from my genetics.
This is because I think a lot of what evopsych says about how humans got their capabilities and values is basically wrong, and I think one of the more interesting pieces of evidence is that in AI training, there’s a general dictum that the data matter more than the architecture/prior in how AIs will behave, especially OOD generalization, as well as the bitter lesson in DL capabilities.
While this itself is important for why I don’t think that we need to program in a very complicated value/utility function, I also think that there is enough of an analogy between DL and the brain such that you can transport a lot of insights between one field and another, and there are some very interesting papers on the similarity between the human brain and what LLMs are doing, and spoiler alert, they’re not the same thing, but they are doing pretty similar things and I’ll give all links below:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003963
https://www.nature.com/articles/s41593-022-01026-4
https://www.biorxiv.org/content/10.1101/2022.03.01.482586v1.full
https://www.nature.com/articles/s42003-022-03036-1
https://arxiv.org/abs/2306.01930
To answer some side questions:
how close to having a utility function am I?
The answer is a bit tricky, but my general answer is that the model-based RL parts of my brain probably are maximizing utility, but that the model-free RL part isn’t doing this, for reasons related to “reward is not the optimization target”.
So my answer is about 10-50% close, where there are significant differences, but I do see some similarities between utility maximization and what humans do.
This one is extremely easy to answer:
(if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?)
The answer is that they’d look like each other, though there can be real differences. Critically, the data and the brain do not usually update this fast except in some constrained circumstances; just because data matters more than architecture doesn’t mean the brain updates its values this quickly.
You can’t explain the disagreement well enough to be productive yet! I have faith that you may be able to in the future, even if not now.
For reference, readers can see Paul and John debate this a little in this 2022 thread on AIs doing alignment research.
The claim is verification is easier than generation. This post considers a completely different claim that “verification is easy”, e.g.
I just don’t care much if the refrigerator or keyboard or tupperware or whatever might be bad in non-obvious ways that we failed to verify, unless you also argue that it would be easier to create better versions from scratch than to notice the flaws.
Now to be fair, maybe Paul and I are just fooling ourselves, and really all of our intuitions come from “verification is easy”, which John gestures at:
But I don’t think “verification is easy” matters much to my views. Re: the three things you mention:
From my perspective (and Paul’s) the air conditioning thing had very little bearing on alignment.
In principle I could see myself thinking bureaucracies are terrible given sufficient difficulty-of-verification. But like, most of my reasoning here is just looking at the world and noticing large bureaucracies often do better (see e.g. comments here). Note I am not saying large human bureaucracies don’t have obvious, easily-fixable problems—just that, in practice, they often do better than small orgs.
Separately, from an alignment perspective, I don’t care much what human bureaucracies look like, since they are very disanalogous to AI bureaucracies.
If you take AI progress as exogenous (i.e. you can’t affect it), outsourcing safety is a straightforward consequence of (a) not-super-discontinuous progress (sometimes called “slow takeoff”) and (b) expecting new problems as capability increases.
Once you get to AIs that are 2x smarter than you, and have to align the AIs that are going to be 4x smarter than you, it seems like either (a) you’ve failed to align the 2x AIs (in which case further human-only research seems unlikely to change much, so it doesn’t change much if you outsource to the AIs and they defect) or (b) you have aligned the 2x AIs (in which case your odds for future AIs are surely better if you use the 2x AIs to do more alignment research).
Obviously “how hard is verification” has implications for whether you work on slowing AI progress, but this doesn’t seem central.
There’s lots of complications I haven’t discussed but I really don’t think “verification is easy” ends up mattering very much to any of them.
I disagree with this curation because I don’t think this post will stand the test of time. While Wentworth’s delta to Yudkowsky has a legible takeaway—ease of ontology translation—that is tied to his research on natural latents, it is less clear what John means here and what to take away. Simplicity is not a virtue when the issue is complex and you fail to actually simplify it.
Verification vs generation has an extremely wide space of possible interpretations, and as stated here the claim is incredibly vague. The argument for why difficulty of verification implies difficulty of delegation is not laid out, and the examples do not go in much depth. John says that convincing people is not the point of this post, but this means we also don’t really have gears behind the claims.
The comments didn’t really help—most of the comments here are expressing confusion, wanting more specificity, or disagreeing whereupon John doesn’t engage. Also, Paul didn’t reply. I don’t feel any more enlightened after reading them except to disagree with some extremely strong version of this post...
Vanilla HCH is an 8-year-old model of delegation to AIs which Yudkowsky convinced me was not aligned in like 2018. Why not engage with the limiting constructions in 11 Proposals, the work in the ELK report, recent work by ARC, recent empirical work on AI debate?
I agree that this pointer to a worldview-difference is pretty high-level / general, and the post would be more valuable with a clearer list of some research disagreements or empirical disagreements. Perhaps I made a mistake to curate a relatively loose pointer. I think I assign at least 35% to “if we’re all still alive in 5 years and there’s a much stronger public understanding of Christiano’s perspective on the world, this post will in retrospect be a pretty good high-level pointer to where he differs from many others (slash a mistake he was making)”, but I still appreciate the datapoint that you (and Mark and Rohin) did not find it helpful nor agree with it, and it makes me think it more probable that I made a mistake.
I have left a comment about a central way I think this post is misguided: https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano?commentId=sthrPShrmv8esrDw2
Very strong claim which the post doesn’t provide nearly enough evidence to support
I mean, yeah, convincing people of the truth of that claim was not the point of the post.
Sorry, was in a hurry when I wrote this. What I meant / should have said is: it seems really valuable to me to understand how you can refute Paul’s views so confidently and I’d love to hear more.
This post uses “I can identify ways in which chairs are bad” as an example. But it’s easier for me to verify that I can sit in a chair and that it’s comfortable than to make a chair myself. So I don’t really know why this is a good example for “verification is easier than generation”.
More examples:
I can tell my computer is a good typing machine, but cannot make one myself
I can tell a water bottle is watertight, but do not know how to make a water bottle
I can tell that my pepper grinder grinds pepper, but do not know how to make a pepper grinder.
If the goal of this post is to discuss the crux https://www.lesswrong.com/posts/fYf9JAwa6BYMt8GBj/link-a-minimal-viable-product-for-alignment?commentId=mPgnTZYSRNJDwmr64:
then I think there is a large disconnect between the post above, which is positing that in order for this claim to be false there has to be some “deep” sense in which delegation is viable, and the sense in which I think this crux is obviously false in the more mundane sense in which all humans interface with the world and optimize over the products other people create, and are therefore more capable than they would have been if they had to make all products for themselves from scratch.
I assumed John was pointing at verifying that perhaps the chemicals used in the production of the chair might have some really bad impact on the environment, start causing a problem with the food-chain ecosystem and make food much scarcer for everyone—including the person who bought the chair—in the meaningfully near future. Something along those lines.
As you note, verifying the chair functions as you want—as a place to sit that is comfortable—is pretty easy. Most of us probably do that without even really thinking about it. But whether this chair will “kill me” in the future is not so obvious or easy to assess.
I suspect that, at the core, this is a question about an assumption of evaluating a simple/non-complex world, when doing so in an inherently complex world doesn’t allow true separability into simple and independent structures.
What I had in mind is more like: many times over the years I’ve been sitting at a desk and noticed my neck getting sore. Then when I move around a bit, I realize that the chair/desk/screen are positioned such that my neck is at an awkward angle when looking at the screen, which makes my neck sore when I hold that angle for a long time. The mispositioning isn’t very salient; I just reflexively adjust my neck to look at the screen and don’t notice that it’s at an awkward angle. Then later my neck hurts, and it’s nonobvious and takes some examination to figure out why my neck hurts.
That sort of thing, I claim, generalizes to most “ergonomics”. Chairs, keyboards, desks, mice… these are all often awkward in ways which make us uncomfortable when using them for a long time. But the awkwardness isn’t very salient or obvious (for most people), because we just automatically adjust position to handle it, and the discomfort only comes much later from holding that awkward position for a long time.
I agree ergonomics can be hard to verify. But some ergonomics are easy to verify, and chairs conform to those ergonomics (e.g. having a backrest is good, not having sharp stabby parts is good, etc.).
I mean, sure, for any given X there will be some desirable properties of X which are easy to verify, and it’s usually pretty easy to outsource the creation of an X which satisfies the easy-to-verify properties. The problem is that the easy-to-verify properties do not typically include all the properties which are important to us. Ergonomics is a very typical example.
Extending to AI: sure, there will be some desirable properties of AI which are easy to verify, or properties of alignment research which are easy to verify, or properties of plans which are easy to verify, etc. And it will be easy to outsource the creation of AI/research/plans which satisfy those easy-to-verify properties. Alas, the easy-to-verify properties do not include all the properties which are important to us, or even all the properties needed to not die.
I think there are some easy-to-verify properties that would make us more likely to die if they were hard-to-verify. And therefore think “verification is easier than generation” is an important part of the overall landscape of AI risk.
That is certainly a more directly related, non-obvious aspect for verification. Thanks.
I agree that there are some properties of objects that are hard to verify. But that doesn’t mean generation is as hard as verification in general. The central property of a chair (that you can sit on it) is easy to verify.
This feels more like an argument that Wentworth’s model is low-resolution than that he’s actually misidentified where the disagreement is?
I’m curious what you think of Paul’s points (2) and (3) here:
And specifically to what degree you think future AI systems will make “major technical contributions” that are legible to their human overseers before they’re powerful enough to take over completely.
You write:
But how likely do you think it is that these OOM jumps happen before vs. after a decisive loss of control?
My own take: I think there will probably be enough selection pressure and sophistication in primarily human-driven R&D processes alone to get to uncontrollable AI. Weak AGIs might speed the process along in various ways, but by the time an AI itself can actually drive the research process autonomously (and possibly make discontinuous progress), the AI will already also be capable of escaping or deceiving its operators pretty easily, and deception / escape seems likely to happen first for instrumental reasons.
But my own view isn’t based on the difficulty of verification vs. generation, and I’m not specifically skeptical of bureaucracies / delegation. Doing bad / fake R&D that your overseers can’t reliably check does seem somewhat easier than doing real / good R&D, but not always, and as a strategy seems like it would usually be dominated by “just escape first and do your own thing”.
But, I think the negative impacts that these goods have on you are (mostly) realized on longer timescales—say, years to decades. If you’re using a chair that is bad for your posture, the impacts of this are usually seen years down the line when your back starts aching. Or if you keep microwaving tupperware, you may end up with some pretty nasty medical problems, but again, decades down the line.
The property of an action having long horizons until it can be verified as good or bad for you makes delegating to smarter-than-you systems dangerous. My intuition is that there are lots of tasks that could significantly accelerate alignment research that don’t have this property, examples being codebase writing (unit tests can provide quick feedback), proof verification etc. In fact, I can’t think of many research tasks in technical fields that have month/year/decade horizons until they can be verified—though maybe I’ve just not given it enough thought.
Many research tasks have very long delays until they can be verified. The history of technology is littered with apparently good ideas that turned out to be losers after huge development efforts were poured into them. Supersonic transport, zeppelins, silicon-on-sapphire integrated circuits, pigeon-guided bombs, object-oriented operating systems, hydrogenated vegetable oil, oxidative decoupling for weight loss…
Finding out that these were bad required making them, releasing them to the market, and watching unrecognized problems torpedo them. Sometimes it took decades.
But if the core difficulty in solving alignment is developing some difficult mathematical formalism and figuring out relevant proofs then I think we won’t suffer from the problems with the technologies above. In other words, I would feel comfortable delegating and overseeing a team of AIs that have been tasked with solving the Riemann hypothesis—and I think this is what a large part of solving alignment might look like.
“May it go from your lips to God’s ears,” as the old Jewish saying goes. Meaning, I hope you’re right. Maybe aligning superintelligence will largely be a matter of human-checkable mathematical proof.
I have 45 years experience as a software and hardware engineer, which makes me cynical. When one of my designs encounters the real world, it hardly ever goes the way I expect. It usually either needs some rapid finagling to make it work (acceptable) or it needs to be completely abandoned (bad). This is no good for the first decisive try at superalignment; that has to work first time. I hope our proof technology is up to it.
This does not agree with my understanding of what HCH is at all. HCH is a definition of an abstract process for thought experiments, much like AIXI is. It’s defined as the fixed point of some iterative process of delegation expanding out into a tree. It’s also not something you could actually implement, but it’s a platonic form like “circle” or “integral”.
This has nothing to do with the way an HCH-like process would be implemented. You could easily have something that’s designed to mimic HCH but it’s implemented as a single monolithic AI system.
As you’re doing these delta posts, do you feel like it’s changing your own positions at all?
For example, reading this one what strikes me is that what’s portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that’s unlikely to be uniform across different types of problems.
To my eyes the most likely outcome is a situation where you are both right.
Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification will be hard and unsupervised delegation will be catastrophic (cough glue on pizza).
If we are only rolling things up into aggregate pictures of the average case across all problems, I can see the discussion filtering back into those two distinct deltas, but a bit like flip-flops and water bottles, the lack of nuance obscures big picture decision making.
So I’m curious if as you explore and represent the opposing views to your own, particularly as you seem to be making effort to represent without depicting them as straw person arguments, if your own views have been deepening and changing through the process?
Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I’ve been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I’ve long since updated on, not so much figuring out new stuff.
In terms of the hard-to-verify aspect, while it’s true that any one person will face any number of challenges, do we live in a world where one person does anything on their own?
How would the open-source model influence outcomes? When pretty much anyone can take a look, and presumably many do, does the level of verification, or ease of verification, improve in your model?
Crucially, this is true only because you’re relatively smart for a human: smarter than many of the engineers that designed those objects, and smarter than most or all of the committee-of-engineers that designed those objects. You can come up with better solutions than they did, if you have a similar level of context.
But that’s not true of most humans. Most humans, if they did a deep dive into those objects, wouldn’t notice the many places where there is substantial room for improvement. Just like most humans don’t spontaneously recognize blatant-to-me incentive problems in government design (and virtually every human institution), and just as I often wouldn’t be able to tell that a software solution was horrendously badly architected, at least without learning a bunch of software engineering in addition to doing a deep dive into this particular program.
Note that to the extent this is true, it suggests verification is even harder than John thinks.
Hmm, not exactly. Our verification ability only needs to be sufficiently good relative to the AIs.
Ehh, yes and no. I maybe buy that a median human doing a deep dive into a random object wouldn’t notice the many places where there is substantial room for improvement; hanging around with rationalists does make it easy to forget just how low the median-human bar is.
But I would guess that a median engineer is plenty smart enough to see the places where there is substantial room for improvement, at least within their specialty. Indeed, I would guess that the engineers designing these products often knew perfectly well that they were making tradeoffs which a fully-informed customer wouldn’t make. The problem, I expect, is mostly organizational dysfunction (e.g. the committee of engineers is dumber than one engineer, and if there are any nontechnical managers involved then the collective intelligence nosedives real fast), and economic selection pressure.
For instance, I know plenty of software engineers who work at the big tech companies. The large majority of them (in my experience) know perfectly well that their software is a trash fire, and will tell you as much, and will happily expound in great detail the organizational structure and incentives which lead to the ongoing trash fire.
Is there an opposite of the “failure of ease of verification” that would add up to 100% if you would categorize the whole of reality into 1 of these 2 categories? Say in a simulation, if you attributed every piece of computation into following 2 categories, how much of the world can be “explained by” each category?
make sure stuff “works at all and is easy to verify whether it works at all”
stuff that works must be “potentially better in ways that are hard to verify”
Examples:
when you press the “K” key on your keyboard for 1000 times, it will launch nuclear missiles ~0 times and the K key will “be pressed” ~999 times
when your monitor shows you the pixels for a glyph of the letter “K” 1000 times, it will represent the planet Jupiter ~0 times and “there will be” the letter K ~999 times
in each page in your stack of books, the character U+0000 is visible ~0 times and the letter A, say ~123 times
tupperware was your own purchase and not gifted by a family member? I mean, for which exact feature would you pay how much more?!?
you can tell whether a water bottle contains potable water and not sulfuric acid
carpet, desk, and chair haven’t spontaneously combusted (yet?)
the refrigerator doesn’t produce any black holes
(flip-flops are evil and I don’t want to jinx any sinks at this time)
Based on my one deep dive on pens a few years ago, this seems true. Maybe that would be too high-dimensional and too unfocused a post, but maybe there should be a post on “best X of every common product people use every day”? And then we somehow filter for people with actual expertise? Like for pens you want to go with the recommendations of “the pen addict”.
The issue there is that “best X” varies wildly depending on purpose, budget and usage.
Take a pen: For me, I mostly keep pens in my bag to make quick notes and lend out. The overriding concern is that the pens are very cheap, can be visually checked whether full or empty, and never leak, because they will spend a lot of time bouncing around in my bag, and I am unlikely to get them back when loaned.
A calligrapher has very different requirements.