This is mostly in response to stuff written by Richard, but I’m interested in everyone’s read of the situation.
While I don’t find Eliezer’s core intuitions about intelligence too implausible, they don’t seem compelling enough to do as much work as Eliezer argues they do. As in the Foom debate, I think that our object-level discussions were constrained by our different underlying attitudes towards high-level abstractions, which are hard to pin down (let alone resolve).
Given this, I think that the most productive mode of intellectual engagement with Eliezer’s worldview going forward is probably not to continue debating it (since that would likely hit those same underlying disagreements), but rather to try to inhabit it deeply enough to rederive his conclusions and find new explanations of them which then lead to clearer object-level cruxes.
I’m not sure yet how to word this as a question without some introductory paragraphs. When I read Eliezer, I often feel like he has a coherent worldview that sees lots of deep connections and explains lots of things, and that he’s actively trying to be coherent / explain everything. [This is what I think you’re pointing to with his ‘attitude towards high-level abstractions’.]
When I read other people, I often feel like they’re operating in a ‘narrower segment of their model’, or not trying to fit the whole world at once, or something. They often seem to emit sentences that are ‘not absurd’, instead of ‘on their mainline’, because they’re mostly trying to generate sentences that pass some shallow checks instead of ‘coming from their complete mental universe.’
Why is this?
Just a difference in articulation or cultural style? (Like, people have complete mental models, they just aren’t as good at or less interested in exposing the pieces as Eliezer is.)
A real difference in functioning? (Certainly there are sentences that I emit which are not ‘on my mainline’, because I’m trying to achieve some end besides the ‘predict the world accurately’ end, and while I think my mental universe has lots of detail and models I don’t have the sense that it’s as coherent as Eliezer’s mental universe.)
The thing I think is happening with Eliezer is illusory? (In fact he’s operating narrow models like everyone else, he just has more confidence that those models apply broadly.)
I notice I’m still a little stuck on this comment from earlier, where I think Richard had a reasonable response to my complaint on the object level (indeed, strong forces opposed to technological progress make sense, as does their not necessarily being rational or succeeding in every instance), but there’s still some meta-level mismatch. Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with ‘since you’re being overly pessimistic, I will be overly optimistic to balance’, with no attempt to have his response match his own mainline. And then when Eliezer responded with:
But there’s a really really basic lesson here about the different style of “sentences found in political history books” rather than “sentences produced by people imagining ways future politics could handle an issue successfully”.
the subject got changed.
But I’m still deeply interested in the really really basic lesson, and how deeply it’s been grokked by everyone involved!
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer’s, albeit probabilistic ones, rather than bailing with “the future is hard to predict”). At a high level I don’t think “mainline” is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what “mainline” means), and that neat stories that fit everything usually don’t work well (unless, or often even if, generated in hindsight).
In answer to your “why is this,” I think it’s a combination of moderate differences in functioning and large differences in communication style. I think Eliezer has a way of thinking about the future that is quite different from mine, which I’m somewhat skeptical of and feel he is overselling (which is what got me into this discussion), but that difference is probably smaller than the large difference in communication style (driven partly by different skills, different aesthetics, and different ideas about what kinds of standards discourse should aspire to).
I think I may not understand the basic lesson / broader point well, so will probably be more helpful on object-level points and will mostly go answer those in the time I have.
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer’s, albeit probabilistic ones, rather than bailing with “the future is hard to predict”).
Sometimes I’ll be tracking a finite number of “concrete hypotheses”, where every hypothesis is ‘fully fleshed out’, and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes they get ruled out or need to split, or so on. In those cases, I’m moderately confident that every ‘hypothesis’ corresponds to a ‘real world’, constrained by how well I can get my imagination to correspond to reality. [A ‘finite number’ depends on the situation, but I think it’s normally something like 2-5, unless it’s an area I’ve built up a lot of cache about.]
Sometimes I’ll be tracking a bunch of “surface-level features”, where the distributions on the features don’t always imply coherent underlying worlds, either on their own or in combination with other features. (For example, I might have guesses about the probability that a random number is odd and a different guess about the probability that a random number is divisible by 3 and, until I deliberately consider the joint probability distribution, not have any guarantee that it’ll be coherent.)
Normally I’m doing something more like a mixture of those, which I think of as particles of incomplete world models, with some features pinned down and others mostly ‘surface-level features’. I can often simultaneously consider many more of these; like, when I’m playing Go, I might be tracking a dozen different ‘lines of attack’, which have something like 2-4 moves clearly defined and the others ‘implied’ (in a way that might not actually be consistent).
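To make that concrete, here is a minimal toy sketch of the bookkeeping I have in mind, with made-up features and weights; it is meant to illustrate the reasoning style, not to claim this is literally how it runs in anyone’s head.

```python
# Each "particle" is a partially specified world: a few features pinned down,
# the rest left open. The weight tracks how seriously I currently take it.
particles = [
    {"features": {"takeoff": "fast", "agendas_that_work": "few"},  "weight": 0.40},
    {"features": {"takeoff": "slow", "agendas_that_work": "few"},  "weight": 0.35},
    {"features": {"takeoff": "slow", "agendas_that_work": "many"}, "weight": 0.25},
]

def update(particles, feature, observed, hit=4.0, miss=0.25):
    """Reweight particles on an observation about one feature.

    Particles that pinned the feature down get multiplied up or down;
    particles that were silent on it get 'split' by pinning it now.
    """
    new = []
    for p in particles:
        pinned = p["features"].get(feature)
        if pinned is None:
            new.append({"features": {**p["features"], feature: observed},
                        "weight": p["weight"]})
        else:
            new.append({"features": dict(p["features"]),
                        "weight": p["weight"] * (hit if pinned == observed else miss)})
    total = sum(p["weight"] for p in new)
    # Renormalize, and drop particles whose weight has become negligible.
    return [{"features": p["features"], "weight": p["weight"] / total}
            for p in new if p["weight"] / total > 0.01]

particles = update(particles, "takeoff", "slow")
for p in particles:
    print(round(p["weight"], 2), p["features"])
```

In this picture, the ‘mainline’ is just whichever particle currently has plurality weight.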
Are any of those like your experience? Or is there some other way you’d describe it?
different ideas about what kinds of standards discourse should aspire to
Have you written about this / could you? I’d be pretty excited about being able to try out discoursing with people in a Paul-virtuous way.
I think my way of thinking about things is often a lot like “draw random samples,” more like drawing N random samples rather than particle filtering (I guess since we aren’t making observations as we go—if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).
The main complexity feels like the thing you point out where it’s impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and then refine those intuitions only periodically when you actually try to flesh something out and see if it makes sense. And often you go even further and just talk about relationships amongst surface level features using intuitions refined from a bunch of samples.
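A toy rendering of what I mean, with made-up features and a deliberately crude ‘update’ step (just to gesture at the shape of the process, not to be taken literally):

```python
import random

def draw_sample(generator):
    """Draw one candidate world from my current intuitions (toy stand-in)."""
    return {feature: random.choice(options) for feature, options in generator.items()}

def consistent(world):
    """Toy stand-in for 'try to flesh it out and see if it makes sense'."""
    return not (world["oversight"] == "cheap" and world["models"] == "opaque")

def sample_worlds(generator, n=5, max_restarts=100):
    """On noticing an inconsistency, update the generator on the logical fact
    and start over with n fresh samples, rather than reweighting the old ones."""
    for _ in range(max_restarts):
        samples = [draw_sample(generator) for _ in range(n)]
        if all(consistent(w) for w in samples):
            return samples
        # Crude "update on the logical fact": stop generating the combination
        # that turned out not to make sense, then resample from scratch.
        generator = {**generator, "models": ["interpretable"]}
    return samples

generator = {"oversight": ["cheap", "expensive"], "models": ["opaque", "interpretable"]}
for world in sample_worlds(generator):
    print(world)
```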
I feel like a distinctive feature of Eliezer’s dialog w.r.t. foom / alignment difficulty is that he has a lot of views about strong regularities that should hold across all of these worlds. And then disputes about whether worlds are plausible often turn on things like “is this property of the described world likely?” which is tough because obviously everyone agrees that every particular world is unlikely. To Eliezer it seems obvious that the feature is improbable (because it was just produced by seeing where the world violated the strong regularity he believes in), whereas to the other person it just looks like one of many scenarios that is implausible only in its concrete details. And then this isn’t well-resolved by “just talk about your mainline” because the “mainline” is a distribution over worlds which are all individually improbable (for either Eliezer or for others).
This is all a bit of a guess though / rambling speculation.
I think my way of thinking about things is often a lot like “draw random samples,” more like drawing N random samples rather than particle filtering (I guess since we aren’t making observations as we go—if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).
Oh whoa, you don’t remember your samples from before? [I guess I might not either, unless I’m concentrating on keeping them around or have verbalized them or something; probably I do something more expert-iteration-like where I’m silently updating my generating distributions based on the samples and then resampling them in the future.]
To Eliezer it seems obvious that the feature is improbable (because it was just produced by seeing where the world violated the strong regularity he believes in), whereas to the other person it just looks like one of many scenarios that is implausible only in its concrete details. And then this isn’t well-resolved by “just talk about your mainline” because the “mainline” is a distribution over worlds which are all individually improbable (for either Eliezer or for others).
Yeah, this seems likely; this makes me more interested in the “selectively ignoring variables” hypothesis for why Eliezer running this strategy might have something that would naturally be called a mainline. [Like, it’s very easy to predict “number of apples sold = number of apples bought” whereas it’s much harder to predict the price of apples.] But maybe instead he means it in the ‘startup plan’ sense, where you do actually assign basically no probability to your mainline prediction, but still vastly more than any other prediction that’s equally conjunctive.
EDIT: I wrote this before seeing Paul’s response; hence a significant amount of repetition.
They often seem to emit sentences that are ‘not absurd’, instead of ‘on their mainline’, because they’re mostly trying to generate sentences that pass some shallow checks instead of ‘coming from their complete mental universe.’
Why is this?
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like “in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe”, I’m obviously not claiming that this is a realistic thing that I expect to happen, so it’s not coming from my “complete mental universe”; I’m just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).
From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say “maybe X happens”, or “X is not absurd”, I’m saying that my probability distribution assigns non-trivial probability to futures in which X happens. Notably, this is absolutely “coming from my complete mental universe”—the probability distribution is all there is, there’s no extra constraints that take 5% probabilities and drive them down to 0, or whatever else you might imagine would give you a “mainline”.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that? It’s certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over. In that setting “talk about the mainline” sounds like “keep just one hypothesis and talk about what it says”; this is not going to give you good results. Maybe more charitably it’s “one hypothesis is going to stably get >50% probability and so you should think about that hypothesis a lot” but I don’t see why that should be true.
Obviously some things do in fact get > 90% probability; if you ask me questions like “what’s the probability that if it rains the sidewalk will be wet” I will totally have a mainline, and there will be edge cases like “what if the rain stopped at the boundary between the sidewalk and the road” but those will be mostly irrelevant. The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.
I presume that there is an underlying unvoiced argument that goes “Rohin, you say that you have a probability distribution over futures; that implies that you have many, many different consistent worlds in mind, and you are uncertain about which one we’re in, and when you are asked for the probability of X then you sum probabilities across each of the worlds where X holds. This seems wild; it’s such a ridiculously complicated operation for a puny human brain to implement; there’s no way you’re doing this. You’re probably just implementing some simpler heuristic where you look at some simple surface desideratum and go ‘idk, 30%’ out of modesty.”
Obviously I do not literally perform the operation described above; like any bounded agent I have to approximate the ideal. But I do not then give up and say “okay, I’ll just think about a single consistent world and drop the rest of the distribution”; I do my best to represent the full range of uncertainty: attempting to have all of my probabilities on events ground out in specific worlds that I think are plausible, thinking about some specific worlds in greater detail to see what sorts of correlations arise between different important phenomena, carrying out some consistency checks on the probabilities I assign to events to notice cases where I’m clearly making mistakes, etc. I don’t see why “have a mainline” is obviously a better response to our boundedness than the approach I use (if anything, it seems obviously a worse response).
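(To make the approximation concrete, here is a minimal toy sketch of what I mean by probabilities on events grounding out in specific worlds, plus a consistency check, with invented features and weights; it illustrates the bookkeeping, not my actual credences.)

```python
# A finite stand-in for the distribution: a few sampled worlds with weights.
worlds = [
    ({"deceptive": True,  "caught_by_oversight": True},  0.30),
    ({"deceptive": True,  "caught_by_oversight": False}, 0.20),
    ({"deceptive": False, "caught_by_oversight": False}, 0.50),
]

def prob(event):
    """P(event) = total weight of the worlds in which the event holds."""
    return sum(weight for world, weight in worlds if event(world))

p_deceptive = prob(lambda w: w["deceptive"])
p_caught    = prob(lambda w: w["caught_by_oversight"])
p_both      = prob(lambda w: w["deceptive"] and w["caught_by_oversight"])

# Consistency check: a joint probability can never exceed either marginal.
assert p_both <= min(p_deceptive, p_caught)
print(p_deceptive, p_caught, p_both)
```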
In response to your last couple paragraphs: the critique, afaict, is not “a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those”, but rather “a common method for representing lots of hypotheses at once is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd).”
On my understanding of Eliezer’s picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.
For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.
Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that?
I’ll try to explain the technique and why it’s useful. I’ll start with a non-probabilistic version of the idea, since it’s a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.
Suppose I’m building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy—think energy conservation, or Newton’s Laws, or market efficiency, depending on what kind of systems we’re talking about. My hope/plan is to derive (i.e. prove) some predictions from these conditions, or maybe prove some of the conditions from others.
Before I go too far down the path of proving things from the conditions, I’d like to do a quick check that my conditions are consistent at all. How can I do that? Well, human brains are quite good at constrained optimization, so one useful technique is to look for one example of a system which satisfies all the conditions. If I can find one example, then I can be confident that the conditions are at least not inconsistent. And in practice, once I have that one example in hand, I can also use it for other purposes: I can usually see what (possibly unexpected) degrees of freedom the conditions leave open, or what (possibly unexpected) degrees of freedom the conditions don’t leave open. By looking at that example, I can get a feel for the “directions” along which the conditions do/don’t “lock in” the properties of the system.
(Note that in practice, we often start with an example to which we want our conditions to apply, and we choose the conditions accordingly. In that case, our one example is built in, although we do need to remember the unfortunately-often-overlooked step of actually checking what degrees of freedom the conditions do/don’t leave open to the example.)
What would a probabilistic version of this look like? Well, we have a world model with some (uncertain) constraints in it—i.e. kinds-of-things-which-tend-to-happen, and kinds-of-things-which-tend-to-not-happen. Then, we look for an example which generally matches the kinds-of-things-which-tend-to-happen. If we can find such an example, then we know that the kinds-of-things-which-tend-to-happen are mutually compatible; a high probability for some of them does not imply a low probability for others. With that example in hand, we can also usually recognize which features of the example are very-nailed-down by the things-which-tend-to-happen, and which features have lots of freedom. We may, for instance, notice that there’s some very-nailed-down property which seems unrealistic in the real world; I expect that to be the most common way for this technique to unearth problems.
That’s the role a “mainline” prediction serves. Note that it does not imply the mainline has a high probability overall, nor does it imply a high probability that all of the things-which-tend-to-happen will necessarily occur simultaneously. It’s checking whether the supposed kinds-of-things-which-tend-to-happen are mutually consistent with each other, and it provides some intuition for what degrees of freedom the kinds-of-things-which-tend-to-happen do/don’t leave open.
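As a toy illustration of the non-probabilistic version (the conditions here are invented placeholders, not any particular physical or economic law): brute-force search for one example satisfying all the conditions at once, which both certifies that the conditions are consistent and shows which degrees of freedom they leave open.

```python
from itertools import product

# Toy "conditions" on a system with three integer parameters in [0, 10).
conditions = [
    lambda a, b, c: a + b + c == 10,  # a conservation-style constraint
    lambda a, b, c: a < b,            # an ordering constraint
    lambda a, b, c: c % 2 == 0,       # a parity constraint
]

examples = [
    (a, b, c)
    for a, b, c in product(range(10), repeat=3)
    if all(cond(a, b, c) for cond in conditions)
]

# One example proves the conditions are mutually consistent; the whole list
# shows which degrees of freedom the conditions do and don't pin down.
print(examples[0] if examples else "conditions are inconsistent")
print(len(examples), "satisfying assignments")
```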
Man, I would not call the technique you described “mainline prediction”. It also seems kinda inconsistent with Vaniver’s usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.
Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about “mainline predictions”—for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that’s because (a) Eliezer wanted enough concreteness that I ended up talking about the stupidly inefficient version rather than the one I’d actually expect in the real world and (b) I was focused on demonstrating an existence proof for the technical properties, rather than also trying to include the social ones.)
To be clear, I do not mean to use the label “mainline prediction” for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.
(Also, it matches up quite well with Nate’s model based on his comment here, and I expect it also matches how Eliezer wants to use the technique.)
The technique you described is in fact very useful
If your probability distribution over futures happens to be such that it has a “mainline prediction”, you get significant benefits from that (similar to the benefits you get from the technique you described).
Man, I would not call the technique you described “mainline prediction”. It also seems kinda inconsistent with Vaniver’s usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.
Vaniver, is this what you meant?
Uh, I inherited “mainline” from Eliezer’s usage in the dialogue, and am guessing that his reasoning is following a process sort of like mine and John’s. My natural word for it is a ‘particle’, from particle filtering, as linked in various places, which I think is consistent with John’s description. I’m further guessing that Eliezer’s noticed more constraints / implied inconsistencies, and is somewhat better at figuring out which variables to drop, so that his cloud is narrower than mine / more generates ‘mainline predictions’ than ‘probability distributions’.
If so, my new answer is that I and others do in fact talk about “mainline predictions”—for me, there was that whole section talking about natural language debate as an alignment strategy.
Do you feel like you do this ‘sometimes’, or ‘basically always’? Maybe it would be productive for me to reread the dialogue (or at least part of it) and sort sections / comments by how much they feel like they’re coming from this vs. some other source.
As a specific thing that I have in mind, I think there’s a habit of thinking / discourse that philosophy trains, which is having separate senses for “views in consideration” and “what I believe”, and thinking that statements should be considered against all views in consideration, even ones that you don’t believe. This seems pretty good in some respects (if you begin by disbelieving a view incorrectly, your habits nevertheless gather you lots of evidence about it, which can cause you to then correctly believe it), and pretty questionable in other respects (conversations between Alice and Bob now have to include them shadowboxing with everyone else in the broader discourse, as Alice is asking herself “what would Carol say in response to that?” to things that Bob says to her).
When I imagine dialogues generated by people who are both sometimes doing the mainline thing and sometimes doing the ‘represent the whole discourse’ thing, they look pretty different from dialogues generated by people who are both only doing the mainline thing. [And also from dialogues generated by both people only doing the ‘represent the whole discourse’ thing, of course.]
Do you feel like you do this ‘sometimes’, or ‘basically always’?
I don’t know what “this” refers to. If the referent is “have a concrete example in mind”, then I do that frequently but not always. I do it a ton when I’m not very knowledgeable and learning about a thing; I do it less as my mastery of a subject increases. (Examples: when I was initially learning addition, I used the concrete example of holding up three fingers and then counting up two more to compute 3 + 2 = 5, which I do not do any more. When I first learned recursion, I used to explicitly run through an execution trace to ensure my program would work, now I do not.)
If the referent is “make statements that reflect my beliefs”, then it depends on context, but in the context of these dialogues, I’m always doing that. (Whereas when I’m writing for the newsletter, I’m more often trying to represent the whole discourse, though the “opinion” sections are still entirely my beliefs.)
whatever else you might imagine would give you a “mainline”.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
I think this is roughly how I’m thinking about things sometimes, tho I’d describe the mainline as the particle with plurality weight (which is a weaker condition than >50%). [I don’t know how Eliezer thinks about things; maybe it’s like this? I’d be interested in hearing his description.]
I think this is also a generator of disagreements about what sort of things are worth betting on; when I imagine why I would bail with “the future is hard to predict”, it’s because the hypotheses/particles I’m considering have clearly defined X, Y, and Z variables (often discretized into bins or ranges) but not clearly defined A, B, and C variables (tho they might have distributions over those variables), because if you also conditioned on those you would have Too Many Particles. And when I imagine trying to contrast particles on features A, B, and C, as they all make weak predictions we get at most a few bits of evidence to update their weights on, whereas when we contrast them on X, Y, and Z we get many more bits, and so it feels more fruitful to reason about.
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that? It’s certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over.
I mean, the question is which direction we want to approach Bayesianism from, given that Bayesianism is impossible (as you point out later in your comment). On the one hand, you could focus on ‘updating’, and have lots of distributions that aren’t grounded in reality but which are easy to massage when new observations come in, and on the other hand, you could focus on ‘hypotheses’, and have as many models of the situation as you can ground, and then have to do something much more complicated when new observations come in.
[Like, a thing I find helpful to think about here is where the motive power of Aumann’s Agreement Theorem comes from: when I say 40% A, you know that my private info is consistent with an update of the shared prior whose posterior is 40%; when you take the shared prior and update on your private info plus the fact that my private info is consistent with 40%, and your posterior is 60% A, then I update to 48% A, which is what happens when I further condition on knowing that your private info is consistent with that update; and so on. Like, we both have to be manipulating functions on the whole shared prior for every update!]
For what it’s worth, I think both styles are pretty useful in the appropriate context. [I am moderately confident this is a situation where it’s worth doing the ‘grounded-in-reality’ particle-filtering approach, i.e. hitting the ‘be concrete’ and ‘be specific’ buttons over and over, and then once you’ve built out one hypothesis doing it again with new samples.]
The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.
I don’t think I believe the ‘should always have a mainline’ thing, but I do think I want to defend the weaker claim of “it’s worth having a mainline about this.” Like, I think if you’re starting a startup, it’s really helpful to have a ‘mainline plan’ wherein the whole thing actually works, even if you ascribe basically no probability to it going ‘exactly to plan’. Plans are useless, planning is indispensable.
[Also I think it’s neat that there’s a symmetry here about complaining about the uncertainty of the future, which makes sense if we’re both trying to hold onto different pieces of Bayesianism while looking at the same problem.]
If you define “mainline” as “particle with plurality weight”, then I think I was in fact “talking on my mainline” at some points during the conversation, and basically everywhere that I was talking about worlds (instead of specific technical points or intuition pumps) I was talking about “one of my top 10 particles”.
I think I responded to every request for concreteness with a fairly concrete answer. Feel free to ask me for more concreteness in any particular story I told during the conversation.
I’m just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
Huh, I guess I don’t believe the intuition pump? Like, as the first counterexample that comes to mind, when I imagine having an AGI where I can tell everything about how it’s thinking, and yet I remain a black box to myself, I can’t really tell whether or not it’s aligned to me. (Is me-now the one that I want it to be aligned to, or me-across-time? Which side of my internal conflicts about A vs. B / which principle for resolving such conflicts?)
I can of course imagine a reasonable response to that from you—”ah, resolving philosophical difficulties is the user’s problem, and not one of the things that I mean by alignment”—but I think I have some more-obviously-alignment-related counterexamples. [Tho if by ‘infinite oversight ability’ you do mean something like ‘logical omniscience’ it does become pretty difficult to find a real counterexample, in part because I can just find the future trajectory with highest expected utility and take the action I take at the start of that trajectory without having to have any sort of understanding about why that action was predictably a good idea.]
But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? “If we had an infinitely good engine, we could make the perfect car”, which seems sensible when you’re used to thinking of engine improvements linearly increasing car quality and doesn’t seem sensible when you’re used to thinking of car quality as a product of sigmoids of the input variables.
(This is a long response to a short section because I think the disagreement here is about something like “how should we reason and communicate about intuitions?”, and so it’s worth expanding on what I think might be the implications of otherwise minor disagreements.)
I can of course imagine a reasonable response to that from you—”ah, resolving philosophical difficulties is the user’s problem, and not one of the things that I mean by alignment”
That is in fact my response. (Though one of the ways in which the intuition pump isn’t fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can’t correctly predict the consequences of running that program for a long time. Still feels like they’d do fine.)
I do agree that if you go as far as “logical omniscience” then there are “cheating” ways of solving the problem that don’t really tell us much about how hard alignment is in practice.
But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? “If we had an infinitely good engine, we could make the perfect car”, which seems sensible when you’re used to thinking of engine improvements linearly increasing car quality and doesn’t seem sensible when you’re used to thinking of car quality as a product of sigmoids of the input variables.
The car analogy just doesn’t seem sensible. I can tell stories of car doom even if you have infinitely good engines (e.g. the steering breaks). My point is that we struggle to tell stories of doom when imagining a very powerful oversight process that knows everything the model knows.
I’m not thinking “more oversight quality --> more alignment” and then concluding “infinite oversight quality --> alignment solved”. I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”. So I don’t think this has much to do with extrapolating tangents vs. production functions, except inasmuch as production functions encourage you to think about complements to your inputs that you can then posit don’t exist in order to tell a story of doom.
I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”.
I think some of my more alignment-flavored counterexamples look like:
The ‘reengineer it to be safe’ step breaks down / isn’t implemented thru oversight. Like, if we’re positing we spin up a whole Great Reflection to evaluate every action the AI takes, this seems like it’s probably not going to be competitive!
The oversight gives us as much info as we ask for, but the world is a siren world (like what Stuart points to, but a little different), where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
Related to the previous point, the oversight is sufficient to reveal features about the plan that are terrible, but before the ‘reengineer to make it more safe’ plan is executed, the code is stolen and executed by a subset of humanity which thinks the terrible plan is ‘good enough’, for them at least.
That is, it feels to me like we benefit a lot from having 1) a constructive approach to alignment instead of rejection sampling, 2) sufficient security focus that we don’t proceed on EV of known information, but actually do the ‘due diligence’, and 3) sufficient coordination among humans that we don’t leave behind substantial swaths of current human preferences, and I don’t see how we get those thru having arbitrary transparency.
[I also would like to solve the problem of “AI has good outcomes” instead of the smaller problem of “AI isn’t out to get us”, because accidental deaths are deaths too! But I do think it makes sense to focus on that capability problem separately, at least sometimes.]
I obviously do not think this is at all competitive, and I also wanted to ignore the “other people steal your code” case. I am confused what you think I was trying to do with that intuition pump.
I guess I said “powerful oversight would solve alignment” which could be construed to mean that powerful oversight ⇒ great future, in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems?
Maybe your point is that there are lots of things required for a good future, just as a car needs both steering and an engine, and so the intuition pump is not interesting because it doesn’t talk about all the things needed for a good future? If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
I am confused what you think I was trying to do with that intuition pump.
I think I’m confused about the intuition pump too! Like, here’s some options I thought up:
The ‘alignment problem’ is really the ‘not enough oversight’ problem. [But then if we solve the ‘enough oversight’ problem, we still have to solve the ‘what we want’ problem, the ‘coordination’ problem, the ‘construct competitively’ problem, etc.]
Bits of the alignment problem can be traded off against each other, most obviously coordination and ‘alignment tax’ (i.e. the additional amount of work you need to do to make a system aligned, or the opposite of ‘competitiveness’, which I didn’t want to use here for ease-of-understanding-by-newbies reasons.) [But it’s basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection sampling story for trading off time and understanding but I think this is basically not true because you’re also optimizing for finding holes in your transparency regime.]
Like, by analogy, I could imagine someone who uses an intuition pump of “if you had sufficient money, you could solve any problem”, but I wouldn’t use that intuition pump because I don’t believe it. [Sure, ‘by definition’ if the amount of money doesn’t solve the problem, it’s not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy?]
(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)
in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems
I both 1) didn’t think it was obvious (sorry if I’m being slow on following the change in usage of ‘alignment’ here) and 2) don’t think realistically powerful oversight solves either of those two on its own (outer alignment because of “rejection sampling can get you siren worlds” problem, inner alignment because “rejection sampling isn’t competitive”, but I find that one not very compelling and suspect I’ll eventually develop a better objection).
[EDIT: I note that I also might be doing another unfavorable assumption here, where I’m assuming “unlimited oversight capacity” is something like “perfect transparency”, and so we might not choose to spend all of our oversight capacity, but you might be including things here like “actually it takes no time to understand what the model is doing” or “the oversight capacity is of humans too,” which I think weakens the outer alignment objection pretty substantially.]
If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
Cool! I’m glad we agree on that, and will try to do more “did you mean limited statement X that we more agree about?” in the future.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
It came from where we decided to look. While I think it’s possible you could have an AI that’s out to deceive us, by putting information we want to see where we’re going to look and information we don’t want to see where we’re not going to look, I think something like this happens by default because the human operators will have a smaller checklist than they should have: “Will the AI cure cancer? Yes? Cool, press the button.” instead of “Will the AI cure cancer? Yes? Cool. Will it preserve our ability to generate more AIs in the future to solve additional problems? No? Hmm, let’s take a look at that.”
Like, this is the sort of normal software development story where bugs that cause the system to visibly not work get noticed and fixed, and bugs that cause the system to do things that the programmers don’t intend only get noticed if the programmers anticipated it and wrote a test for it, or a user discovered it in action and reported it to the programmers, or an adversary discovered that it was possible by reading the code / experimenting with the system and deliberately caused it to happen.
I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn’t make sense outside of that context.
(The mentality is “it doesn’t matter what oversight process you use, there’s always a malicious superintelligence that can game it, therefore everyone dies”.)
The most recent post has a related exchange between Eliezer and Rohin:
Eliezer: I think the critical insight—though it has a format that basically nobody except me ever visibly invokes in those terms, and I worry maybe it can only be taught by a kind of life experience that’s very hard to obtain—is the realization that any consistent reasonable story about underlying mechanisms will give you less optimistic forecasts than the ones you get by freely combining surface desiderata
Rohin: Yeah, I think I do not in fact understand why that is true for any consistent reasonable story.
If I’m being locally nitpicky, I argue that Eliezer’s thing is a very mild overstatement (it should be “≤” instead of “<”) but given that we’re talking about forecasts, we’re talking about uncertainty, and so we should expect “less” optimism instead of just “not more” optimism, and so I think Eliezer’s statement stands as a general principle about engineering design.
This also feels to me like the sort of thing that I somehow want to direct attention towards. Either this principle is right and relevant (and it would be good for the field if all the AI safety thinkers held it!), or there’s some deep confusion of mine that I’d like cleared up.
Question to Eliezer: would you agree with the gist of the following? And if not, any thoughts on what led to a strong sense of ‘coherence in your worldview’, as Vaniver put it?
Vaniver, I feel like you’re pointing at something that I’ve noticed as well and am interested in too (the coherence of Eliezer’s worldview, as you put it). I wonder if it has something to do with not going to uni but building his whole worldview all by himself. In my experience uni often tends towards cramming lots of facts which are easily testable on exams, with less emphasis on understanding underlying principles (which is harder to test with multiple choice questions). Personally I feel like I had to spend my years after uni trying to make sense, a coherent whole if you like, of all the separate things I learned while in uni, where things were mostly just kind of put out there without constantly integrating them. Perhaps if you start out thinking much more about underlying principles earlier on, it’s easier to integrate all the separate facts into a coherent whole as you go along. Not sure if Eliezer would agree with this. Maybe it’s even much more basic and he just always had a very strong sense of dissatisfaction if he couldn’t make things cohere into a whole, and this urge for things to make sense was much more important than self-studying or thinking about underlying principles before and then during the learning of new knowledge...
I would like to point out a section in the latest Shah/Yudkowsky dialogue where Eliezer says some things about this topic. Does this feel like it’s the same thing you are talking about, Vaniver?
Eliezer: “So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am.
This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all. Richard Feynman—or so I would now say in retrospect—is noticing concreteness dying out of the world, and being worried about that, at the point where he goes to a college and hears a professor talking about “essential objects” in class, and Feynman asks “Is a brick an essential object?”—meaning to work up to the notion of the inside of a brick, which can’t be observed because breaking a brick in half just gives you two new exterior surfaces—and everybody in the classroom has a different notion of what it would mean for a brick to be an essential object.
Richard Feynman knew to try plugging in bricks as a special case, but the people in the classroom didn’t, and I think the mental motion has died out of the world even further since Feynman wrote about it. The loss has spread to STEM as well. Though if you don’t read old books and papers and contrast them to new books and papers, you wouldn’t see it, and maybe most of the people who’ll eventually read this will have no idea what I’m talking about because they’ve never seen it any other way...
I have a thesis about how optimism over AGI works. It goes like this: People use really abstract descriptions and never imagine anything sufficiently concrete, and this lets the abstract properties waver around ambiguously and inconsistently to give the desired final conclusions of the argument. So MIRI is the only voice that gives concrete examples and also by far the most pessimistic voice; if you go around fully specifying things, you can see that what gives you a good property in one place gives you a bad property someplace else, you see that you can’t get all the properties you want simultaneously.”
(For the reader, I don’t think that “arguments about what you’re selecting for” is the same thing as “freely combining surface desiderata”, though I do expect they look approximately the same to Eliezer)
and my immediately preceding message was
I actually think something like this might be a crux for me, though obviously I wouldn’t put it the way you’re putting it. More like “are arguments about internal mechanisms more or less trustworthy than arguments about what you’re selecting for” (limiting to arguments we actually have access to, of course in the limit of perfect knowledge internal mechanisms beats selection). But that is I think a discussion for another day.
I think I was responding to the version of the argument where “freely combining surface desiderata” was swapped out with “arguments about what you’re selecting for”. I probably should have noted that I agreed with the basic abstract point as Eliezer stated it; I just don’t think it’s very relevant to the actual disagreement.
I think my complaints in the context of the discussion are:
It’s a very weak statement. If you freely combine the most optimistic surface desiderata, you get ~0% chance of doom. My estimate is way higher (in odds-space) than ~0%, and the statement “p(doom) >= ~0%” is not that interesting and not a justification of “doom is near-inevitable”.
Relatedly, I am not just “freely combining surface desiderata”. I am doing something like “predicting what properties AI systems would have by reasoning about what properties we selected for during training”. I think you could reasonably ask how that compares against “predicting what properties AI systems would have by reasoning about what mechanistic algorithms could produce the behavior we observed during training”. I was under the impression that this was what Eliezer was pointing at (because that’s how I framed it in the message immediately prior to the one you quoted) but I’m less confident of that now.
Sorry, I probably should have been more clear about the “this is a quote from a longer dialogue, the missing context is important.” I do think that the disagreement about “how relevant is this to ‘actual disagreement’?” is basically the live thing, not whether or not you agree with the basic abstract point.
My current sense is that you’re right that the thing you’re doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that you have too many free parameters (even if the number of free parameters is two instead of arbitrarily large). I think arguments about what you’re selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there’s 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.
To use an example that makes me look bad, there was a time when I didn’t believe Arrow’s Impossibility Theorem because I was using the ‘freely combine surface desiderata’ strategy. The comment that snapped me out of it involved having to actually write out the whole voting rule, and see that I couldn’t instantiate the thing I thought I could instantiate.
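Concretely, here is a toy rendering of that moment, using the simpler Condorcet-cycle cousin of Arrow’s theorem rather than the theorem itself: once you write out an actual profile and run pairwise majority, the “just follow the majorities” rule visibly fails to produce a coherent ranking.

```python
from itertools import combinations

# Three voters with cyclic preferences over candidates A, B, C.
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters ranks x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

# "Freely combining desiderata" says: just follow the pairwise majorities.
# Writing out the concrete rule shows the majorities form a cycle.
for x, y in combinations("ABC", 2):
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
print("A beats B beats C beats A:",
      majority_prefers("A", "B") and majority_prefers("B", "C") and majority_prefers("C", "A"))
```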
As a more AI-flavored example, I was talking last night with Alex about ELK, specifically trying to estimate the relative population of honest reporters and dishonest reporters in the prior implied by the neural tangent kernel model, and he observed that if you had a constructive approach of generating initializations that only contained honest reporters, that might basically solve the ELK problem; after thinking about it for a bit I said “huh, that seems right but I’m not sure it’s possible to do that, because maybe any way to compose an honest reporter out of parts gives you all of the parts you need to compose a dishonest reporter.”
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there’s 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.
I agree that if you have a choice about whether to have more or fewer free parameters, all else equal you should prefer the model with fewer free parameters. (Obviously, all else is not equal; in particular I do not think that Eliezer’s model is tracking reality as well as mine.)
When Alice uses a model with more free parameters, you need to posit a bias before you can predict a systematic direction in which Alice will make mistakes. So this only bites you if you have a bias towards optimism. I know Eliezer thinks I have such a bias. I disagree with him.
I think arguments about what you’re selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.
I agree that this is true in some platonic sense. Either the argument gives me a correct answer, in which case I have true statements that could be cashed out in terms of mechanistic algorithms, or the argument gives me a wrong answer, in which case it wouldn’t be derivable from mechanistic algorithms, because the mechanistic algorithms are the “ground truth”.
Quoting myself from the dialogue:
(limiting to arguments we actually have access to, of course in the limit of perfect knowledge internal mechanisms beats selection)
When Alice uses a model with more free parameters, you need to posit a bias before you can predict a systematic direction in which Alice will make mistakes. So this only bites you if you have a bias towards optimism.
That is, when I give Optimistic Alice fewer constraints, she can more easily imagine a solution, and when I give Pessimistic Bob fewer constraints, he can more easily imagine that no solution is possible? I think… this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math. Like, the way Bob fails to find a solution mostly looks like “not actually considering the space”, or “wasting consideration on easily-known-bad parts of the space”, and more constraints could help with both of those. But, as math, removing constraints can’t lower the volume of the implied space and so can’t make it less likely that a viable solution exists.
I know Eliezer thinks I have such a bias. I disagree with him.
I think Eliezer thinks nearly all humans have such a bias by default, and so without clear evidence to the contrary it’s a reasonable suspicion for anyone.
[I think there’s a thing Eliezer does a lot, which I have mixed feelings about, which is matching people’s statements to patterns and then responding to the generator of the pattern in Eliezer’s head, which only sometimes corresponds to the generator in the other person’s head.]
I agree that this is true in some platonic sense.
Cool, makes sense. [I continue to think we disagree about how true this is in a practical sense, where I read you as thinking “yeah, this is a minor consideration, we have to think with the tools we have access to, which could be wrong in either direction and so are useful as a point estimate” and me as thinking “huh, this really seems like the tools we have access to are going to give us overly optimistic answers, and we should focus more on how to get tools that will give us more robust answers.”]
[I think there’s a thing Eliezer does a lot, which I have mixed feelings about, which is matching people’s statements to patterns and then responding to the generator of the pattern in Eliezer’s head, which only sometimes corresponds to the generator in the other person’s head.]
I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They’d go around telling me “Ray, you’re exhibiting that bias right now. Whatever rationalization you’re coming up with right now, it’s not the real reason you’re arguing X.” And I was like “c’mon man. I have a ton of introspective access to myself and I can tell that this ‘rationalization’ is actually a pretty good reason to believe X and I trust that my reasoning process is real.”
But… eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on “is Ray displaying rational thought?”. When they extrospected me (i.e. reading my facial expressions), they were checking for a positive result on “does Ray seem biased in this particular way?”.
And both checks totally returned ‘true’, and that was an accurate assessment.
The particular moment where I noticed this metapattern, I’d say my cognition was, say, 65% “good argumentation”, 15% “one particular bias”, 20% “other random stuff.” On a different day, it might have been that I was 65% exhibiting the bias and 15% good argumentation.
None of this is making much claim of what’s likely to be going on in Rohin’s head or Eliezer’s head or whether Eliezer’s conversational pattern is useful, but wanted to flag it as a way people could be talking past each other.
I think… this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math.
I think we’re imagining different toy mathematical models.
Your model, according to me:
There is a space of possible approaches, that we are searching over to find a solution. (E.g. the space of all possible programs.)
We put a layer of abstraction on top of this space, characterizing approaches by N different “features” (e.g. “is it goal-directed”, “is it an oracle”, “is it capable of destroying the world”)
Because we’re bounded agents, we then treat the features as independent, and search for some combination of features that would comprise a solution.
I agree that this procedure has a systematic error in claiming that there is a solution when none exists (and doesn’t have the opposite error), and that if this were an accurate model of how I was reasoning I should be way more worried about correcting for that problem.
My model:
There is a probability distribution over “ways the world could be”.
We put a layer of abstraction on top of this space, characterizing “ways the world could be” by N different “features” (e.g. “can you get human-level intelligence out of a pile of heuristics”, “what are the returns to specialization”, “how different will AI ontologies be from human ontologies”). We estimate the marginal probability of each of those features.
Because we’re bounded agents, when we need the joint probability of two or more features, we treat them as independent and just multiply.
Given a proposed solution, we estimate its probability of working by identifying which features need to be true of the world for the solution to work, and then estimate the probability of those features (using the method above).
I claim that this procedure doesn’t have a systematic error in the direction of optimism (at least until you add some additional details), and that this procedure more accurately reflects the sort of reasoning that I am doing.
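As a toy rendering of the procedure in steps 1-4 (invented feature names and numbers, not anyone’s actual estimates):

```python
# Step 2: features of "ways the world could be", with estimated marginals.
# (Feature names and numbers are made up for illustration.)
feature_marginals = {
    "heuristics_suffice_for_human_level": 0.6,
    "ai_ontologies_close_to_human": 0.3,
    "high_returns_to_specialization": 0.5,
}

def p_solution_works(required_features):
    """Steps 3-4: treat the features a solution depends on as independent
    and multiply their marginal probabilities."""
    p = 1.0
    for f in required_features:
        p *= feature_marginals[f]
    return p

# A proposed solution that only works if the first two features hold:
print(p_solution_works(["heuristics_suffice_for_human_level",
                        "ai_ontologies_close_to_human"]))  # 0.6 * 0.3 = 0.18
```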
Huh, why doesn’t that procedure have that systematic error?
Like, when I try to naively run your steps 1-4 on “probability of there existing a number that’s both even and odd”, I get that about 25% of numbers should be both even and odd, so it seems pretty likely that it’ll work out given that there are at least 4 numbers. But I can’t easily construct an argument at a similar level of sophistication that gives me an underestimate. [Like, “probability of there existing a number that’s both odd and prime” gives the wrong conclusion if you buy that the probability that a natural number is prime is 0, but this is because you evaluated your limits in the wrong order, not because of a problem with dropping all the covariance data from your joint distribution.]
My first guess is that you think I’m doing the “ways the world could be” thing wrong—like, I’m looking at predicates over numbers and trying to evaluate a predicate over all numbers, but instead I should just have a probability on “universe contains a number that is both even and odd” and its complement, as those are the two relevant ways the world can be.
My second guess is that you’ve got a different distribution over target predicates; like, we can just take the complement of my overestimate (“probability of there existing no numbers that are both even and odd”) and call it an underestimate. But I think I’m more interested in ‘overestimating existence’ than ‘underestimating non-existence’. [Is this an example of the ‘additional details’ you’re talking about?]
Also maybe you can just exhibit a simple example that has an underestimate, and then we need to think harder about how likely overestimates and underestimates are to see if there’s a net bias.
I think if you have a particular number then I’m like “yup, it’s fair to notice that we overestimate the probability that x is even and odd by saying it’s 25%”, and then I’d say “notice that we underestimate the probability that x is even and divisible by 4 by saying it’s 12.5%”.
I agree that if you estimate a probability, and then “perform search” / “optimize” / “run n copies of the estimate” (so that you estimate the probability as 1 - (1 - P(event))^n), then you’re going to have systematic errors.
I don’t think I’m doing anything that’s analogous to that. I definitely don’t go around thinking “well, it seems 10% likely that such and such feature of the world holds, and so each alignment scheme I think of that depends on this feature has a 10% chance of working, therefore if I think of 10 alignment schemes I’ve solved the problem”. (I suspect this is not the sort of mistake you imagine me doing but I don’t think I know what you do imagine me doing.)
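A quick numeric check of the running example, making the arithmetic above explicit:

```python
# Over the "worlds" 1..1000, the independence approximation overestimates
# "even and odd" and underestimates "even and divisible by 4", so the error
# from dropping covariances has no fixed sign.
N = 1000
nums = range(1, N + 1)

def frac(pred):
    return sum(1 for n in nums if pred(n)) / N

p_even = frac(lambda n: n % 2 == 0)   # 0.5
p_odd  = frac(lambda n: n % 2 == 1)   # 0.5
p_div4 = frac(lambda n: n % 4 == 0)   # 0.25

# "Even and odd": independence says 0.25, the truth is 0.
print(p_even * p_odd, frac(lambda n: n % 2 == 0 and n % 2 == 1))   # 0.25  0.0

# "Even and divisible by 4": independence says 0.125, the truth is 0.25.
print(p_even * p_div4, frac(lambda n: n % 2 == 0 and n % 4 == 0))  # 0.125 0.25

# The search amplification mentioned above: estimating "at least one of 4
# candidates works" as 1 - (1 - p)^4 compounds any overestimate in p.
print(1 - (1 - p_even * p_odd) ** 4)  # ~0.68, versus a true probability of 0
```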
I’d say “notice that we underestimate the probability that x is even and divisible by 4 by saying it’s 12.5%”.
Cool, I like this example.
I agree that if you estimate a probability, and then “perform search” / “optimize” / “run n copies of the estimate” (so that you estimate the probability as 1 - (1 - P(event))^n), then you’re going to have systematic errors. ... I suspect this is not the sort of mistake you imagine me doing but I don’t think I know what you do imagine me doing.
I think the thing I’m interested in is “what are our estimates of the output of search processes?”. The question we’re ultimately trying to answer with a model here is something like “are humans, when they consider a problem that could have attempted solutions of many different forms, overly optimistic about how solvable those problems are because they hypothesize a solution with inconsistent features?”
The example of “a number divisible by 2 and a number divisible by 4” is one where the consistency of your solution helps you—anything that satisfies the second condition already satisfies the first. But importantly, the best you can do here is ignore superfluous conditions; they can’t increase the volume of the solution space. I think this is where the systematic bias is coming from (that the joint probability of two conditions can’t be higher than the smaller of the two, and can fall all the way to zero when the conditions conflict, and so the product isn’t an unbiased estimator of the joint).
For example, consider this recent analysis of cultured meat, which seems to me to point out a fundamental inconsistency of this type in people’s plans for creating cultured meat. Basically, the bigger you make a bioreactor, the better it looks on criteria ABC, and the smaller you make a bioreactor, the better it looks on criteria DEF, and projections seem to suggest that massive progress will be made on all of those criteria simultaneously because progress can be made on them individually. But this necessitates making bioreactors that are simultaneously much bigger and much smaller!
[Sometimes this is possible, because actually one is based on volume and the other is based on surface area, and so when you make something like a zeolite you can combine massive surface area with tiny volume. But if you need massive volume and tiny surface area, that’s not possible. Anyway, in this case, my read is that both of these are based off of volume, and so there’s no clever technique like that available.]
Maybe you could step me thru how your procedure works for estimating the viability of cultured meat, or the possibility of constructing a room temperature <10 atm superconductor, or something?
It seems to me like there’s a version of your procedure which, like, considers all of the different possible factory designs, applies some functions to determine the high-level features of those designs (like profitability, amount of platinum they consume, etc.), and then when we want to know “is there a profitable cultured meat factory?” responds with “conditioning on profitability > 0, this is the set of possible designs.” And then when I ask “is there a profitable cultured meat factory using less than 1% of the platinum available on Earth?” says “sorry, that query is too difficult; I calculated the set of possible designs conditioned on profitability, calculated the set of possible designs conditioned on using less than 1% of the platinum available on Earth, and then <multiplied sets together> to give you this approximate answer.”
But of course that’s not what you’re doing, because the boundedness prevents you from considering all the different possible factory designs. So instead you have, like, clusters of factory designs in your map? But what are those objects, and how do they work, and why don’t they have the problem of not noticing inconsistencies because they didn’t fully populate the details? [Or if they did fully populate the details for some limited number of considered objects, how do you back out the implied probability distribution over the non-considered objects in a way that isn’t subject to this?]
Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn’t know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.
To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understanding of causality, compression of the giant action space, etc. Everyone looked at this giant list of necessary features and thought “it’s highly improbable for an algorithm to demonstrate all of these features”. My understanding is that even OpenAI, the most optimistic of everyone, thought they would need to do some sort of hierarchical RL to get this to work. In the end, it turned out that vanilla PPO with reward shaping and domain randomization was enough. It turns out that all of these many different capabilities / features were very consistent with each other and easier to achieve simultaneously than we thought.
so the product isn’t an unbiased estimator of the joint
Tbc, I don’t want to claim “unbiased estimator” in the mathematical sense of the phrase. To even make such a claim you need to choose some underlying probability distribution which gives rise to our features, which we don’t have. I’m more saying that the direction of the bias depends on whether your features are positively vs. negatively correlated with each other and so a priori I don’t expect the bias to be in a predictable direction.
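For what it’s worth, the standard bounds make the “no predictable direction” point precise: for any events $A$ and $B$,

$$\max\big(0,\; P(A) + P(B) - 1\big) \;\le\; P(A \wedge B) \;\le\; \min\big(P(A),\, P(B)\big),$$

and the independence estimate $P(A)P(B)$ always lies inside this interval, so the true joint can sit on either side of it: above it when the features are positively correlated (even and divisible by 4), below it, all the way down to zero, when they conflict (even and odd).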
But what are those objects, and how do they work, and why don’t they have the problem of not noticing inconsistencies because they didn’t fully populate the details?
They definitely have that problem. I’m not sure how you don’t have that problem; you’re always going to have some amount of abstraction and some amount of inconsistency; the future is hard to predict for bounded humans, and you can’t “fully populate the details” as an embedded agent.
If you’re asking how you notice any inconsistencies at all (rather than all of the inconsistencies), then my answer is that you do in fact try to populate details sometimes, and that can demonstrate inconsistencies (and consistencies).
I can sketch out a more concrete, imagined-in-hindsight-and-therefore-false story of what’s happening.
Most of the “objects” are questions about the future to which there are multiple possible answers, which you have a probability distribution over (you can think of this as a factor in a Finite Factored Set, with an associated probability distribution over the answers). For example, you could imagine a question for “number of AGI orgs with a shot at time X”, “fraction of people who agree alignment is a problem”, “amount of optimization pressure needed to avoid deception”, etc. If you provide answers to some subset of questions, that gives you an incomplete possible world (which you could imagine as an implicitly-represented set of possible worlds if you want). Given an incomplete possible world, to answer a new question quickly you reason abstractly from the answers you are conditioning on to get an answer to the new question.
When you have lots of time, you can improve your reasoning in many different ways:
You can find other factors that seem important, add them in, subdividing worlds out even further.
You can take two factors, and think about how compatible they are with each other, building intuitions about their joint (rather than just their marginal probabilities, which is what you have by default).
You can take some incomplete possible world, sketch out lots of additional concrete details, and see if you can spot inconsistencies.
You can refactor your “main factors” to be more independent of each other. For example, maybe you notice that all of your reasoning about things like “<metric> at time X” depends a lot on timelines, and so you instead replace them with factors like “<metric> at X years before crunch time”, where they are more independent of timelines.
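To make the story above slightly more concrete, a toy sketch (invented factors and numbers, not an actual tool anyone uses): factors are questions with marginal distributions over answers, an “incomplete possible world” pins down some answers, and joint probabilities default to products of marginals except where a learned joint for a pair of factors overrides independence.

```python
# Factors: questions about the future, with marginal distributions over answers.
factors = {
    "num_agi_orgs_with_a_shot": {"1": 0.3, "2-5": 0.5, ">5": 0.2},
    "alignment_seen_as_problem": {"minority": 0.6, "majority": 0.4},
}

# Intuitions about a joint, built by thinking about two factors together
# (this pair overrides the default independence assumption).
learned_joints = {
    ("num_agi_orgs_with_a_shot", "alignment_seen_as_problem"): {
        ("1", "majority"): 0.2,   # default would be 0.3 * 0.4 = 0.12
    },
}

def p_incomplete_world(assignment):
    """Probability of a partial assignment of answers to factors:
    use a learned joint where one exists, otherwise multiply marginals."""
    items = sorted(assignment.items())
    p, used = 1.0, set()
    for f1, a1 in items:
        for f2, a2 in items:
            joint = learned_joints.get((f1, f2), {}).get((a1, a2))
            if joint is not None and f1 not in used and f2 not in used:
                p *= joint
                used.update([f1, f2])
    for f, a in items:
        if f not in used:
            p *= factors[f][a]
    return p

print(p_incomplete_world({"num_agi_orgs_with_a_shot": "1",
                          "alignment_seen_as_problem": "majority"}))  # 0.2
```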
When I read other people, I often feel like they’re operating in a ‘narrower segment of their model’, or not trying to fit the whole world at once, or something. They often seem to emit sentences that are ‘not absurd’, instead of ‘on their mainline’, because they’re mostly trying to generate sentences that pass some shallow checks instead of ‘coming from their complete mental universe.’
To me it seems like this is what you should expect other people to look like both when other people know less about a domain than you do, and also when you’re overconfident about your understanding of that domain. So I don’t think it helps distinguish those two cases.
(Also, to me it seems like a similar thing happens, but with the positions reversed, when Paul and Eliezer try to forecast concrete progress in ML over the next decade. Does that seem right to you?)
when Eliezer responded with:
But there’s a really really basic lesson here about the different style of “sentences found in political history books” rather than “sentences produced by people imagining ways future politics could handle an issue successfully”.
the subject got changed.
I believe this was discussed further at some point—I argued that Eliezer-style political history books also exclude statements like “and then we survived the cold war” or “most countries still don’t have nuclear energy”.
Also, to me it seems like a similar thing happens, but with the positions reversed, when Paul and Eliezer try to forecast concrete progress in ML over the next decade. Does that seem right to you?
It feels similar but clearly distinct? Like, in that situation Eliezer often seems to say things that I parse as “I don’t have any special knowledge here”, which seems like a different thing than “I can’t easily sample from my distribution over how things go right”, and I also have the sense of Paul being willing to ‘go specific’ and Eliezer not being willing to ‘go specific’.
I believe this was discussed further at some point—I argued that Eliezer-style political history books also exclude statements like “and then we survived the cold war” or “most countries still don’t have nuclear energy”.
I think I’m a little cautious about this line of discussion, because my model doesn’t strongly constrain the ways that different groups respond to increasing developments in AI. The main thing I’m confident about is that there will be much clearer responses available to us once we have a better picture of AI development.
(Or maybe a bit earlier and later, but that was my best guess for where to start the context.)
The main quotes from the middle that seem relevant:
[Ngo][18:19, moved two down in log]
(As a side note, I think that if Eliezer had been around in the 1930s, and you described to him what actually happened with nukes over the next 80 years, he would have called that “insanely optimistic”.)
[Yudkowsky][18:21]
Mmmmmmaybe. Do note that I tend to be more optimistic than the average human about, say, global warming, or everything in transhumanism outside of AGI.
Nukes have going for them that, in fact, nobody has an incentive to start a global thermonuclear war. Eliezer is not in fact pessimistic about everything and views his AGI pessimism as generalizing to very few other things, which are not, in fact, as bad as AGI.
[Yudkowsky][18:22]
But yeah, compared to pre-1946 history, nukes actually kind of did go really surprisingly well!
Like, this planet used to be a huge warring snakepit of Great Powers and Little Powers and then nukes came along and people actually got serious and decided to stop having the largest wars they could fuel.
and ending with:
[Yudkowsky][18:38]
And Eliezer is capable of being less concerned about things when they are intrinsically less concerning, which is why my history does not, unlike some others in this field, involve me running around also being Terribly Concerned about nuclear war, global warming, biotech, and killer drones.
Rereading that section, my sense is that it reads like a sort of mirror of the Eliezer->Paul “I don’t know how to operate your view” section; like, Eliezer can say “I think nukes are less worrying for reasons ABC, also you can observe me being not worried about other things-people-are-concerned-by XYZ”, but I wouldn’t have expected you (or a reader who hasn’t picked up Eliezer-thinking from elsewhere) to have been able to come away from that with why you, trying to be Eliezer from the 1930s, would have thought ‘and then it turned out okay’ would have been a political-history-book sentence, or with the relative magnitudes of the surprise. [Like, I think my 1930s-Eliezer puts like 3-30% on “and then it turned out okay” for nukes, and my 2020s-Eliezer puts like 0.03-3% on that for AGI? But it’d be nice to hear if Eliezer thinks AGI turning out as well as nukes is like 10x the surprise of nukes turning out this well conditioned on pre-1930s, or more like 1000x the surprise.]
This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today:
And I’ll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.
That is all.
(Obviously there’s a kinda superficial resemblance here to the phenomenon of “calling out” somebody else; I want to state outright that this is not the intention, it’s just that I saw your comment right after seeing Rohin’s comment, in such a way that my memory of his remark was still salient enough that the connection jumped out at me. Since salient observations tend to fade over time, I wanted to put this down before that happened.)
Yeah, I’m also interested in the question of “how do we distinguish ‘sentences-on-mainline’ from ‘shoring-up-edge-cases’?”, or which conversational moves most develop shared knowledge, or something similar.
Like I think it’s often good to point out edge cases, especially when you’re trying to formalize an argument or look for designs that get us out of this trap. In another comment in this thread, I note that there’s a thing Eliezer said that I think is very important and accurate, and also think there’s an edge case that’s not obviously handled correctly.
But also my sense is that there’s some deep benefit from “having mainlines” and conversations that are mostly ‘sentences-on-mainline’? Or, like, there’s some value to more people thinking thru / shooting down their own edge cases (like I do in the mentioned comment), instead of pushing the work to Eliezer. I’m pretty worried that there are deeply general reasons to expect AI alignment to be extremely difficult, that people aren’t updating on the meta-level point and continue to attempt ‘rolling their own crypto’, asking if Eliezer can poke the hole in this new procedure, and that if Eliezer ever decides to just write serial online fiction until the world explodes, humanity won’t have developed enough capacity to replace him.
(For object-level responses, see comments on parallel threads.)
I want to push back on an implicit framing in lines like:
there’s some value to more people thinking thru / shooting down their own edge cases [...], instead of pushing the work to Eliezer.
people aren’t updating on the meta-level point and continue to attempt ‘rolling their own crypto’, asking if Eliezer can poke the hole in this new procedure
This makes it sound like the rest of us don’t try to break our proposals, but instead push the work to Eliezer, agree with Eliezer when he finds a problem, and then fail to update that maybe future proposals will have problems too.
Whereas in reality, I try to break my proposals, don’t agree with Eliezer’s diagnoses of the problems, and usually don’t ask Eliezer because I don’t expect his answer to be useful to me (and previously didn’t expect him to respond). I expect this is true of others (like Paul and Richard) as well.
Yeah, sorry about not owning that more, and for the frame being muddled. I don’t endorse the “asking Eliezer” or “agreeing with Eliezer” bits, but I do basically think he’s right about many of the object-level problems he identifies (and thus people disagreeing with him about those is not a feature), and I think ‘security mindset’ is the right orientation to have towards AGI alignment. That hypothesis is a ‘worry’ primarily because asymmetric costs mean it’s more worth investigating than the raw probability would suggest. [Tho the raw probabilities of its components do feel pretty substantial to me.]
[EDIT: I should say I think ARC’s approach to ELK seems like a great example of “people breaking their own proposals”. As additional data to update on, I’d be interested in seeing, like, a graph of people’s optimism about ELK over time, or something similar.]
But also my sense is that there’s some deep benefit from “having mainlines” and conversations that are mostly ‘sentences-on-mainline’?
I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).
But:
Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with ‘since you’re being overly pessimistic, I will be overly optimistic to balance’, with no attempt to have his response match his own mainline.
I do note that there are some situations where rushing to tell a ‘mainline story’ might be the wrong move:
Maybe your beliefs feel wildly unstable day-to-day—because you’re learning a lot quickly, or because it’s just hard to know how to assign weight to the dozens of different considerations that bear on these questions. Then trying to take a quick snapshot of your current view might feel beside the point.
It might even feel actively counterproductive, like rushing too quickly to impose meaning/structure on data when step one is to make sure you have the data properly loaded up in your head.
Maybe there are many scenarios that seem similarly likely to you. If you see ten very different ways things could go, each with ~10% subjective probability, then picking a ‘mainline’ may be hard, and may require a bunch of arbitrary-feeling choices about which similarities-between-scenarios you choose to pay attention to.
This is mostly in response to stuff written by Richard, but I’m interested in everyone’s read of the situation.
I’m not sure yet how to word this as a question without some introductory paragraphs. When I read Eliezer, I often feel like he has a coherent worldview that sees lots of deep connections and explains lots of things, and that he’s actively trying to be coherent / explain everything. [This is what I think you’re pointing to with his ‘attitude towards high-level abstractions’.]
When I read other people, I often feel like they’re operating in a ‘narrower segment of their model’, or not trying to fit the whole world at once, or something. They often seem to emit sentences that are ‘not absurd’, instead of ‘on their mainline’, because they’re mostly trying to generate sentences that pass some shallow checks instead of ‘coming from their complete mental universe.’
Why is this?
Just a difference in articulation or cultural style? (Like, people have complete mental models, they just aren’t as good at or less interested in exposing the pieces as Eliezer is.)
A real difference in functioning? (Certainly there are sentences that I emit which are not ‘on my mainline’, because I’m trying to achieve some end besides the ‘predict the world accurately’ end, and while I think my mental universe has lots of detail and models I don’t have the sense that it’s as coherent as Eliezer’s mental universe.)
The thing I think is happening with Eliezer is illusory? (In fact he’s operating narrow models like everyone else, he just has more confidence that those models apply broadly.)
I notice I’m still a little stuck on this comment from earlier, where I think Richard had a reasonable response to my complaint on the object-level (indeed, strong forces opposed to technological progress makes sense, as do them not necessarily being rational or succeeding in every instance), but there’s still some meta-level mismatch. Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with ‘since you’re being overly pessimistic, I will be overly optimistic to balance’, with no attempt to have his response match his own mainline. And then when Eliezer responded with:
the subject got changed.
But I’m still deeply interested in the really really basic lesson, and how deeply it’s been grokked by everyone involved!
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer’s, albeit probabilistic ones, rather than bailing with “the future is hard to predict”). At a high level I don’t think “mainline” is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what “mainline” means), and that neat stories that fit everything usually don’t work well (unless, or often even if, generated in hindsight).
In answer to your “why is this,” I think it’s a combination of moderate differences in functioning and large differences in communication style. I think Eliezer has a way of thinking about the future that is quite different from mine, one that I’m somewhat skeptical of and feel he is overselling (which is what got me into this discussion), but that’s probably a smaller factor than the large difference in communication style (driven partly by different skills, different aesthetics, and different ideas about what kinds of standards discourse should aspire to).
I think I may not understand well the basic lesson / broader point, so will probably be more helpful on object level points and will mostly go answer those in the time I have.
Sometimes I’ll be tracking a finite number of “concrete hypotheses”, where every hypothesis is ‘fully fleshed out’, and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes they get ruled out or need to split, or so on. In those cases, I’m moderately confident that every ‘hypothesis’ corresponds to a ‘real world’, constrained by how well I can get my imagination to correspond to reality. [A ‘finite number’ depends on the situation, but I think it’s normally something like 2-5, unless it’s an area I’ve built up a lot of cache about.]
Sometimes I’ll be tracking a bunch of “surface-level features”, where the distributions on the features don’t always imply coherent underlying worlds, either on their own or in combination with other features. (For example, I might have guesses about the probability that a random number is odd and a different guess about the probability that a random number is divisible by 3 and, until I deliberately consider the joint probability distribution, not have any guarantee that it’ll be coherent.)
Normally I’m doing something more like a mixture of those, which I think of as particles of incomplete world models, with some features pinned down and others mostly ‘surface-level features’. I can often simultaneously consider many more of these; like, when I’m playing Go, I might be tracking a dozen different ‘lines of attack’, which have something like 2-4 moves clearly defined and the others ‘implied’ (in a way that might not actually be consistent).
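A minimal sketch of the first mode described above (made-up hypotheses and likelihoods, purely illustrative):

```python
# A handful of fully specified hypotheses tracked with weights, updated
# particle-filter style. Contents and numbers are invented.
hypotheses = [
    {"takeoff": "fast", "num_leading_orgs": 1, "weight": 0.4},
    {"takeoff": "slow", "num_leading_orgs": 3, "weight": 0.35},
    {"takeoff": "slow", "num_leading_orgs": 10, "weight": 0.25},
]

def update(hypotheses, likelihood):
    """Reweight each hypothesis by how well it predicted an observation,
    then renormalize."""
    for h in hypotheses:
        h["weight"] *= likelihood(h)
    total = sum(h["weight"] for h in hypotheses)
    for h in hypotheses:
        h["weight"] /= total

# Example observation: evidence that mildly favours multipolar worlds.
update(hypotheses, lambda h: 0.8 if h["num_leading_orgs"] > 1 else 0.5)

# One natural "mainline" here: the single heaviest particle.
print(max(hypotheses, key=lambda h: h["weight"]))
```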
Are any of those like your experience? Or is there some other way you’d describe it?
Have you written about this / could you? I’d be pretty excited about being able to try out discoursing with people in a Paul-virtuous way.
I think my way of thinking about things is often a lot like “draw random samples,” more like drawing N random samples rather than particle filtering (I guess since we aren’t making observations as we go—if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).
The main complexity feels like the thing you point out where it’s impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and then refine those intuitions only periodically when you actually try to flesh something out and see if it makes sense. And often you go even further and just talk about relationships amongst surface level features using intuitions refined from a bunch of samples.
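A toy sketch of that sample-and-backtrack loop (the features and the particular ‘inconsistency’ are invented purely to exercise the backtracking step):

```python
import random

random.seed(0)

def sample_world(ruled_out):
    """Draw a rough world-sketch, skipping combinations already ruled out."""
    while True:
        w = {"oversight": random.choice(["strong", "weak"]),
             "takeoff": random.choice(["fast", "slow"])}
        if tuple(sorted(w.items())) not in ruled_out:
            return w

def flesh_out_and_check(w):
    """Stand-in for 'try to flesh the world out and see if it makes sense'.
    The incompatibility here is arbitrary, chosen only for illustration."""
    return not (w["oversight"] == "strong" and w["takeoff"] == "fast")

ruled_out = set()
while True:
    samples = [sample_world(ruled_out) for _ in range(5)]
    bad = [w for w in samples if not flesh_out_and_check(w)]
    if not bad:
        break
    # Backtrack: update on the logical fact, then start over with fresh samples.
    ruled_out.update(tuple(sorted(w.items())) for w in bad)

print(samples)
```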
I feel like a distinctive feature of Eliezer’s dialog w.r.t. foom / alignment difficulty is that he has a lot of views about strong regularities that should hold across all of these worlds. And then disputes about whether worlds are plausible often turn on things like “is this property of the described world likely?” which is tough because obviously everyone agrees that every particular world is unlikely. To Eliezer it seems obvious that the feature is improbable (because it was just produced by seeing where the world violated the strong regularity he believes in), whereas to the other person it just looks like one of many scenarios that is implausible only in its concrete details. And then this isn’t well-resolved by “just talk about your mainline” because the “mainline” is a distribution over worlds which are all individually improbable (for either Eliezer or for others).
This is all a bit of a guess though / rambling speculation.
Oh whoa, you don’t remember your samples from before? [I guess I might not either, unless I’m concentrating on keeping them around or verbalized them or something; probably I do something more expert-iteration-like where I’m silently updating my generating distributions based on the samples and then resampling them in the future.]
Yeah, this seems likely; this makes me more interested in the “selectively ignoring variables” hypothesis for why Eliezer running this strategy might have something that would naturally be called a mainline. [Like, it’s very easy to predict “number of apples sold = number of apples bought” whereas it’s much harder to predict the price of apples.] But maybe instead he means it in the ‘startup plan’ sense, where you do actually assign basically no probability to your mainline prediction, but still vastly more than any other prediction that’s equally conjunctive.
EDIT: I wrote this before seeing Paul’s response; hence a significant amount of repetition.
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like “in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe”, I’m obviously not claiming that this is a realistic thing that I expect to happen, so it’s not coming from my “complete mental universe”; I’m just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).
From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say “maybe X happens”, or “X is not absurd”, I’m saying that my probability distribution assigns non-trivial probability to futures in which X happens. Notably, this is absolutely “coming from my complete mental universe”—the probability distribution is all there is, there’s no extra constraints that take 5% probabilities and drive them down to 0, or whatever else you might imagine would give you a “mainline”.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that? It’s certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over. In that setting “talk about the mainline” sounds like “keep just one hypothesis and talk about what it says”; this is not going to give you good results. Maybe more charitably it’s “one hypothesis is going to stably get >50% probability and so you should think about that hypothesis a lot” but I don’t see why that should be true.
Obviously some things do in fact get > 90% probability; if you ask me questions like “what’s the probability that if it rains the sidewalk will be wet” I will totally have a mainline, and there will be edge cases like “what if the rain stopped at the boundary between the sidewalk and the road” but those will be mostly irrelevant. The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.
I presume that there is an underlying unvoiced argument that goes “Rohin, you say that you have a probability distribution over futures; that implies that you have many, many different consistent worlds in mind, and you are uncertain about which one we’re in, and when you are asked for the probability of X then you sum probabilities across each of the worlds where X holds. This seems wild; it’s such a ridiculously complicated operation for a puny human brain to implement; there’s no way you’re doing this. You’re probably just implementing some simpler heuristic where you look at some simple surface desideratum and go ‘idk, 30%’ out of modesty.”
Obviously I do not literally perform the operation described above, like any bounded agent I have to approximate the ideal. But I do not then give up and say “okay, I’ll just think about a single consistent world and drop the rest of the distribution”, I do my best to represent the full range of uncertainty, attempting to have all of my probabilities on events ground out in specific worlds that I think are plausible, think about some specific worlds in greater detail to see what sorts of correlations arise between different important phenomena, carry out some consistency checks on the probabilities I assign to events to notice cases where I’m clearly making mistakes, etc. I don’t see why “have a mainline” is obviously a better response to our boundedness than the approach I use (if anything, it seems obviously a worse response).
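In symbols, the idealized operation being approximated here is just

$$P(X) \;=\; \sum_{w \,:\, X \text{ holds in } w} P(w),$$

with the bounded-agent version grounding as many of those terms as possible in specific worlds one actually finds plausible.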
In response to your last couple paragraphs: the critique, afaict, is not “a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those”, but rather “a common method for representing lots of hypotheses at once is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd).”
On my understanding of Eliezer’s picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
Relevant Feynman quote:
I’ll try to explain the technique and why it’s useful. I’ll start with a non-probabilistic version of the idea, since it’s a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.
Suppose I’m building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy—think energy conservation, or Newton’s Laws, or market efficiency, depending on what kind of systems we’re talking about. My hope/plan is to derive (i.e. prove) some predictions from these conditions, or maybe prove some of the conditions from others.
Before I go too far down the path of proving things from the conditions, I’d like to do a quick check that my conditions are consistent at all. How can I do that? Well, human brains are quite good at constrained optimization, so one useful technique is to look for one example of a system which satisfies all the conditions. If I can find one example, then I can be confident that the conditions are at least not inconsistent. And in practice, once I have that one example in hand, I can also use it for other purposes: I can usually see what (possibly unexpected) degrees of freedom the conditions leave open, or what (possibly unexpected) degrees of freedom the conditions don’t leave open. By looking at that example, I can get a feel for the “directions” along which the conditions do/don’t “lock in” the properties of the system.
(Note that in practice, we often start with an example to which we want our conditions to apply, and we choose the conditions accordingly. In that case, our one example is built in, although we do need to remember the unfortunately-often-overlooked step of actually checking what degrees of freedom the conditions do/don’t leave open to the example.)
What would a probabilistic version of this look like? Well, we have a world model with some (uncertain) constraints in it—i.e. kinds-of-things-which-tend-to-happen, and kinds-of-things-which-tend-to-not-happen. Then, we look for an example which generally matches the kinds-of-things-which-tend-to-happen. If we can find such an example, then we know that the kinds-of-things-which-tend-to-happen are mutually compatible; a high probability for some of them does not imply a low probability for others. With that example in hand, we can also usually recognize which features of the example are very-nailed-down by the things-which-tend-to-happen, and which features have lots of freedom. We may, for instance, notice that there’s some very-nailed-down property which seems unrealistic in the real world; I expect that to be the most common way for this technique to unearth problems.
That’s the role a “mainline” prediction serves. Note that it does not imply the mainline has a high probability overall, nor does it imply a high probability that all of the things-which-tend-to-happen will necessarily occur simultaneously. It’s checking whether the supposed kinds-of-things-which-tend-to-happen are mutually consistent with each other, and it provides some intuition for what degrees of freedom the kinds-of-things-which-tend-to-happen do/don’t leave open.
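A minimal sketch of the non-probabilistic version of the technique, with toy conditions chosen only for illustration:

```python
from itertools import product

# Conditions the system should satisfy, written as predicates.
conditions = {
    "budget_balances": lambda x, y: x + y == 10,
    "x_nonnegative":   lambda x, y: x >= 0,
    "y_nonnegative":   lambda x, y: y >= 0,
}

# Look for examples satisfying every condition at once.
examples = [(x, y) for x, y in product(range(-5, 16), repeat=2)
            if all(c(x, y) for c in conditions.values())]

# One example is enough to show the conditions are mutually consistent...
print(examples[0])                      # (0, 10)

# ...and the full set shows which directions are locked in vs. left free:
# x + y is pinned to 10, but x itself can be anything from 0 to 10.
print(sorted(x for x, _ in examples))
```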
Man, I would not call the technique you described “mainline prediction”. It also seems kinda inconsistent with Vaniver’s usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.
Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about “mainline predictions”—for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that’s because (a) Eliezer wanted enough concreteness that I ended up talking about the stupidly inefficient version rather than the one I’d actually expect in the real world and (b) I was focused on demonstrating an existence proof for the technical properties, rather than also trying to include the social ones.)
To be clear, I do not mean to use the label “mainline prediction” for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.
(Also, it matches up quite well with Nate’s model based on his comment here, and I expect it also matches how Eliezer wants to use the technique.)
Ah, got it. I agree that:
The technique you described is in fact very useful
If your probability distribution over futures happens to be such that it has a “mainline prediction”, you get significant benefits from that (similar to the benefits you get from the technique you described).
Uh, I inherited “mainline” from Eliezer’s usage in the dialogue, and am guessing that his reasoning is following a process sort of like mine and John’s. My natural word for it is a ‘particle’, from particle filtering, as linked in various places, which I think is consistent with John’s description. I’m further guessing that Eliezer’s noticed more constraints / implied inconsistencies, and is somewhat better at figuring out which variables to drop, so that his cloud is narrower than mine / more generates ‘mainline predictions’ than ‘probability distributions’.
Do you feel like you do this ‘sometimes’, or ‘basically always’? Maybe it would be productive for me to reread the dialogue (or at least part of it) and sort sections / comments by how much they feel like they’re coming from this vs. some other source.
As a specific thing that I have in mind, I think there’s a habit of thinking / discourse that philosophy trains, which is having separate senses for “views in consideration” and “what I believe”, and thinking that statements should be considered against all views in consideration, even ones that you don’t believe. This seems pretty good in some respects (if you begin by disbelieving a view incorrectly, your habits nevertheless gather you lots of evidence about it, which can cause you to then correctly believe it), and pretty questionable in other respects (conversations between Alice and Bob now have to include them shadowboxing with everyone else in the broader discourse, as Alice is asking herself “what would Carol say in response to that?” to things that Bob says to her).
When I imagine dialogues generated by people who are both sometimes doing the mainline thing and sometimes doing the ‘represent the whole discourse’ thing, they look pretty different from dialogues generated by people who are both only doing the mainline thing. [And also from dialogues generated by both people only doing the ‘represent the whole discourse’ thing, of course.]
I don’t know what “this” refers to. If the referent is “have a concrete example in mind”, then I do that frequently but not always. I do it a ton when I’m not very knowledgeable and learning about a thing; I do it less as my mastery of a subject increases. (Examples: when I was initially learning addition, I used the concrete example of holding up three fingers and then counting up two more to compute 3 + 2 = 5, which I do not do any more. When I first learned recursion, I used to explicitly run through an execution trace to ensure my program would work, now I do not.)
If the referent is “make statements that reflect my beliefs”, then it depends on context, but in the context of these dialogues, I’m always doing that. (Whereas when I’m writing for the newsletter, I’m more often trying to represent the whole discourse, though the “opinion” sections are still entirely my beliefs.)
I think this is roughly how I’m thinking about things sometimes, tho I’d describe the mainline as the particle with plurality weight (which is a weaker condition than >50%). [I don’t know how Eliezer thinks about things; maybe it’s like this? I’d be interested in hearing his description.]
I think this is also a generator of disagreements about what sort of things are worth betting on; when I imagine why I would bail with “the future is hard to predict”, it’s because the hypotheses/particles I’m considering have clearly defined X, Y, and Z variables (often discretized into bins or ranges) but not clearly defined A, B, and C variables (tho they might have distributions over those variables), because if you also conditioned on those you would have Too Many Particles. And when I imagine trying to contrast particles on features A, B, and C, as they all make weak predictions we get at most a few bits of evidence to update their weights on, whereas when we contrast them on X, Y, and Z we get many more bits, and so it feels more fruitful to reason about.
I mean, the question is which direction we want to approach Bayesianism from, given that Bayesianism is impossible (as you point out later in your comment). On the one hand, you could focus on ‘updating’, and have lots of distributions that aren’t grounded in reality but which are easy to massage when new observations come in, and on the other hand, you could focus on ‘hypotheses’, and have as many models of the situation as you can ground, and then have to do something much more complicated when new observations come in.
[Like, a thing I find helpful to think about here is where the motive power of Aumann’s Agreement Theorem comes from: when I say 40% A, you know that my private info is consistent with an update of the shared prior whose posterior is 40%; you take the shared prior, update on your private info plus the fact that my private info is consistent with 40%, and announce your posterior, say 60% A; then I update to, say, 48% A by further conditioning on the knowledge that your private info is consistent with that update; and so on. Like, we both have to be manipulating functions on the whole shared prior for every update!]
For what it’s worth, I think both styles are pretty useful in the appropriate context. [I am moderately confident this is a situation where it’s worth doing the ‘grounded-in-reality’ particle-filtering approach, i.e. hitting the ‘be concrete’ and ‘be specific’ buttons over and over, and then once you’ve built out one hypothesis doing it again with new samples.]
I don’t think I believe the ‘should always have a mainline’ thing, but I do think I want to defend the weaker claim of “it’s worth having a mainline about this.” Like, I think if you’re starting a startup, it’s really helpful to have a ‘mainline plan’ wherein the whole thing actually works, even if you ascribe basically no probability to it going ‘exactly to plan’. Plans are useless, planning is indispensable.
[Also I think it’s neat that there’s a symmetry here about complaining about the uncertainty of the future, which makes sense if we’re both trying to hold onto different pieces of Bayesianism while looking at the same problem.]
If you define “mainline” as “particle with plurality weight”, then I think I was in fact “talking on my mainline” at some points during the conversation, and basically everywhere that I was talking about worlds (instead of specific technical points or intuition pumps) I was talking about “one of my top 10 particles”.
I think I responded to every request for concreteness with a fairly concrete answer. Feel free to ask me for more concreteness in any particular story I told during the conversation.
Huh, I guess I don’t believe the intuition pump? Like, as the first counterexample that comes to mind, when I imagine having an AGI where I can tell everything about how it’s thinking, and yet I remain a black box to myself, I can’t really tell whether or not it’s aligned to me. (Is me-now the one that I want it to be aligned to, or me-across-time? Which side of my internal conflicts about A vs. B / which principle for resolving such conflicts?)
I can of course imagine a reasonable response to that from you—”ah, resolving philosophical difficulties is the user’s problem, and not one of the things that I mean by alignment”—but I think I have some more-obviously-alignment-related counterexamples. [Tho if by ‘infinite oversight ability’ you do mean something like ‘logical omniscience’ it does become pretty difficult to find a real counterexample, in part because I can just find the future trajectory with highest expected utility and take the action I take at the start of that trajectory without having to have any sort of understanding about why that action was predictably a good idea.]
But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? “If we had an infinitely good engine, we could make the perfect car”, which seems sensible when you’re used to thinking of engine improvements linearly increasing car quality and doesn’t seem sensible when you’re used to thinking of car quality as a product of sigmoids of the input variables.
(This is a long response to a short section because I think the disagreement here is about something like “how should we reason and communicate about intuitions?”, and so it’s worth expanding on what I think might be the implications of otherwise minor disagreements.)
That is in fact my response. (Though one of the ways in which the intuition pump isn’t fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can’t correctly predict the consequences of running that program for a long time. Still feels like they’d do fine.)
I do agree that if you go as far as “logical omniscience” then there are “cheating” ways of solving the problem that don’t really tell us much about how hard alignment is in practice.
The car analogy just doesn’t seem sensible. I can tell stories of car doom even if you have infinitely good engines (e.g. the steering breaks). My point is that we struggle to tell stories of doom when imagining a very powerful oversight process that knows everything the model knows.
I’m not thinking “more oversight quality --> more alignment” and then concluding “infinite oversight quality --> alignment solved”. I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”. So I don’t think this has much to do with extrapolating tangents vs. production functions, except inasmuch as production functions encourage you to think about complements to your inputs that you can then posit don’t exist in order to tell a story of doom.
I think some of my more alignment-flavored counterexamples look like:
The ‘reengineer it to be safe’ step breaks down / isn’t implemented thru oversight. Like, if we’re positing we spin up a whole Great Reflection to evaluate every action the AI takes, this seems like it’s probably not going to be competitive!
The oversight gives us as much info as we ask for, but the world is a siren world (like what Stuart points to, but a little different), where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
Related to the previous point, the oversight is sufficient to reveal features about the plan that are terrible, but before the ‘reengineer to make it more safe’ plan is executed, the code is stolen and executed by a subset of humanity which thinks the terrible plan is ‘good enough’, for them at least.
That is, it feels to me like we benefit a lot from having 1) a constructive approach to alignment instead of rejection sampling, 2) sufficient security focus that we don’t proceed on EV of known information, but actually do the ‘due diligence’, and 3) sufficient coordination among humans that we don’t leave behind substantial swaths of current human preferences, and I don’t see how we get those thru having arbitrary transparency.
[I also would like to solve the problem of “AI has good outcomes” instead of the smaller problem of “AI isn’t out to get us”, because accidental deaths are deaths too! But I do think it makes sense to focus on that capability problem separately, at least sometimes.]
I obviously do not think this is at all competitive, and I also wanted to ignore the “other people steal your code” case. I am confused what you think I was trying to do with that intuition pump.
I guess I said “powerful oversight would solve alignment” which could be construed to mean that powerful oversight ⇒ great future, in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems?
Maybe your point is that there are lots of things required for a good future, just as a car needs both steering and an engine, and so the intuition pump is not interesting because it doesn’t talk about all the things needed for a good future? If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
I think I’m confused about the intuition pump too! Like, here’s some options I thought up:
The ‘alignment problem’ is really the ‘not enough oversight’ problem. [But then if we solve the ‘enough oversight’ problem, we still have to solve the ‘what we want’ problem, the ‘coordination’ problem, the ‘construct competitively’ problem, etc.]
Bits of the alignment problem can be traded off against each other, most obviously coordination and ‘alignment tax’ (i.e. the additional amount of work you need to do to make a system aligned, or the opposite of ‘competitiveness’, which I didn’t want to use here for ease-of-understanding-by-newbies reasons.) [But it’s basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection sampling story for trading off time and understanding but I think this is basically not true because you’re also optimizing for finding holes in your transparency regime.]
Like, by analogy, I could imagine someone who uses an intuition pump of “if you had sufficient money, you could solve any problem”, but I wouldn’t use that intuition pump because I don’t believe it. [Sure, ‘by definition’ if the amount of money doesn’t solve the problem, it’s not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy?]
(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)
I both 1) didn’t think it was obvious (sorry if I’m being slow on following the change in usage of ‘alignment’ here) and 2) don’t think realistically powerful oversight solves either of those two on its own (outer alignment because of “rejection sampling can get you siren worlds” problem, inner alignment because “rejection sampling isn’t competitive”, but I find that one not very compelling and suspect I’ll eventually develop a better objection).
[EDIT: I note that I also might be doing another unfavorable assumption here, where I’m assuming “unlimited oversight capacity” is something like “perfect transparency”, and so we might not choose to spend all of our oversight capacity, but you might be including things here like “actually it takes no time to understand what the model is doing” or “the oversight capacity is of humans too,” which I think weakens the outer alignment objection pretty substantially.]
Cool! I’m glad we agree on that, and will try to do more “did you mean limited statement X that we more agree about?” in the future.
It came from where we decided to look. While I think it’s possible to have an AI that’s actively out to deceive us, putting information we want to see where we’re going to look and information we don’t want to see where we’re not going to look, I think something like this happens by default even without deception, because the human operators will have a smaller checklist than they should have: “Will the AI cure cancer? Yes? Cool, press the button.” instead of “Will the AI cure cancer? Yes? Cool. Will it preserve our ability to generate more AIs in the future to solve additional problems? No? Hmm, let’s take a look at that.”
Like, this is the sort of normal software development story where bugs that cause the system to visibly not work get noticed and fixed, and bugs that cause the system to do things that the programmers don’t intend only get noticed if the programmers anticipated it and wrote a test for it, or a user discovered it in action and reported it to the programmers, or an adversary discovered that it was possible by reading the code / experimenting with the system and deliberately caused it to happen.
I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn’t make sense outside of that context.
(The mentality is “it doesn’t matter what oversight process you use, there’s always a malicious superintelligence that can game it, therefore everyone dies”.)
The most recent post has a related exchange between Eliezer and Rohin:
If I’m being locally nitpicky, Eliezer’s statement is a very mild overstatement (it should be “≤” instead of “<”). But given that we’re talking about forecasts, we’re talking about uncertainty, and so we should expect “less” optimism rather than just “not more” optimism; so I think Eliezer’s statement stands as a general principle about engineering design.
This also feels to me like the sort of thing that I somehow want to direct attention towards. Either this principle is right and relevant (and it would be good for the field if all the AI safety thinkers held it!), or there’s some deep confusion of mine that I’d like cleared up.
Question to Eliezer: would you agree with the gist of the following? And if not, any thoughts on what led to a strong sense of ‘coherence in your worldview’ as Vaniver put it?
Vaniver, I feel like you’re pointing at something that I’ve noticed as well and am interested in too (the coherence of Eliezer’s worldview, as you put it). I wonder if it has something to do with not going to uni but building his whole worldview by himself. In my experience uni often tends towards cramming lots of facts which are easily testable on exams, with less emphasis on understanding underlying principles (which is harder to test with multiple choice questions). Personally I feel like I had to spend my years after uni trying to make sense, a coherent whole if you like, of all the separate things I learned while in uni, where things were mostly just put out there without being constantly integrated. Perhaps if you start out thinking much more about underlying principles earlier on, it’s easier to integrate all the separate facts into a coherent whole as you go along. Not sure if Eliezer would agree with this. Maybe it’s even more basic, and he just always had a very strong sense of dissatisfaction if he couldn’t make things cohere into a whole, and this urge for things to make sense was more important than self-studying or thinking about underlying principles before and during the learning of new knowledge...
I would like to point out a section in the latest Shay/Yudkowsky dialogue where Eliezer says some things about this topic. Does this feel like it’s the same thing you are talking about, Vaniver?
Note that my first response was:
and my immediately preceding message was
I think I was responding to the version of the argument where “freely combining surface desiderata” was swapped out with “arguments about what you’re selecting for”. I probably should have noted that I agreed with the basic abstract point as Eliezer stated it; I just don’t think it’s very relevant to the actual disagreement.
I think my complaints in the context of the discussion are:
It’s a very weak statement. If you freely combine the most optimistic surface desiderata, you get ~0% chance of doom. My estimate is way higher (in odds-space) than ~0%, and the statement “p(doom) >= ~0%” is not that interesting and not a justification of “doom is near-inevitable”.
Relatedly, I am not just “freely combining surface desiderata”. I am doing something like “predicting what properties AI systems would have by reasoning about what properties we selected for during training”. I think you could reasonably ask how that compares against “predicting what properties AI systems would have by reasoning about what mechanistic algorithms could produce the behavior we observed during training”. I was under the impression that this was what Eliezer was pointing at (because that’s how I framed it in the message immediately prior to the one you quoted) but I’m less confident of that now.
Sorry, I probably should have been more clear about the “this is a quote from a longer dialogue, the missing context is important.” I do think that the disagreement about “how relevant is this to ‘actual disagreement’?” is basically the live thing, not whether or not you agree with the basic abstract point.
My current sense is that you’re right that the thing you’re doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that you have too many free parameters (even if the number of free parameters is two instead of arbitrarily large). I think arguments about what you’re selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there’s a 100% chance of doom, and so in order to get estimates like that right, this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.
To use an example that makes me look bad, there was a time when I didn’t believe Arrow’s Impossibility Theorem because I was using the ‘freely combine surface desiderata’ strategy. The comment that snapped me out of it involved having to actually write out the whole voting rule, and see that I couldn’t instantiate the thing I thought I could instantiate.
As a more AI-flavored example, I was talking last night with Alex about ELK, specifically trying to estimate the relative population of honest reporters and dishonest reporters in the prior implied by the neural tangent kernel model, and he observed that if you had a constructive approach of generating initializations that only contained honest reporters, that might basically solve the ELK problem; after thinking about it for a bit I said “huh, that seems right but I’m not sure it’s possible to do that, because maybe any way to compose an honest reporter out of parts gives you all of the parts you need to compose a dishonest reporter.”
I agree that if you have a choice about whether to have more or fewer free parameters, all else equal you should prefer the model with fewer free parameters. (Obviously, all else is not equal; in particular I do not think that Eliezer’s model is tracking reality as well as mine.)
When Alice uses a model with more free parameters, you need to posit a bias before you can predict a systematic direction in which Alice will make mistakes. So this only bites you if you have a bias towards optimism. I know Eliezer thinks I have such a bias. I disagree with him.
I agree that this is true in some platonic sense. Either the argument gives me a correct answer, in which case I have true statements that could be cashed out in terms of mechanistic algorithms, or the argument gives me a wrong answer, in which case it wouldn’t be derivable from mechanistic algorithms, because the mechanistic algorithms are the “ground truth”.
Quoting myself from the dialogue:
That is, when I give Optimistic Alice fewer constraints, she can more easily imagine a solution, and when I give Pessimistic Bob fewer constraints, he can more easily imagine that no solution is possible? I think… this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math. Like, the way Bob fails to find a solution mostly looks like “not actually considering the space”, or “wasting consideration on easily-known-bad parts of the space”, and more constraints could help with both of those. But, as math, removing constraints can’t lower the volume of the implied space and so can’t make it less likely that a viable solution exists.
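To spell out the math claim in that last sentence (a trivial formalization I’m adding for concreteness, not something from the original dialogue): dropping constraints can only grow the feasible set.

```latex
C' \subseteq C \;\Longrightarrow\;
\{\,x : x \text{ satisfies every } c \in C\,\} \;\subseteq\; \{\,x : x \text{ satisfies every } c \in C'\,\}
```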
I think Eliezer thinks nearly all humans have such a bias by default, and so without clear evidence to the contrary it’s a reasonable suspicion for anyone.
[I think there’s a thing Eliezer does a lot, which I have mixed feelings about, which is matching people’s statements to patterns and then responding to the generator of the pattern in Eliezer’s head, which only sometimes corresponds to the generator in the other person’s head.]
Cool, makes sense. [I continue to think we disagree about how true this is in a practical sense, where I read you as thinking “yeah, this is a minor consideration, we have to think with the tools we have access to, which could be wrong in either direction and so are useful as a point estimate” and me as thinking “huh, this really seems like the tools we have access to are going to give us overly optimistic answers, and we should focus more on how to get tools that will give us more robust answers.”]
I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They’d go around telling me “Ray, you’re exhibiting that bias right now. Whatever rationalization you’re coming up with right now, it’s not the real reason you’re arguing X.” And I was like “c’mon man. I have a ton of introspective access to myself and I can tell that this ‘rationalization’ is actually a pretty good reason to believe X and I trust that my reasoning process is real.”
But… eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on “is Ray displaying rational thought?”. When they extrospected me (i.e. reading my facial expressions), they were checking for a positive result on “does Ray seem biased in this particular way?”.
And both checks totally returned ‘true’, and that was an accurate assessment.
The particular moment where I noticed this meta-pattern, I’d say my cognition was, say, 65% “good argumentation”, 15% “one particular bias”, 20% “other random stuff.” On a different day, it might have been that I was 65% exhibiting the bias and 15% doing good argumentation.
None of this is making much of a claim about what’s likely to be going on in Rohin’s head or Eliezer’s head, or whether Eliezer’s conversational pattern is useful, but I wanted to flag it as a way people could be talking past each other.
I think we’re imagining different toy mathematical models.
Your model, according to me:
There is a space of possible approaches, that we are searching over to find a solution. (E.g. the space of all possible programs.)
We put a layer of abstraction on top of this space, characterizing approaches by N different “features” (e.g. “is it goal-directed”, “is it an oracle”, “is it capable of destroying the world”)
Because we’re bounded agents, we then treat the features as independent, and search for some combination of features that would comprise a solution.
I agree that this procedure has a systematic error in claiming that there is a solution when none exists (and doesn’t have the opposite error), and that if this were an accurate model of how I was reasoning I should be way more worried about correcting for that problem.
My model:
There is a probability distribution over “ways the world could be”.
We put a layer of abstraction on top of this space, characterizing “ways the world could be” by N different “features” (e.g. “can you get human-level intelligence out of a pile of heuristics”, “what are the returns to specialization”, “how different will AI ontologies be from human ontologies”). We estimate the marginal probability of each of those features.
Because we’re bounded agents, when we need the joint probability of two or more features, we treat them as independent and just multiply.
Given a proposed solution, we estimate its probability of working by identifying which features need to be true of the world for the solution to work, and then estimate the probability of those features (using the method above).
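For concreteness, here is a minimal sketch of that four-step procedure as I understand it; the feature names and numbers are invented placeholders, not claims about the world or about Rohin’s actual estimates.

```python
# Steps 1-2: marginal probability estimates for a few "ways the world could be" features.
# (All names and numbers are illustrative placeholders.)
feature_marginals = {
    "heuristics_suffice_for_human_level": 0.6,
    "ai_ontologies_close_to_human": 0.4,
    "oversight_scales_with_capability": 0.5,
}

# Step 3: bounded-agent shortcut -- treat features as independent when a joint is needed.
def joint(features):
    p = 1.0
    for f in features:
        p *= feature_marginals[f]
    return p

# Step 4: score a proposed solution by the probability of the world-features it needs.
def p_solution_works(required_features):
    return joint(required_features)

print(p_solution_works(["ai_ontologies_close_to_human",
                        "oversight_scales_with_capability"]))  # 0.2
```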
I claim that this procedure doesn’t have a systematic error in the direction of optimism (at least until you add some additional details), and that this procedure more accurately reflects the sort of reasoning that I am doing.
Huh, why doesn’t that procedure have that systematic error?
Like, when I try to naively run your steps 1-4 on “probability of there existing a number that’s both even and odd”, I get that about 25% of numbers should be both even and odd, so it seems pretty likely that it’ll work out given that there are at least 4 numbers. But I can’t easily construct an argument at a similar level of sophistication that gives me an underestimate. [Like, “probability of there existing a number that’s both odd and prime” gives the wrong conclusion if you buy that the probability that a natural number is prime is 0, but this is because you evaluated your limits in the wrong order, not because of a problem with dropping all the covariance data from your joint distribution.]
My first guess is that you think I’m doing the “ways the world could be” thing wrong—like, I’m looking at predicates over numbers and trying to evaluate a predicate over all numbers, but instead I should just have a probability on “universe contains a number that is both even and odd” and its complement, as those are the two relevant ways the world can be.
My second guess is that you’ve got a different distribution over target predicates; like, we can just take the complement of my overestimate (“probability of there existing no numbers that are both even and odd”) and call it an underestimate. But I think I’m more interested in ‘overestimating existence’ than ‘underestimating non-existence’. [Is this an example of the ‘additional details’ you’re talking about?]
Also maybe you can just exhibit a simple example that has an underestimate, and then we need to think harder about how likely overestimates and underestimates are to see if there’s a net bias.
It’s the first guess.
I think if you have a particular number then I’m like “yup, it’s fair to notice that we overestimate the probability that x is even and odd by saying it’s 25%”, and then I’d say “notice that we underestimate the probability that x is even and divisible by 4 by saying it’s 12.5%”.
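For concreteness, a quick sketch that checks both toy estimates in this exchange against exact frequencies (the 25% and 12.5% figures are the independence estimates discussed above):

```python
# Compare the independence shortcut against exact frequencies over 1..N.
N = 10_000
nums = range(1, N + 1)

def freq(pred):
    return sum(1 for n in nums if pred(n)) / N

p_even = freq(lambda n: n % 2 == 0)
p_odd = freq(lambda n: n % 2 == 1)
p_div4 = freq(lambda n: n % 4 == 0)

# "even and odd": independence says 0.25, the truth is 0 (overestimate).
print(p_even * p_odd, freq(lambda n: n % 2 == 0 and n % 2 == 1))

# "even and divisible by 4": independence says 0.125, the truth is 0.25 (underestimate).
print(p_even * p_div4, freq(lambda n: n % 2 == 0 and n % 4 == 0))
```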
I agree that if you estimate a probability, and then “perform search” / “optimize” / “run n copies of the estimate” (so that you estimate the probability as 1 - (1 - P(event))^n), then you’re going to have systematic errors.
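As a toy illustration of that kind of systematic error (all numbers invented): suppose ten candidate schemes all secretly depend on the same 10%-likely feature of the world. The independence formula says ~65%, but a single draw of the world decides all ten schemes at once.

```python
import random

random.seed(0)
p_feature, n_schemes, trials = 0.1, 10, 100_000

# Independence formula: 1 - (1 - p)^n.
naive = 1 - (1 - p_feature) ** n_schemes

# Correlated reality: every scheme works iff the one shared feature holds.
hits = sum(random.random() < p_feature for _ in range(trials))

print(f"independence estimate: {naive:.2f}, correlated truth: {hits / trials:.2f}")
# -> independence estimate: 0.65, correlated truth: ~0.10
```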
I don’t think I’m doing anything that’s analogous to that. I definitely don’t go around thinking “well, it seems 10% likely that such and such feature of the world holds, and so each alignment scheme I think of that depends on this feature has a 10% chance of working, therefore if I think of 10 alignment schemes I’ve solved the problem”. (I suspect this is not the sort of mistake you imagine me doing but I don’t think I know what you do imagine me doing.)
Cool, I like this example.
I think the thing I’m interested in is “what are our estimates of the output of search processes?”. The question we’re ultimately trying to answer with a model here is something like “are humans, when they consider a problem that could have attempted solutions of many different forms, overly optimistic about how solvable those problems are because they hypothesize a solution with inconsistent features?”
The example of “a number divisible by 2 and a number divisible by 4” is one where the consistency of your solution helps you: anything that satisfies the second condition already satisfies the first. But importantly, the best that consistency can do for you is let you ignore superfluous conditions; it can’t increase the volume of the solution space. I think this is where the systematic bias is coming from: the true joint probability of two conditions can’t be higher than the smaller of the two marginals, and it can fall well below the independence product when the conditions pull against each other, so the product isn’t an unbiased estimator of the joint.
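To make that bound explicit (these are the standard Fréchet inequalities, added here for reference rather than taken from the comment):

```latex
\max\bigl(0,\ P(A) + P(B) - 1\bigr) \;\le\; P(A \wedge B) \;\le\; \min\bigl(P(A),\ P(B)\bigr),
\qquad \text{while independence assumes } P(A \wedge B) = P(A)\,P(B).
```

When the two conditions are in tension (negatively correlated), the true joint sits below the independence product, which is exactly the direction of error that matters for design problems with trade-offs.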
For example, consider this recent analysis of cultured meat, which seems to me to point out a fundamental inconsistency of this type in people’s plans for creating cultured meat. Basically, the bigger you make a bioreactor, the better it looks on criteria ABC, and the smaller you make a bioreactor, the better it looks on criteria DEF, and projections seem to suggest that massive progress will be made on all of those criteria simultaneously because progress can be made on them individually. But this necessitates making bioreactors that are simultaneously much bigger and much smaller!
[Sometimes this is possible, because actually one is based on volume and the other is based on surface area, and so when you make something like a zeolite you can combine massive surface area with tiny volume. But if you need massive volume and tiny surface area, that’s not possible. Anyway, in this case, my read is that both of these are based off of volume, and so there’s no clever technique like that available.]
Maybe you could step me thru how your procedure works for estimating the viability of cultured meat, or the possibility of constructing a room-temperature superconductor at under 10 atm, or something?
It seems to me like there’s a version of your procedure which, like, considers all of the different possible factory designs, applies some functions to determine the high-level features of those designs (like profitability, amount of platinum they consume, etc.), and then when we want to know “is there a profitable cultured meat factory?” responds with “conditioning on profitability > 0, this is the set of possible designs.” And then when I ask “is there a profitable cultured meat factory using less than 1% of the platinum available on Earth?” says “sorry, that query is too difficult; I calculated the set of possible designs conditioned on profitability, calculated the set of possible designs conditioned on using less than 1% of the platinum available on Earth, and then <multiplied sets together> to give you this approximate answer.”
But of course that’s not what you’re doing, because the boundedness prevents you from considering all the different possible factory designs. So instead you have, like, clusters of factory designs in your map? But what are those objects, and how do they work, and why don’t they have the problem of not noticing inconsistencies because they didn’t fully populate the details? [Or if they did fully populate the details for some limited number of considered objects, how do you back out the implied probability distribution over the non-considered objects in a way that isn’t subject to this?]
Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn’t know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.
To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understanding of causality, compression of the giant action space, etc. Everyone looked at this giant list of necessary features and thought “it’s highly improbable for an algorithm to demonstrate all of these features”. My understanding is that even OpenAI, the most optimistic of everyone, thought they would need to do some sort of hierarchical RL to get this to work. In the end, it turned out that vanilla PPO with reward shaping and domain randomization was enough. It turns out that all of these many different capabilities / features were very consistent with each other and easier to achieve simultaneously than we thought.
Tbc, I don’t want to claim “unbiased estimator” in the mathematical sense of the phrase. To even make such a claim you need to choose some underlying probability distribution which gives rise to our features, which we don’t have. I’m more saying that the direction of the bias depends on whether your features are positively vs. negatively correlated with each other and so a priori I don’t expect the bias to be in a predictable direction.
They definitely have that problem. I’m not sure how you don’t have that problem; you’re always going to have some amount of abstraction and some amount of inconsistency; the future is hard to predict for bounded humans, and you can’t “fully populate the details” as an embedded agent.
If you’re asking how you notice any inconsistencies at all (rather than all of the inconsistences), then my answer is that you do in fact try to populate details sometimes, and that can demonstrate inconsistencies (and consistencies).
I can sketch out a more concrete, imagined-in-hindsight-and-therefore-false story of what’s happening.
Most of the “objects” are questions about the future to which there are multiple possible answers, which you have a probability distribution over (you can think of this as a factor in a Finite Factored Set, with an associated probability distribution over the answers). For example, you could imagine a question for “number of AGI orgs with a shot at time X”, “fraction of people who agree alignment is a problem”, “amount of optimization pressure needed to avoid deception”, etc. If you provide answers to some subset of questions, that gives you an incomplete possible world (which you could imagine as an implicitly-represented set of possible worlds if you want). Given an incomplete possible world, to answer a new question quickly you reason abstractly from the answers you are conditioning on to get an answer to the new question.
When you have lots of time, you can improve your reasoning in many different ways:
You can find other factors that seem important, add them in, subdividing worlds out even further.
You can take two factors, and think about how compatible they are with each other, building intuitions about their joint (rather than just their marginal probabilities, which is what you have by default).
You can take some incomplete possible world, sketch out lots of additional concrete details, and see if you can spot inconsistencies.
You can refactor your “main factors” to be more independent of each other. For example, maybe you notice that all of your reasoning about things like “<metric> at time X” depends a lot on timelines, and so you instead replace them with factors like “<metric> at X years before crunch time”, where they are more independent of timelines.
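Here is a rough sketch of that factored picture; the factor names, answers, and numbers are all invented placeholders meant to make the structure concrete, not a rendering of Rohin’s actual model.

```python
# Each factor has a marginal distribution over its possible answers; an
# "incomplete possible world" fixes answers for a subset of factors.
factors = {
    "agi_orgs_with_a_shot": {1: 0.3, 3: 0.5, 10: 0.2},
    "alignment_seen_as_real_problem": {True: 0.4, False: 0.6},
}

def p_world(assignment, pair_adjustments=None):
    """Multiply marginals by default (independence), then apply hand-built
    corrections for pairs whose joint you've actually thought about."""
    p = 1.0
    for factor, answer in assignment.items():
        p *= factors[factor][answer]
    for (a, b), multiplier in (pair_adjustments or {}).items():
        if a in assignment and b in assignment:
            p *= multiplier
    return p

# Default: treat the two factors as independent.
print(p_world({"agi_orgs_with_a_shot": 1,
               "alignment_seen_as_real_problem": True}))  # 0.12

# After building intuitions about the joint, e.g. judging these two answers
# to co-occur half as often as independence suggests:
print(p_world({"agi_orgs_with_a_shot": 1,
               "alignment_seen_as_real_problem": True},
              pair_adjustments={("agi_orgs_with_a_shot",
                                 "alignment_seen_as_real_problem"): 0.5}))  # 0.06
```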
To me it seems like this is what you should expect other people to look like both when other people know less about a domain than you do, and also when you’re overconfident about your understanding of that domain. So I don’t think it helps distinguish those two cases.
(Also, to me it seems like a similar thing happens, but with the positions reversed, when Paul and Eliezer try to forecast concrete progress in ML over the next decade. Does that seem right to you?)
I believe this was discussed further at some point—I argued that Eliezer-style political history books also exclude statements like “and then we survived the cold war” or “most countries still don’t have nuclear energy”.
It feels similar but clearly distinct? Like, in that situation Eliezer often seems to say things that I parse as “I don’t have any special knowledge here”, which seems like a different thing than “I can’t easily sample from my distribution over how things go right”, and I also have the sense of Paul being willing to ‘go specific’ and Eliezer not being willing to ‘go specific’.
You’re thinking of this bit of the conversation, starting with:
(Or maybe a bit earlier and later, but that was my best guess for where to start the context.)
The main quotes from the middle that seem relevant:
and ending with:
Rereading that section, my sense is that it reads like a sort of mirror of the Eliezer->Paul “I don’t know how to operate your view” section; like, Eliezer can say “I think nukes are less worrying for reasons ABC, also you can observe me being not worried about other things-people-are-concerned-by XYZ”, but I wouldn’t have expected you (or a reader who hasn’t picked up Eliezer-thinking from elsewhere) to come away from that knowing why you, trying to be Eliezer from the 1930s, would have thought ‘and then it turned out okay’ would have been a political-history-book sentence, or what the relative magnitudes of the surprise are. [Like, I think my 1930s-Eliezer puts like 3-30% on “and then it turned out okay” for nukes, and my 2020s-Eliezer puts like 0.03-3% on that for AGI? But it’d be nice to hear whether Eliezer thinks AGI turning out as well as nukes is like 10x the surprise of nukes turning out this well, conditioned on pre-1930s knowledge, or more like 1000x the surprise.]
This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today:
That is all.
(Obviously there’s a kinda superficial resemblance here to the phenomenon of “calling out” somebody else; I want to state outright that this is not the intention, it’s just that I saw your comment right after seeing Rohin’s comment, in such a way that my memory of his remark was still salient enough that the connection jumped out at me. Since salient observations tend to fade over time, I wanted to put this down before that happened.)
Yeah, I’m also interested in the question of “how do we distinguish ‘sentences-on-mainline’ from ‘shoring-up-edge-cases’?”, or which conversational moves most develop shared knowledge, or something similar.
Like I think it’s often good to point out edge cases, especially when you’re trying to formalize an argument or look for designs that get us out of this trap. In another comment in this thread, I note that there’s a thing Eliezer said that I think is very important and accurate, and also think there’s an edge case that’s not obviously handled correctly.
But also my sense is that there’s some deep benefit from “having mainlines” and conversations that are mostly ‘sentences-on-mainline’? Or, like, there’s some value to more people thinking thru / shooting down their own edge cases (like I do in the mentioned comment), instead of pushing the work to Eliezer. I’m pretty worried that there are deeply general reasons to expect AI alignment to be extremely difficult, that people aren’t updating on the meta-level point and continue to attempt ‘rolling their own crypto’ and asking whether Eliezer can poke a hole in each new procedure, and that if Eliezer ever decides to just write serial online fiction until the world explodes, humanity won’t have developed enough capacity to replace him.
(For object-level responses, see comments on parallel threads.)
I want to push back on an implicit framing in lines like:
This makes it sound like the rest of us don’t try to break our proposals, that we instead push the work to Eliezer, agree with Eliezer when he finds a problem, and then fail to update that maybe future proposals will have problems too.
Whereas in reality, I try to break my proposals, don’t agree with Eliezer’s diagnoses of the problems, and usually don’t ask Eliezer because I don’t expect his answer to be useful to me (and previously didn’t expect him to respond). I expect this is true of others (like Paul and Richard) as well.
Yeah, sorry about not owning that more, and for the frame being muddled. I don’t endorse the “asking Eliezer” or “agreeing with Eliezer” bits, but I do basically think he’s right about many of the object-level problems he identifies (and thus people disagreeing with him about those is not a feature), and I think ‘security mindset’ is the right orientation to have towards AGI alignment. That hypothesis is a ‘worry’ primarily because asymmetric costs mean it’s more worth investigating than the raw probability would suggest. [Tho the raw probabilities of its components do feel pretty substantial to me.]
[EDIT: I should say I think ARC’s approach to ELK seems like a great example of “people breaking their own proposals”. As additional data to update on, I’d be interested in seeing, like, a graph of people’s optimism about ELK over time, or something similar.]
I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).
But:
I do note that there are some situations where rushing to tell a ‘mainline story’ might be the wrong move:
Maybe your beliefs feel wildly unstable day-to-day—because you’re learning a lot quickly, or because it’s just hard to know how to assign weight to the dozens of different considerations that bear on these questions. Then trying to take a quick snapshot of your current view might feel beside the point.
It might even feel actively counterproductive, like rushing too quickly to impose meaning/structure on data when step one is to make sure you have the data properly loaded up in your head.
Maybe there are many scenarios that seem similarly likely to you. If you see ten very different ways things could go, each with ~10% subjective probability, then picking a ‘mainline’ may be hard, and may require a bunch of arbitrary-feeling choices about which similarities-between-scenarios you choose to pay attention to.