I often find myself in the middle of a discussion wanting to reference some simple but important idea or point, only to find that no writeup of it exists. Often my reaction is “if only there were time to write an LW post that I could then link to in the future”. So far I’ve just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I’m now going to experiment with collecting the ideas in subcomments here; perhaps other people will write posts about them at some point, if they’re even understandable.
<unfair rant with the goal of shaking people out of a mindset>
To all of you telling me or expecting me to update to shorter timelines given <new AI result>: have you ever encountered Bayesianism?
Surely if you did, you’d immediately reason that you couldn’t know how I would update, without first knowing what I expected to see in advance. Which you very clearly don’t know. How on earth could you know which way I should update upon observing this new evidence? In fact, why do you even care about which direction I update? That too shouldn’t give you much evidence if you don’t know what I expected in the first place.
Maybe I should feel insulted? That you think so poorly of my reasoning ability that I should be updating towards shorter timelines every time some new advance in AI comes out, as though I hadn’t already priced that into my timeline estimates, and so would predictably update towards shorter timelines in violation of conservation of expected evidence? But that only follows if I expect you to be a good reasoner modeling me as a bad reasoner, which probably isn’t what’s going on.
</unfair rant>
My actual guess is that people notice a discrepancy between their very-short timelines and my somewhat-short timelines, and then they want to figure out what causes this discrepancy, and an easily-available question is “why doesn’t X imply short timelines” and then for some reason that I still don’t understand they instead substitute the much worse question of “why didn’t you update towards short timelines on X” without noticing its major flaws.
Fwiw, I was extremely surprised by OpenAI Five working with just vanilla PPO (with reward shaping and domain randomization), rather than requiring any advances in hierarchical RL. I made one massive update then (in the sense that I immediately started searching for a new model that explained that result; it did take over a year to get to a model I actually liked). I also basically adopted the bio anchors timelines when that report was released (primarily because it agreed with my model, elaborated on it, and then actually calculated out its consequences, which I had never done because it’s actually quite a lot of work). Apart from those two instances I don’t think I’ve had major timeline updates.
I think it’s possible some people are asking these questions disrespectfully, but re: bio anchors, I do think that the report makes a series of assumptions whose plausibility can change over time, and thus your timelines can shift as you reweight different bio anchors scenarios while still believing in bio anchors.
To me, the key update on bio anchors is that I no longer believe the preemptive update against the human lifetime anchor. That update was justified largely on the grounds that “someone could’ve done it already” and “ML is very sample-inefficient”, but both grounds deserve reevaluation. As we get closer, systems like PaLM exhibit capabilities remarkable enough that I’m not sold that a different training setup couldn’t be doing really good RL with the same data/compute (implying the bottleneck could just be algorithmic progress); separately, few-shot learning is now much more common than the many-shot learning of prior ML progress.
I still think that the “number of RL episodes lasting Y seconds with the agent using X flop/s” anchor is a separate good one. While I’m now much less convinced we’ll need the 1e16 flop/s models estimated in bio anchors (and separately, Chinchilla scaling laws plus conservation of expected evidence about further improvements weren’t incorporated into the exponent and should probably shift it down), I think the NN anchors still have predictive value and slightly lengthen timelines.
Also, though, insofar as people are asking you to update on Gato, I agree that makes little sense.
I agree your timelines can and should shift based on evidence even if you continue to believe in the bio anchors framework.
Personally, I completely ignore the genome anchor, and I don’t buy the lifetime anchor or the evolution anchor very much (I think the structure of the neural net anchors is a lot better and more likely to give the right answer).
Animals with smaller brains (like bees) are capable of few-shot learning, so I’m not really sure why observing few-shot learning is much of an update. See e.g. this post.
Essentially, the problem is that ‘evidence that shifts Bio Anchors weightings’ is quite different, more restricted, and much harder to define than the straightforward ‘evidence of impressive capabilities’. However, the reason that I think it’s worth checking if new results are updates is that some impressive capabilities might be ones that shift bio anchors weightings. But impressiveness by itself tells you very little.
I think a lot of people with very short timelines are imagining the only possible alternative view as being ‘another AI winter, scaling laws bend, and we don’t get excellent human-level performance on short-term language-specified tasks anytime soon’, and don’t see the further question of figuring out exactly what human-level performance on e.g. MMLU would imply.
This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn’t another AI winter, rather it’s that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning because that doesn’t come “for free” from competence on short-term tasks, if you’re as sample-inefficient as current ML is.
So what we’re really looking for isn’t systems getting progressively better and better at short-horizon language tasks. That’s something that either the lifetime-anchor Bio Anchors view or the original Bio Anchors view predicts, and we need something that discriminates between the two.
We have some (indirect) evidence that original bio anchors is right: namely that it being wrong implies evolution missed an obvious open goal to make bees and mice generally intelligent long term planners, and that human beings generally aren’t vastly better than evolution at designing things anyway, and the lifetime anchor would imply that AGI is a glaring exception to this general trend.
As evidence, this has the advantage of being about something that really happened: human beings are the only human-level general intelligence that exists so far, so we have very good reasons to think matching the human brain is sufficient. However, it has the disadvantage of all the usual disanalogies between evolution and its requirements, and human designers and our requirements. Maybe this just is one of those situations where we can outdo evolution: that’s not especially unlikely.
What’s the evidence on the other side (i.e. against original bio anchors and for the lifetime anchor)?
There are two kinds that I tend to hear. One is that short-horizon competence is enough for dangerous/transformative capabilities. E.g. the claim that if you can build something that’s “human level/superhuman at charisma/persuasion/propaganda/manipulation, at least on short timescales” that represents a gigantic existential risk factor that condemns us to disaster further down the line (the AI PONR idea), or that at this point actors with bad incentives will be far too influential/wealthy/advancing the SOTA in AI.
However, I’d consider this changing the subject: essentially it’s not an argument for AGI takeover soon, rather it’s an argument for ‘certain narrow AIs are far more dangerous than you realize’. That means you have to go all the way back to the start and argue for why such things would be catastrophic in the first place. We can’t rely on the simple “it’ll be superintelligent and seize a DSA”.
Suppose we get such narrow AIs that can do most short-term tasks for which there’s data, but don’t generalize to long horizons consistently. This scenario 10 years from now looks something like: AI automates away lots of jobs, can do certain kinds of short-term persuasion and manipulation, can speed up capabilities and alignment research, but not fully replace human researchers. Some of these AIs are agentic and possibly also misaligned (in ways that are detectable and fall far short of the ability to take over, since by assumption they aren’t competitive with humans at long-term planning).

This certainly seems wild and full of potential danger, where slowing down progress could be much harder. It also looks like a scenario with far more attention on AI alignment than today, where the current funders of alignment research are much wealthier than now, and with plenty of obvious examples of what the problem is to catch people’s attention. Overall, it doesn’t seem like a scenario where (current AI alignment researchers + whoever else is working on it in 10 years) have considerably less leverage over the future than now: it could easily be more.
The other reason for favouring the lifetime anchor is that you get long-horizon competence for free once you’re excellent at (a given list of) short-horizon tasks. This is arguing, more or less, that for the tasks that matter, current architectures are brainlike in their efficiency, such that the lifetime anchor makes more sense. A lot of the arguments in favour of this have a structure roughly like: look at a wide-ranging comprehension benchmark like MMLU—when an AI is human level on all of this, it’ll be able to keep a train of thought running continuously, keep a working memory and plan over very long timescales the same way humans do.
As evidence, this has the significant advantage of being relevant and not having to deal with the vagaries of what tradeoffs evolution may have made differently to human engineers. It has the disadvantage of being fiction. Or at least evidence that’s not yet been observed. You see AIs getting more and more impressive at a wider range of short-horizon tasks, which is roughly compatible with either view, but you don’t observe the described outcome of them generalizing out to much longer-term tasks than that.
So, to return to the original question, what would count as (additional) evidence in favour of the lifetime anchor? The answer clearly can’t be “nothing”, since if we build AGI in 5 years, that counts.
I think the answer is, anything that looks like unexpectedly cheap, easy, ‘for free’ generalization from relatively shorter to relatively longer horizon tasks (e.g. from single reasoning steps to many reasoning steps) without much fine-tuning.
This unexpected evidence is very tricky to operationalize. Default bio anchors assumes we’ll see a certain degree of generalizing from shorter to longer horizon tasks, and that we’ll see AI get better and better sample-efficiency on few-shot tasks, since it assumes that in 20 or so years we’ll get enough of such generalization to get AGI. I guess we just need to look for ‘more of it than we expected to see’?
That seems very hard to judge, since you can’t read off predictions about subhuman capabilities from bio anchors like that.
when an AI is human level on all of this, it’ll be able to keep a train of thought running continuously.
It does not seem to me like “can keep a train of thought running” implies “can take over the world” (or even “is comparable to a human”). I guess the idea is that with a train of thought you can do amplification? I’d be pretty surprised if train-of-thought-amplification on models of today (or 5 years from now) led to novel high quality scientific papers, even in fields that don’t require real-world experimentation.
I think this is the best writeup about this I’ve seen, and I agree with the main points, so kudos!
I do think that evidence of increasing returns to scale from multi-step chain-of-thought prompting is another weak datapoint in favor of the human lifetime anchor.
I also think there are pretty reasonable arguments that NNs may be more efficient than the human brain at converting flops to capabilities, e.g. if SGD is a better version of the best algorithm that can be implemented on biological hardware. Similarly, humans are exposed to a much smaller diversity of data than LMs (the internet is big and weird), and thus they may get more “novelty” per flop and thus generalize better from less data. My main point here is just that “biology is optimal” isn’t as strong a rejoinder when we’re comparing a process so different from what biology did.
Let’s say you’re trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?
(I’m assuming here that you can’t defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)
First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.
Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.
This soft minimum on time is going to depend on a bunch of things—how “hard” or “complex” or “high-dimensional” the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
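The two assumptions above (P(true) logistic in the number of insights, insights logarithmic in time spent) compose into a simple toy model. A minimal sketch in Python, where every constant is an illustrative placeholder rather than a claim from the text:

```python
import math

def p_true(hours_spent, soft_min_hours=10.0, steepness=1.5):
    """Toy model: P(your conclusion is true) is logistic in log(time spent).

    Below soft_min_hours you hover near chance (0.5 for a binary question);
    well above it you approach certainty. soft_min_hours and steepness are
    illustrative placeholders standing in for domain difficulty, how smart
    you are, how much empirical data you have, etc.
    """
    x = steepness * (math.log(hours_spent) - math.log(soft_min_hours))
    # Rescale a standard logistic from (0, 1) to (0.5, 1).
    return 0.5 + 0.5 / (1.0 + math.exp(-x))
```

With these placeholder constants, one hour of thought leaves you near chance while a few hundred hours gets you most of the way to certainty; the point is only the shape of the curve, not the particular numbers.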
A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually “idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects”, without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be “I think you need to have thought about this for more time than you have before I expect you to do better than random”.
If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.
I wish we would stop talking about what is “fair” to expect of AI systems in AI alignment*. We don’t care what is “fair” or “unfair” to expect of the AI system, we simply care about what the AI system actually does. The word “fair” comes along with a lot of connotations, often ones which actively work against our goal.
At least twice I have made an argument where I posed a story in which an AI system fails to an AI safety researcher, and I have gotten the response “but that isn’t fair to the AI system” (because it didn’t have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.
(This sort of thing happens with mesa optimization—if you have two objectives that are indistinguishable on the training data, it’s “unfair” to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn’t change the fact that such an AI system might cause an existential catastrophe.)
In both cases I mentioned that what we care about is actual outcomes, and that you can tell such stories where, in actual reality, the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It’s not that the people I was talking to didn’t understand the point; it’s that some mental heuristic of “be fair to the AI system” fired and temporarily led them astray.
Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:
If all information pointed towards a statement being true when it was made, then it would appear that the AI system was displaying the behavior we would see from the desired algorithm, and so a positive reward would be more appropriate than a negative reward, despite the fact that the AI system produced a false statement. Similarly, if the AI system cannot recognize the statement as a potential falsehood, providing a negative reward may just add noise to the gradient rather than making the system more truthful.
* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.
I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn’t/can’t know the truth, compared to a “strict liability” regime.
Two strategies for making predictions about the world:

1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by “rolling out” that model
2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.
I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.
However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.
I think many people on LW tend to use option 1 almost always and my “deference” to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?
Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).
Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven’t explained them here.
EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won’t be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.
Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?
(My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.)
Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?
Correct.
My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.
I didn’t say you should defer to experts, just that if you try to build gears-y models you’ll be wrong. It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily “for reasonably complex systems, if you try to do 1 without expertise, you will fail, but you may not realize you have failed”.
I’m also noticing I mean something slightly different by “expertise” than is typically meant. My intended meaning of “expertise” is more like “you have lots of data and observations about the system”, e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.
I have been doing political betting for a few months and informally compared my success with strategies 1 and 2.
Ex. Predicting the Iranian election
I write down the 10 most important Iranian political actors (Khamenei, Mojtaba, Raisi, a few opposition leaders, the IRGC commanders). I find a public statement about their preferred outcome, and I estimate their power and salience. So Khamenei would be preference = leans Raisi, power = 100, salience = 40. Rouhani would be preference = strong Hemmati, power = 30, salience = 100. Then I find the weighted average position. It’s a bit more complicated because I have to linearize preferences, but yeah.
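A minimal sketch of this weighted-average method in Python. The linearized preference numbers are invented for illustration (linearization is exactly the part glossed over above):

```python
def weighted_position(actors):
    """Power- and salience-weighted average of actor preferences.

    Each preference is linearized onto a 0-100 scale
    (0 = strong Hemmati, 100 = strong Raisi); each actor's weight is
    power * salience. All numbers here are illustrative.
    """
    total_weight = sum(a["power"] * a["salience"] for a in actors)
    return sum(a["pref"] * a["power"] * a["salience"] for a in actors) / total_weight

# Two of the actors mentioned above, with invented linearized preferences:
actors = [
    {"name": "Khamenei", "pref": 65, "power": 100, "salience": 40},  # leans Raisi
    {"name": "Rouhani", "pref": 0, "power": 30, "salience": 100},    # strong Hemmati
]
```

With these two actors alone, the weighted position comes out around 37, leaning toward Hemmati despite Khamenei’s greater power, because Rouhani’s salience is so much higher; a real estimate would include all ten actors.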
The 2-strat is to predict repeated past events. The opposition has won the last three contested elections in surprise victories, so predict the same outcome.
I have found 2 is actually pretty bad. Guess I’m an expert tho.
The opposition has won the last three contested elections in surprise victories, so predict the same outcome.
That seems like a pretty bad 2-strat. Something that has happened three times is not a “stable high-level feature of the world”. (Especially if the preceding time it didn’t happen, which I infer since you didn’t say “the last four contested elections”.)
If that’s the best 2-strat available, I think I would have ex ante said that you should go with a 1-strat.
One way to communicate about uncertainty is to provide explicit probabilities, e.g. “I think it’s 20% likely that [...]”, or “I would put > 90% probability on [...]”. Another way to communicate about uncertainty is to use words like “plausible”, “possible”, “definitely”, “likely”, e.g. “I think it is plausible that [...]”.
People seem to treat the words as shorthands for probability statements. I don’t know why you’d do this: it loses information and increases miscommunication for basically no reason. It’s maybe slightly more idiomatic English, but it’s not even much longer to just put the number into the sentence! (And you don’t have to have precise numbers; you can use ranges or inequalities, if that’s what you’re using the words to mean.)
According to me, probabilities are appropriate for making decisions so you can estimate the EV of different actions. (This can also extend to the case where you aren’t making a decision, but you’re talking to someone who might use your advice to make decisions, but isn’t going to understand your reasoning process.) In contrast, words are for describing the state of your reasoning algorithm, which often doesn’t have much to do with probabilities.
A simple model here is that your reasoning algorithm has two types of objects: first, strong constraints that rule out some possible worlds (“An AGI system that is widely deployed will have a massive impact”), and second, an implicit space of imagined scenarios in which each scenario feels similarly unsurprising if you observe it (this is the human emotion of surprise, not the probability-theoretic definition of surprise; one major difference is that the human emotion of surprise often doesn’t increase with additional conjuncts). Then, we can define the meanings of the words as follows:
Plausible: “This is within my space of imagined scenarios.”
Possible: “This isn’t within my space of imagined scenarios, but it doesn’t violate any of my strong constraints.” (Though unfortunately it can also mean “this violates a strong constraint, but I recognize that my constraint may be wrong, so I don’t assign literally zero probability to it”.)
Definitely: “If this weren’t true, that would violate one of my strong constraints.”
Implausible: “This violates one of my strong constraints.” (Given other definitions, this should really be called “impossible”, but unfortunately that word is unavailable (though I think Eliezer uses it this way anyway).)
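One way to make the model concrete is a toy labeling function. The scenario representation and the constraint below are invented for illustration; “definitely” applies to claims rather than single scenarios, so it appears only as a comment:

```python
def label(scenario, strong_constraints, imagined_scenarios):
    """Label a scenario per the definitions above.

    strong_constraints: predicates that rule out possible worlds.
    imagined_scenarios: the set of scenarios that feel unsurprising.
    ("Definitely" would be a property of a claim: it holds in every
    scenario that doesn't violate a strong constraint.)
    """
    if any(not constraint(scenario) for constraint in strong_constraints):
        return "implausible"  # violates a strong constraint
    if scenario in imagined_scenarios:
        return "plausible"    # inside the space of imagined scenarios
    return "possible"         # not imagined, but no constraint violated

# Illustrative constraint: a widely deployed AGI has a massive impact.
constraints = [lambda s: "widely deployed" not in s or "massive impact" in s]
imagined = {frozenset({"widely deployed", "massive impact"})}
```

So `label(frozenset({"widely deployed"}), constraints, imagined)` returns `"implausible"`, while a scenario that satisfies the constraint but sits outside the imagined set returns `"possible"`.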
When someone gives you an example scenario, it’s much easier to label it with one of the four words above, because their meanings are much much closer to the native way in which brains reason (at least, for my brain). That’s why I end up using these words rather than always talking in probabilities. To convert these into actual probabilities, I then have to do some additional reasoning in order to take into account things like model uncertainty, epistemic deference, conjunctive fallacy, estimating the “size” of the space of imagined scenarios to figure out the average probability for any given possibility, etc.
I like this, but it feels awkward to say that something can be not inside a space of “possibilities” but still be “possible”. Maybe “possibilities” here should be “imagined scenarios”?
“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.
Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).
I’m not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?
(Maybe caveating the whole post under “this is not currently well argued, but I wanted to get the ball rolling on having some kind of link”)
That said, my main counterargument is: “Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’”
I think ‘burden of proof’ isn’t quite the right frame but there is something there that still seems important. I think the bad thing comes from distinguishing epistemics vs Overton-norm-fighting, which are in fact separate.
maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?
I don’t really want this responsibility, which is part of why I’m doing all of these on the shortform. I’m happy for you to copy it into a top-level post of your own if you want.
Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued’.
I agree this makes sense, but then say “I’m not looking into this because it hasn’t been well argued (and my time/attention is limited)”, rather than “I don’t believe this because it hasn’t been well argued”.
Sometimes people say “look at these past accidents; in these cases there were giant bureaucracies that didn’t care about safety at all, therefore we should be pessimistic about AI safety”. I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.
It’s easiest to see with a Bayesian treatment. Let’s say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don’t care about safety? Almost all of them—even if 90% of people care about safety, there will still be some cases where people didn’t care and accidents happened; and of course we’d hear about them if so (and not hear about the cases where accidents didn’t happen). You can get a strong update against 99.9999% and higher, but by the time you’re at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don’t learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many “potential accidents” there could have been).
However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn’t have prevented, so I don’t have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and “all we have to do” is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)
Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.
In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.
By “selection”, I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.
Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.
You’ve heard of crucial considerations, but have you heard of red herring considerations?
These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn’t affect anything decision-relevant.
To solve a problem quickly, it’s important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.
For example, it might seem like “what is the right system of ethics” is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.
Here’s an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).
When you make an argument about a person or group of people, often a useful thought process is "can I apply this argument to myself or a group that includes me? If this isn't a type error, but I disagree with the conclusion, what's the difference between me and them that makes the argument apply to them but not me? How convinced am I that they actually differ from me on this axis?"
An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:
In small populations, even strong incentives for X may not get you much more of X, since there isn’t a large enough population for there to be much deviation on X to select on.
It’s pretty pointless to tell individual people to “buck the incentives”, even if they are principled people who try to avoid doing bad things: if they take your advice, they probably just get selected against.
Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made—rather than thinking of a comment thread as “this is trying to ascertain whether X is true”, they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.
I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).
Let’s say we’re talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.
Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here’s a fully general counterargument that Alice is wrong:
Decompose P into a series of conjunctions Q1, Q2, … Qn, with n > 10. (You can first decompose not-P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)
Ask Alice to estimate P(Qk | Q1, Q2, … Q{k-1}) for all k.
At least one of these must be over 99% (if n = 11 and they were all exactly 99%, the probability of P would be 0.99 ^ 11 ≈ 89.5%, contradicting the original 90%).
Argue that Alice can’t possibly have enough knowledge to place under 1% on the negation of the statement.
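The arithmetic in that parenthetical checks out; as a quick sanity check:

```python
# Eleven conjuncts each held at 99% confidence multiply out to less
# than the original 90% confidence in P, so at least one conditional
# probability must exceed 99%.
n = 11
product = 0.99 ** n
print(round(product, 4))  # prints 0.8953, i.e. below the claimed 90%
```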
----
What’s the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.
In that example, X is “AI will not take over the world”, so Y makes X more likely. So if someone comes to me and says “If we use <technique>, then AI will be safe”, I might respond, “well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>”.
I don’t think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I’m explicitly trying for this to be a low-effort thing, so I’m not going to try to write more examples now.
EDIT: Actually, the double descent comment below has a similar structure, where X = “double descent occurs because we first fix bad errors and then regularize”, and Y = “we’re using an MLP / CNN with relu activations and vanilla gradient descent”.
In fact, the AUP power comment does this too, where X = “we can penalize power by penalizing the ability to gain reward”, and Y = “the environment is deterministic, has a true noop action, and has a state-based reward”.
Maybe another way to say this is:
I endorse applying the “X proves too much” argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an “X proves too much” argument to an impossible scenario.)
“Minimize AI risk” is not the same thing as “maximize the chance that we are maximally confident that the AI is safe”. (Somewhat related comment thread.)
The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then do the majority vote.
A particular flawed response is to look for N opinions that say “intervening is net negative” and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)
However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.
I didn’t get it… is the problem with the “look for N opinions” response that you aren’t computing the denominator (|”intervening is positive”| + |”intervening is negative”|)?
Yes, that’s the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.
(This is under the simple model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
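To make that concrete, here is a small simulation under exactly that model (with my own illustrative parameters: 21 participants, standard-normal noise). Majority vote tracks the sign of the true value much better than a "look for N = 3 negative opinions" rule:

```python
import random

random.seed(0)

def majority_vote(obs):
    # Intervene iff most participants' noisy estimates are positive.
    return sum(o > 0 for o in obs) > len(obs) / 2

def threshold_rule(obs, n=3):
    # Intervene iff we cannot find n opinions saying "net negative".
    return sum(o < 0 for o in obs) < n

def error_rate(rule, true_value, k=21, trials=2000):
    errors = 0
    for _ in range(trials):
        obs = [true_value + random.gauss(0, 1) for _ in range(k)]
        if rule(obs) != (true_value > 0):
            errors += 1
    return errors / trials

# A mildly positive intervention: the threshold rule almost always finds
# 3 dissenters and wrongly refrains, while majority vote is usually right.
print(error_rate(majority_vote, 0.3))
print(error_rate(threshold_rule, 0.3))
```

Symmetrically, a rule with N much larger than half the population would intervene even on clearly net-negative interventions.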
Under the standard setting, the optimizer’s curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer’s curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don’t already account for it).
This is true if “the standard setting” refers to one where you have equally robust evidence of all options. But if you have more robust evidence about some options (which is common), the optimizer’s curse will especially distort estimates of options with less robust evidence. A correct bayesian treatment would then systematically push you towards picking options with more robust evidence.
(Where I’m using “more robust evidence” to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. Where the error driving the optimizer’s curse error is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio pushes you away from it.)
(In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer’s curse, though nowadays I mostly don’t hear it cited at all.)
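A quick simulation of the standard setting (equal true values, iid noise; the specific numbers are my own toy choices) shows the distinction: selecting the max-estimate option remains a fine decision rule, but the naive EV estimate of whatever you selected is biased upward:

```python
import random

random.seed(1)
k, trials = 5, 5000
true_value = 0.0          # all k options are genuinely equal in value
overestimate = 0.0
for _ in range(trials):
    estimates = [true_value + random.gauss(0, 1) for _ in range(k)]
    overestimate += max(estimates) - true_value
# The chosen option's naive estimate exceeds its true value by roughly
# the expected max of k standard normals (~1.16 for k = 5), even though
# the choice itself was as good as any other.
print(overestimate / trials)
```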
Fwiw, I was extremely surprised by OpenAI Five working with just vanilla PPO (with reward shaping and domain randomization), rather than requiring any advances in hierarchical RL. I made one massive update then (in the sense that I immediately started searching for a new model that explained that result; it did take over a year to get to a model I actually liked). I also basically adopted the bio anchors timelines when that report was released (primarily because it agreed with my model, elaborated on it, and then actually calculated out its consequences, which I had never done because it’s actually quite a lot of work). Apart from those two instances I don’t think I’ve had major timeline updates.
I think it’s possible some people are asking these questions disrespectfully, but re: bio anchors, I do think that the report makes a series of assumptions whose plausibility can change over time, and thus your timelines can shift as you reweight different bio anchors scenarios while still believing in bio anchors.
To me, the key update on bio anchors is that I no longer believe the preemptive update against the human lifetime anchor. That update was justified largely on the grounds of “someone could’ve done it already” and “ML is very sample inefficient”, but both grounds seem worth reevaluating: as we get closer, systems like PaLM exhibit capabilities remarkable enough that I’m not sold a different training setup couldn’t be doing really good RL with the same data/compute, implying the bottleneck could just be algorithmic progress; and separately, few-shot learning is now much more common than the many-shot learning of prior ML progress.
I still think that the “number of RL episodes lasting Y seconds with the agent using X flop/s” anchor is a separate good one. While I’m now much less convinced we’ll need the 1e16 flop/s models estimated in bio anchors (and separately, Chinchilla scaling laws plus conservation of expected evidence about further improvements weren’t incorporated into the exponent and should probably shift it down), I think the NN anchors still have predictive value and slightly lengthen timelines.
Also, though, insofar as people are asking you to update on Gato, I agree that makes little sense.
I agree your timelines can and should shift based on evidence even if you continue to believe in the bio anchors framework.
Personally, I completely ignore the genome anchor, and I don’t buy the lifetime anchor or the evolution anchor very much (I think the structure of the neural net anchors is a lot better and more likely to give the right answer).
Animals with smaller brains (like bees) are capable of few-shot learning, so I’m not really sure why observing few-shot learning is much of an update. See e.g. this post.
Essentially, the problem is that ‘evidence that shifts Bio Anchors weightings’ is quite different, more restricted, and much harder to define than the straightforward ‘evidence of impressive capabilities’. However, the reason that I think it’s worth checking if new results are updates is that some impressive capabilities might be ones that shift bio anchors weightings. But impressiveness by itself tells you very little.
I think a lot of people with very short timelines are imagining the only possible alternative view as being ‘another AI winter, scaling laws bend, and we don’t get excellent human-level performance on short-term language-specified tasks anytime soon’, and don’t see the further question of figuring out exactly what human-level performance on e.g. MMLU would imply.
This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn’t another AI winter, rather it’s that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning because that doesn’t come “for free” from competence on short-term tasks, if you’re as sample-inefficient as current ML is.
So what we’re really looking for isn’t systems getting progressively better and better at short-horizon language tasks. That’s something that either the lifetime-anchor Bio Anchors view or the original Bio Anchors view predicts, and we need something that discriminates between the two.
We have some (indirect) evidence that original bio anchors is right: namely that it being wrong implies evolution missed an obvious open goal to make bees and mice generally intelligent long-term planners, and that human beings generally aren’t vastly better than evolution at designing things anyway, so the lifetime anchor would imply that AGI is a glaring exception to this general trend.
As evidence, this has the advantage of being about something that really happened: human beings are the only human-level general intelligence that exists so far, so we have very good reasons to think matching the human brain is sufficient. However, it has the disadvantage of all the usual disanalogies between evolution and its requirements, and human designers and our requirements. Maybe this just is one of those situations where we can outdo evolution: that’s not especially unlikely.
What’s the evidence on the other side (i.e. against original bio anchors and for the lifetime anchor)?
There are two kinds that I tend to hear. One is that short-horizon competence is enough for dangerous/transformative capabilities. E.g. the claim that if you can build something that’s “human level/superhuman at charisma/persuasion/propaganda/manipulation, at least on short timescales” that represents a gigantic existential risk factor that condemns us to disaster further down the line (the AI PONR idea), or that at this point actors with bad incentives will be far too influential/wealthy/advancing the SOTA in AI.
However, I’d consider this changing the subject: essentially it’s not an argument for AGI takeover soon, rather it’s an argument for ‘certain narrow AIs are far more dangerous than you realize’. That means you have to go all the way back to the start and argue for why such things would be catastrophic in the first place. We can’t rely on the simple “it’ll be superintelligent and seize a DSA”.
Suppose we get such narrow AIs, that can do most short-term tasks for which there’s data, but don’t generalize to long horizons consistently. This scenario 10 years from now looks something like: AI automates away lots of jobs, can do certain kinds of short-term persuasion and manipulation, can speed up capabilities and alignment research, but not fully replace human researchers. Some of these AIs are agentic and possibly also misaligned (in ways that are detectable and fall far short of the ability to take over, since by assumption they aren’t competitive with humans at long-term planning). This certainly seems wild and full of potential danger, where slowing down progress could be much harder. It also looks like a scenario with far more attention on AI alignment than today, where the current funders of alignment research are much wealthier than now, and with plenty of obvious examples of what the problem is to catch people’s attention. Overall, it doesn’t seem like a scenario where (current AI alignment researchers + whoever else is working on it in 10 years) have considerably less leverage over the future than now: it could easily be more.
The other reason for favouring the lifetime anchor is you get long-horizon competence for free once you’re excellent at (a given list of) short-horizon tasks. This is arguing, more or less, that for the tasks that matter, current architectures are brainlike in their efficiency, such that the lifetime anchor makes more sense. A lot of the arguments in favour of this have a structure roughly like: look at a wide-ranging comprehension benchmark like MMLU—when an AI is human level on all of this, it’ll be able to keep a train of thought running continuously, keep a working memory and plan over very long timescales the same way humans do.
As evidence, this has the significant advantage of being relevant and not having to deal with the vagaries of what tradeoffs evolution may have made differently to human engineers. It has the disadvantage of being fiction. Or at least evidence that’s not yet been observed. You see AIs getting more and more impressive at a wider range of short-horizon tasks, which is roughly compatible with either view, but you don’t observe the described outcome of them generalizing out to much longer-term tasks than that.
So, to return to the original question, what would count as (additional) evidence in favour of the lifetime anchor? The answer clearly can’t be “nothing”, since if we build AGI in 5 years, that counts.
I think the answer is, anything that looks like unexpectedly cheap, easy, ‘for free’ generalization from relatively shorter to relatively longer horizon tasks (e.g. from single reasoning steps to many reasoning steps) without much fine-tuning.
This is different from many of the other signs of impressiveness we’ve seen recently: just learning lots of shorter-horizon tasks without much transfer between them, being able to point models successfully at particular short-horizon tasks with good prompting, getting much better at a wider range of tasks that can only be done over short horizons. All of these are expected on either view.
This unexpected evidence is very tricky to operationalize. Default bio anchors assumes we’ll see a certain degree of generalizing from shorter to longer horizon tasks, and that we’ll see AI get better and better sample-efficiency on few-shot tasks, since it assumes that in 20 or so years we’ll get enough of such generalization to get AGI. I guess we just need to look for ‘more of it than we expected to see’?
That seems very hard to judge, since you can’t read off predictions about subhuman capabilities from bio anchors like that.
Yeah, this all seems right to me.
It does not seem to me like “can keep a train of thought running” implies “can take over the world” (or even “is comparable to a human”). I guess the idea is that with a train of thought you can do amplification? I’d be pretty surprised if train-of-thought-amplification on models of today (or 5 years from now) led to novel high quality scientific papers, even in fields that don’t require real-world experimentation.
I think this is the best writeup about this I’ve seen, and I agree with the main points, so kudos!
I do think that evidence of increasing returns to scale of multi-step chain of thought prompting is another weak datapoint in favor of the human lifetime anchor.
I also think there are pretty reasonable arguments that NNs may be more efficient than the human brain at converting flops to capabilities, e.g. if SGD is a better version of the best algorithm that can be implemented on biological hardware. Similarly, humans are exposed to a much smaller diversity of data than LMs (the internet is big and weird), and thus they may get more “novelty” per flop and thus generalize better from less data. My main point here is just that “biology is optimal” isn’t as strong a rejoinder when we’re comparing a process so different from what biology did.
Let’s say you’re trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?
(I’m assuming here that you can’t defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)
First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.
Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.
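Putting the two assumptions together (all constants below are illustrative, not from the text): for a binary question, P(true) runs from the random baseline of 50% up toward 100% as log(time) passes a soft threshold:

```python
import math

def p_true(time_spent, t_min=10.0, steepness=2.0):
    # Insights grow ~log(time spent); P(true) is logistic in insights,
    # mapped onto [0.5, 1] so "no insights" means no better than random.
    # t_min is the soft minimum time, steepness is how sharp the
    # transition is; both are made-up illustrative parameters.
    x = steepness * (math.log(time_spent) - math.log(t_min))
    return 0.5 + 0.5 / (1 + math.exp(-x))

for t in [1, 10, 100, 1000]:
    print(t, round(p_true(t), 3))
```

The shape is the point, not the numbers: below t_min you hover near chance, and well above it you saturate.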
This soft minimum on time is going to depend on a bunch of things—how “hard” or “complex” or “high-dimensional” the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually “idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects”, without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be “I think you need to have thought about this for more time than you have before I expect you to do better than random”.
From the Truthful AI paper:
I wish we would stop talking about what is “fair” to expect of AI systems in AI alignment*. We don’t care what is “fair” or “unfair” to expect of the AI system, we simply care about what the AI system actually does. The word “fair” comes along with a lot of connotations, often ones which actively work against our goal.
At least twice I have made an argument where I posed a story in which an AI system fails to an AI safety researcher, and I have gotten the response “but that isn’t fair to the AI system” (because it didn’t have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.
(This sort of thing happens with mesa optimization—if you have two objectives that are indistinguishable on the training data, it’s “unfair” to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn’t change the fact that such an AI system might cause an existential catastrophe.)
In both cases I mentioned that what we care about is actual outcomes, and that you can tell such stories where in actual reality the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It’s not that the people I was talking to didn’t understand the point, it’s that some mental heuristic of “be fair to the AI system” fired and temporarily led them astray.
Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:
* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.
I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn’t/can’t know the truth, compared to a “strict liability” regime.
Consider two methods of thinking:
1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by “rolling out” that model
2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.
I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.
However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.
I think many people on LW tend to use option 1 almost always and my “deference” to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?
Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).
Options 1 & 2 sound to me a lot like inside view and outside view. Fair?
Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven’t explained them here.
EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won’t be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.
Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?
(My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.)
Correct.
I didn’t say you should defer to experts, just that if you try to build gears-y models you’ll be wrong. It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
Good point!
I recently interviewed someone who has a lot of experience predicting systems, and they had 4 steps similar to your two above.
1. Observe the world and see if it’s sufficiently similar to other systems to predict based on intuitionistic analogies.
2. If there’s not a good analogy, understand the first principles, then try to reason about the equilibria of that.
3. If that doesn’t work, assume the world will stay in a stable state, and try to reason from that.
4. If that doesn’t work, figure out the worst-case scenario and plan from there.
I think 1 and 2 are what you do with expertise, and 3 and 4 are what you do without expertise.
Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily “for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed”.
I’m also noticing I mean something slightly different by “expertise” than is typically meant. My intended meaning of “expertise” is more like “you have lots of data and observations about the system”, e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.
I have been doing political betting for a few months and informally compared my success with strategies 1 and 2.
Ex. Predicting the Iranian election
I write down the 10 most important Iranian political actors (Khamenei, Mojtaba, Raisi, a few opposition leaders, the IRGC commanders). I find a public statement about their preferred outcome, and I estimate their power and salience. So Khamenei would be preference = leans Raisi, power = 100, salience = 40. Rouhani would be preference = strong Hemmati, power = 30, salience = 100. Then I find the weighted average position. It’s a bit more complicated because I have to linearize preferences, but yeah.
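As a sketch, the weighted-average step looks like this (positions linearized onto a 0-100 scale where 0 = strong Hemmati and 100 = strong Raisi; the placement of "leans Raisi" at 65 is my own illustrative assumption):

```python
# (position, power, salience) for each actor, using the two examples
# given above; a real model would include all ten actors.
actors = [
    (65, 100, 40),   # Khamenei: leans Raisi (position 65 is assumed)
    (0, 30, 100),    # Rouhani: strong Hemmati
]
weighted_position = (
    sum(pos * power * sal for pos, power, sal in actors)
    / sum(power * sal for _, power, sal in actors)
)
print(round(weighted_position, 1))  # a point on the Hemmati-Raisi axis
```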
The 2-strat is to predict repeated past events. The opposition has won the last three contested elections in surprise victories, so predict the same outcome.
I have found 2 is actually pretty bad. Guess I’m an expert tho.
That seems like a pretty bad 2-strat. Something that has happened three times is not a “stable high-level feature of the world”. (Especially if the preceding time it didn’t happen, which I infer since you didn’t say “the last four contested elections”.)
If that’s the best 2-strat available, I think I would have ex ante said that you should go with a 1-strat.
Haha agreed.
One way to communicate about uncertainty is to provide explicit probabilities, e.g. “I think it’s 20% likely that [...]”, or “I would put > 90% probability on [...]”. Another way to communicate about uncertainty is to use words like “plausible”, “possible”, “definitely”, “likely”, e.g. “I think it is plausible that [...]”.
People seem to treat the words as shorthands for probability statements. I don’t know why you’d do this, it’s losing information and increasing miscommunication for basically no reason—it’s maybe slightly more idiomatic English, but it’s not even much longer to just put the number into the sentence! (And you don’t have to have precise numbers, you can have ranges or inequalities if you want, if that’s what you’re using the words to mean.)
According to me, probabilities are appropriate for making decisions so you can estimate the EV of different actions. (This can also extend to the case where you aren’t making a decision, but you’re talking to someone who might use your advice to make decisions, but isn’t going to understand your reasoning process.) In contrast, words are for describing the state of your reasoning algorithm, which often doesn’t have much to do with probabilities.
A simple model here is that your reasoning algorithm has two types of objects: first, strong constraints that rule out some possible worlds (“An AGI system that is widely deployed will have a massive impact”), and second, an implicit space of imagined scenarios in which each scenario feels similarly unsurprising if you observe it (this is the human emotion of surprise, not the probability-theoretic definition of surprise; one major difference is that the human emotion of surprise often doesn’t increase with additional conjuncts). Then, we can define the meanings of the words as follows:
Plausible: “This is within my space of imagined scenarios.”
Possible: “This isn’t within my space of imagined scenarios, but it doesn’t violate any of my strong constraints.” (Though unfortunately it can also mean “this violates a strong constraint, but I recognize that my constraint may be wrong, so I don’t assign literally zero probability to it”.)
Definitely: “If this weren’t true, that would violate one of my strong constraints.”
Implausible: “This violates one of my strong constraints.” (Given other definitions, this should really be called “impossible”, but unfortunately that word is unavailable (though I think Eliezer uses it this way anyway).)
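The two-object model can be sketched in code. The scenario representation and the example constraint below are toy stand-ins, not a claim about how brains actually represent these things:

```python
# Sketch of the two-object reasoning model: strong constraints that rule
# out worlds, plus a space of imagined scenarios. Representations here
# are toy stand-ins.

def label(scenario, constraints, imagined_scenarios):
    if any(not c(scenario) for c in constraints):
        return "implausible"  # violates a strong constraint
    if scenario in imagined_scenarios:
        return "plausible"    # within the space of imagined scenarios
    return "possible"         # not imagined, but no constraint violated

# Example constraint from the text: "an AGI system that is widely deployed
# will have a massive impact" rules out wide deployment with small impact.
constraints = [lambda s: not (s["widely_deployed"] and not s["massive_impact"])]
imagined = [{"widely_deployed": True, "massive_impact": True}]

print(label({"widely_deployed": True,  "massive_impact": True},  constraints, imagined))   # plausible
print(label({"widely_deployed": False, "massive_impact": False}, constraints, imagined))   # possible
print(label({"widely_deployed": True,  "massive_impact": False}, constraints, imagined))   # implausible
```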
When someone gives you an example scenario, it’s much easier to label it with one of the four words above, because their meanings are much much closer to the native way in which brains reason (at least, for my brain). That’s why I end up using these words rather than always talking in probabilities. To convert these into actual probabilities, I then have to do some additional reasoning in order to take into account things like model uncertainty, epistemic deference, conjunctive fallacy, estimating the “size” of the space of imagined scenarios to figure out the average probability for any given possibility, etc.
I like this, but it feels awkward to say that something can be not inside a space of “possibilities” but still be “possible”. Maybe “possibilities” here should be “imagined scenarios”?
That does seem like better terminology! I’ll go change it now.
I like this experiment! Keep ’em coming.
“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.
Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).
I’m not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?
(Maybe caveating the whole post under “this is not currently well argued, but I wanted to get the ball rolling on having some kind of link”)
That said, my main counterargument is: “Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’”
I think ‘burden of proof’ isn’t quite the right frame but there is something there that still seems important. I think the bad thing comes from distinguishing epistemics vs Overton-norm-fighting, which are in fact separate.
I don’t really want this responsibility, which is part of why I’m doing all of these on the shortform. I’m happy for you to copy it into a top-level post of your own if you want.
I agree this makes sense, but then say “I’m not looking into this because it hasn’t been well argued (and my time/attention is limited)”, rather than “I don’t believe this because it hasn’t been well argued”.
Sometimes people say “look at these past accidents; in these cases there were giant bureaucracies that didn’t care about safety at all, therefore we should be pessimistic about AI safety”. I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.
This is not just one man’s modus ponens—the key issue is the selection effect.
It’s easiest to see with a Bayesian treatment. Let’s say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don’t care about safety? Almost all of them—even if 90% of people care about safety, there will still be some cases where people didn’t care and accidents happened; and of course we’d hear about them if so (and not hear about the cases where accidents didn’t happen). You can get a strong update against 99.9999% and higher, but by the time you’re at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don’t learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many “potential accidents” there could have been).
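A rough numerical version of this update, under an illustrative likelihood model (each of M hypothetical “potential accidents” goes badly iff the responsible person doesn’t care; M is a made-up number):

```python
# Rough numerical version of the update above. Prior: the fraction f of
# people who care about safety is uniform on [0, 1]. Illustrative
# likelihood model: with M independent "potential accidents", each goes
# badly iff the responsible person doesn't care, so
# P(see at least one careless accident | f) = 1 - f**M.

M = 1000  # hypothetical number of potential accidents we'd have heard about

fs = [i / 10000 for i in range(10001)]
likelihood = [1 - f**M for f in fs]
Z = sum(likelihood)

def posterior_mass(lo, hi):
    return sum(l for f, l in zip(fs, likelihood) if lo <= f <= hi) / Z

# The update barely touches f <= 0.9, but crushes f very close to 1:
print(posterior_mass(0.0, 0.9))     # close to the prior mass of 0.9
print(posterior_mass(0.9999, 1.0))  # far below the prior mass
```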
However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn’t have prevented, so I don’t have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and “all we have to do” is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)
Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.
In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.
By “selection”, I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.
Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.
You’ve heard of crucial considerations, but have you heard of red herring considerations?
These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn’t affect anything decision-relevant.
To solve a problem quickly, it’s important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.
For example, it might seem like “what is the right system of ethics” is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.
Here’s an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).
Alternate names: sham considerations? insignificant considerations?
When you make an argument about a person or group of people, often a useful thought process is “can I apply this argument to myself or a group that includes me? If this isn’t a type error, but I disagree with the conclusion, what’s the difference between me and them that makes the argument apply to them but not me? How convinced am I that they actually differ from me on this axis?”
An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:
In small populations, even strong incentives for X may not get you much more of X, since there isn’t a large enough population for there to be much deviation on X to select on.
It’s pretty pointless to tell individual people to “buck the incentives”: even if they are principled people who try to avoid doing bad things, if they take your advice they probably just get selected against.
Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made—rather than thinking of a comment thread as “this is trying to ascertain whether X is true”, they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.
I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).
Let’s say we’re talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.
Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here’s a fully general counterargument that Alice is wrong:
Decompose P into a series of conjunctions Q1, Q2, … Qn, with n > 10. (You can first decompose not-P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)
Ask Alice to estimate P(Qk | Q1, Q2, … Q{k-1}) for all k.
At least one of these must be over 99% (if we have n = 11 and they were all 99%, then the probability of P would be 0.99 ^ 11 ≈ 89.5%, which contradicts the original 90%).
Argue that Alice can’t possibly have enough knowledge to place under 1% on the negation of the statement.
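The pigeonhole step in the argument can be checked directly:

```python
# Check of the arithmetic behind the fully general counterargument:
# if P decomposes into n = 11 conjuncts and each conditional probability
# were only 0.99, the conjunction would already be below Alice's stated
# 90%, so at least one conditional must exceed 0.99.

p_each = 0.99
n = 11
conjunction = p_each ** n
print(round(conjunction, 3))  # 0.895, which is < 0.9
```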
----
What’s the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.
An argument form that I like:
The form: “Under assumption Y, your argument for X fails, because Z.” I think this should be convincing even if Y is false, unless you can explain why your argument for X does not work under assumption Y.
An example: any AI safety story (X) should also work if you assume that the AI does not have the ability to take over the world during training (Y).
Trying to follow this. Doesn’t the Y (AI not taking over the world during training) make it less likely that X (AI will take over the world at all)?
Which seems to contradict the argument structure. Perhaps you can give a few more examples to make more clear the structure?
In that example, X is “AI will not take over the world”, so Y makes X more likely. So if someone comes to me and says “If we use <technique>, then AI will be safe”, I might respond, “well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>”.
I don’t think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I’m explicitly trying for this to be a low-effort thing, so I’m not going to try to write more examples now.
EDIT: Actually, the double descent comment below has a similar structure, where X = “double descent occurs because we first fix bad errors and then regularize”, and Y = “we’re using an MLP / CNN with relu activations and vanilla gradient descent”.
In fact, the AUP power comment does this too, where X = “we can penalize power by penalizing the ability to gain reward”, and Y = “the environment is deterministic, has a true noop action, and has a state-based reward”.
Maybe another way to say this is:
I endorse applying the “X proves too much” argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an “X proves too much” argument to an impossible scenario.)
“Minimize AI risk” is not the same thing as “maximize the chance that we are maximally confident that the AI is safe”. (Somewhat related comment thread.)
The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then go with the majority vote.
A particular flawed response is to look for N opinions that say “intervening is net negative” and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)
However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.
I didn’t get it… is the problem with the “look for N opinions” response that you aren’t computing the denominator (|”intervening is positive”| + |”intervening is negative”|)?
Yes, that’s the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.
(This is under the simple model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
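Under that simple model, the majority vote and the flawed “find N negative opinions” rule can be compared by simulation; the population size, N, and true value below are illustrative:

```python
import random

# Compare majority vote with the flawed "intervene unless we can find N
# opinions saying it's net negative" rule, under the simple model from
# the text: each of `pop` people observes true_value + iid Gaussian noise.
# All numbers are illustrative.

random.seed(0)

def simulate(true_value, pop, N, trials=10000):
    majority_interventions = flawed_interventions = 0
    for _ in range(trials):
        opinions = [true_value + random.gauss(0, 1) for _ in range(pop)]
        negatives = sum(o < 0 for o in opinions)
        if negatives < pop / 2:  # majority thinks it's net positive
            majority_interventions += 1
        if negatives < N:        # couldn't find N negative opinions
            flawed_interventions += 1
    return majority_interventions / trials, flawed_interventions / trials

# Net-positive intervention with N << population / 2: the flawed rule
# almost never intervenes, while majority vote usually does.
maj, flawed = simulate(true_value=0.2, pop=21, N=3)
print(maj, flawed)
```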
Under the standard setting, the optimizer’s curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer’s curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don’t already account for it).
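A quick simulation of the standard setting (equal iid Gaussian noise on every option’s estimate, with illustrative parameters): equal shrinkage preserves the ranking, so the naive argmax matches the Bayesian-corrected choice, but the naive estimate of the chosen option’s EV is biased upward:

```python
import random

# Optimizer's curse in the standard setting: every option's EV estimate
# has the same iid Gaussian noise. A Bayesian correction shrinks all
# estimates equally, preserving the ranking, so the naive argmax is the
# same decision; but the naive estimate of the chosen option's EV is
# biased upward. Parameters are illustrative.

random.seed(0)

def trial(n_options=5, noise_sd=1.0):
    true_values = [random.gauss(0, 1) for _ in range(n_options)]
    estimates = [v + random.gauss(0, noise_sd) for v in true_values]
    best = max(range(n_options), key=lambda i: estimates[i])
    return estimates[best] - true_values[best]  # overestimate of chosen option

bias = sum(trial() for _ in range(20000)) / 20000
print(round(bias, 2))  # positive: the chosen option's EV is overestimated
```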
This is true if “the standard setting” refers to one where you have equally robust evidence of all options. But if you have more robust evidence about some options (which is common), the optimizer’s curse will especially distort estimates of options with less robust evidence. A correct bayesian treatment would then systematically push you towards picking options with more robust evidence.
(Where I’m using “more robust evidence” to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. The error driving the optimizer’s curse is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio pushes you away from it.)
Agreed.
(In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer’s curse, though nowadays I mostly don’t hear it cited at all.)