I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn’t exist any such thing. Often my reaction is “if only there was time to write an LW post that I can then link to in the future”. So far I’ve just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I’m now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they’re even understandable.
<unfair rant with the goal of shaking people out of a mindset>
To all of you telling me or expecting me to update to shorter timelines given <new AI result>: have you ever encountered Bayesianism?
Surely if you did, you’d immediately reason that you couldn’t know how I would update, without first knowing what I expected to see in advance. Which you very clearly don’t know. How on earth could you know which way I should update upon observing this new evidence? In fact, why do you even care about which direction I update? That too shouldn’t give you much evidence if you don’t know what I expected in the first place.
Maybe I should feel insulted? That you think so poorly of my reasoning ability that I should be updating towards shorter timelines every time some new advance in AI comes out, as though I hadn’t already priced that into my timeline estimates, and so would predictably update towards shorter timelines in violation of conservation of expected evidence? But that only follows if I expect you to be a good reasoner modeling me as a bad reasoner, which probably isn’t what’s going on.
</unfair rant>
My actual guess is that people notice a discrepancy between their very-short timelines and my somewhat-short timelines, and then they want to figure out what causes this discrepancy, and an easily-available question is “why doesn’t X imply short timelines” and then for some reason that I still don’t understand they instead substitute the much worse question of “why didn’t you update towards short timelines on X” without noticing its major flaws.
Fwiw, I was extremely surprised by OpenAI Five working with just vanilla PPO (with reward shaping and domain randomization), rather than requiring any advances in hierarchical RL. I made one massive update then (in the sense that I immediately started searching for a new model that explained that result; it did take over a year to get to a model I actually liked). I also basically adopted the bio anchors timelines when that report was released (primarily because it agreed with my model, elaborated on it, and then actually calculated out its consequences, which I had never done because it’s actually quite a lot of work). Apart from those two instances I don’t think I’ve had major timeline updates.
I think it’s possible some people are asking these questions disrespectfully, but re: bio anchors, I do think that the report makes a series of assumptions whose plausibility can change over time, and thus your timelines can shift as you reweight different bio anchors scenarios while still believing in bio anchors.
To me, the key update on bio anchors is that I no longer believe the preemptive update against the human lifetime anchor. That update was justified largely on the grounds of “someone could’ve done it already” and “ML is very sample inefficient”, but both of those now seem worth reevaluating. As we get closer, systems like PaLM exhibit capabilities remarkable enough that I’m not sold that a different training setup couldn’t be doing really good RL with the same data/compute, which would imply that the bottleneck is just algorithmic progress. Separately, few-shot learning is now much more common than the many-shot learning of prior ML progress.
I still think that the “number of RL episodes lasting Y seconds with the agent using X flop/s” anchor is a separate good one. And while I’m now much less convinced we’ll need the 1e16 flop/s models estimated in bio anchors (and separately, Chinchilla scaling laws plus conservation of expected evidence about further improvements weren’t incorporated into the exponent and should probably shift it down), I think the NN anchors still have predictive value and slightly lengthen timelines.
Also, though, insofar as people are asking you to update on Gato, I agree that makes little sense.
I agree your timelines can and should shift based on evidence even if you continue to believe in the bio anchors framework.
Personally, I completely ignore the genome anchor, and I don’t buy the lifetime anchor or the evolution anchor very much (I think the structure of the neural net anchors is a lot better and more likely to give the right answer).
Animals with smaller brains (like bees) are capable of few-shot learning, so I’m not really sure why observing few-shot learning is much of an update. See e.g. this post.
Essentially, the problem is that ‘evidence that shifts Bio Anchors weightings’ is quite different, more restricted, and much harder to define than the straightforward ‘evidence of impressive capabilities’. However, the reason that I think it’s worth checking if new results are updates is that some impressive capabilities might be ones that shift bio anchors weightings. But impressiveness by itself tells you very little.
I think a lot of people with very short timelines are imagining the only possible alternative view as being ‘another AI winter, scaling laws bend, and we don’t get excellent human-level performance on short term language-specified tasks anytime soon’, and don’t see the further question of figuring out exactly what human-level performance on e.g. MMLU would imply.
This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn’t another AI winter, rather it’s that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning because that doesn’t come “for free” from competence on short-term tasks, if you’re as sample-inefficient as current ML is.
So what we’re really looking for isn’t systems getting progressively better and better at short-horizon language tasks. That’s something that either the lifetime-anchor Bio Anchors view or the original Bio Anchors view predicts, and we need something that discriminates between the two.
We have some (indirect) evidence that original bio anchors is right: namely that it being wrong implies evolution missed an obvious open goal to make bees and mice generally intelligent long term planners, and that human beings generally aren’t vastly better than evolution at designing things anyway, and the lifetime anchor would imply that AGI is a glaring exception to this general trend.
As evidence, this has the advantage of being about something that really happened: human beings are the only human-level general intelligence that exists so far, so we have very good reasons to think matching the human brain is sufficient. However, it has the disadvantage of all the usual disanalogies between evolution and its requirements, and human designers and our requirements. Maybe this just is one of those situations where we can outdo evolution: that’s not especially unlikely.
What’s the evidence on the other side (i.e. against original bio anchors and for the lifetime anchor)?
There are two kinds that I tend to hear. One is that short-horizon competence is enough for dangerous/transformative capabilities. E.g. the claim that if you can build something that’s “human level/superhuman at charisma/persuasion/propaganda/manipulation, at least on short timescales” that represents a gigantic existential risk factor that condemns us to disaster further down the line (the AI PONR idea), or that at this point actors with bad incentives will be far too influential/wealthy/advancing the SOTA in AI.
However, I’d consider this changing the subject: essentially it’s not an argument for AGI takeover soon, rather it’s an argument for ‘certain narrow AIs are far more dangerous than you realize’. That means you have to go all the way back to the start and argue for why such things would be catastrophic in the first place. We can’t rely on the simple “it’ll be superintelligent and seize a DSA”.
Suppose we get such narrow AIs, that can do most short-term tasks for which there’s data, but don’t generalize to long horizons consistently. This scenario 10 years from now looks something like: AI automates away lots of jobs, can do certain kinds of short-term persuasion and manipulation, can speed up capabilities and alignment research, but not fully replace human researchers. Some of these AIs are agentic and possibly also misaligned (in ways that are detectable and fall far short of the ability to take over, since by assumption they aren’t competitive with humans at long-term planning). This certainly seems wild and full of potential danger, where slowing down progress could be much harder. It also looks like a scenario with far more attention on AI alignment than today, where the current funders of alignment research are much wealthier than now, and with plenty of obvious examples of what the problem is to catch people’s attention. Overall, it doesn’t seem like a scenario where (current AI alignment researchers + whoever else is working on it in 10 years) have considerably less leverage over the future than now: it could easily be more.
The other reason for favouring the lifetime anchor is that you get long-horizon competence for free once you’re excellent at (a given list of) short-horizon tasks. This is arguing, more or less, that for the tasks that matter, current architectures are brainlike in their efficiency, such that the lifetime anchor makes more sense. A lot of the arguments in favour of this have a structure roughly like: look at a wide-ranging comprehension benchmark like MMLU—when an AI is human level on all of this, it’ll be able to keep a train of thought running continuously, keep a working memory and plan over very long timescales the same way humans do.
As evidence, this has the significant advantage of being relevant and not having to deal with the vagaries of what tradeoffs evolution may have made differently to human engineers. It has the disadvantage of being fiction. Or at least evidence that’s not yet been observed. You see AIs getting more and more impressive at a wider range of short-horizon tasks, which is roughly compatible with either view, but you don’t observe the described outcome of them generalizing out to much longer-term tasks than that.
So, to return to the original question, what would count as (additional) evidence in favour of the lifetime anchor? The answer clearly can’t be “nothing”, since if we build AGI in 5 years, that counts.
I think the answer is, anything that looks like unexpectedly cheap, easy, ‘for free’ generalization from relatively shorter to relatively longer horizon tasks (e.g. from single reasoning steps to many reasoning steps) without much fine-tuning.
This unexpected evidence is very tricky to operationalize. Default bio anchors assumes we’ll see a certain degree of generalizing from shorter to longer horizon tasks, and that we’ll see AI get better and better sample-efficiency on few-shot tasks, since it assumes that in 20 or so years we’ll get enough of such generalization to get AGI. I guess we just need to look for ‘more of it than we expected to see’?
That seems very hard to judge, since you can’t read off predictions about subhuman capabilities from bio anchors like that.
when an AI is human level on all of this, it’ll be able to keep a train of thought running continuously.
It does not seem to me like “can keep a train of thought running” implies “can take over the world” (or even “is comparable to a human”). I guess the idea is that with a train of thought you can do amplification? I’d be pretty surprised if train-of-thought-amplification on models of today (or 5 years from now) led to novel high quality scientific papers, even in fields that don’t require real-world experimentation.
I think this is the best writeup about this I’ve seen, and I agree with the main points, so kudos!
I do think that evidence of increasing returns to scale of multi-step chain-of-thought prompting is another weak datapoint in favor of the human lifetime anchor.
I also think there are pretty reasonable arguments that NNs may be more efficient than the human brain at converting flops to capabilities, e.g. if SGD is a better version of the best algorithm that can be implemented on biological hardware. Similarly, humans are exposed to a much smaller diversity of data than LMs (the internet is big and weird), and thus they may get more “novelty” per flop and thus generalize better from less data. My main point here is just that “biology is optimal” isn’t as strong a rejoinder when we’re comparing a process so different from what biology did.
Let’s say you’re trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?
(I’m assuming here that you can’t defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)
First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.
Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.
This soft minimum on time is going to depend on a bunch of things—how “hard” or “complex” or “high-dimensional” the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
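Here is a minimal sketch of this toy model; the functional form and every parameter (insights per doubling, midpoint, steepness, a binary question with a 50% baseline) are illustrative assumptions rather than anything the argument above pins down.

```python
import numpy as np

def num_insights(time_spent):
    """Logarithmic returns: roughly one new insight per doubling of time spent."""
    return np.log2(1.0 + time_spent)

def p_true(time_spent, baseline=0.5, midpoint=6.0, steepness=1.0):
    """P(your answer is true): logistic in the number of insights, hence in log(time).

    baseline -- chance of being right by luck (0.5 for a binary question).
    midpoint -- number of insights at which you're halfway between random and certain;
                the region below this is the "soft minimum" on time spent.
    """
    x = num_insights(time_spent)
    return baseline + (1 - baseline) / (1 + np.exp(-steepness * (x - midpoint)))

for hours in [1, 10, 100, 1000, 10000]:
    print(f"{hours:>6} hours -> P(true) ~ {p_true(hours):.2f}")
```

The only takeaway is the shape: below the soft minimum you hover near chance, and each further doubling of time buys roughly the same amount of improvement.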
A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually “idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects”, without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be “I think you need to have thought about this for more time than you have before I expect you to do better than random”.
If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.
I wish we would stop talking about what is “fair” to expect of AI systems in AI alignment*. We don’t care what is “fair” or “unfair” to expect of the AI system, we simply care about what the AI system actually does. The word “fair” comes along with a lot of connotations, often ones which actively work against our goal.
At least twice I have made an argument where I posed, to an AI safety researcher, a story in which an AI system fails, and I have gotten the response “but that isn’t fair to the AI system” (because it didn’t have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.
(This sort of thing happens with mesa optimization—if you have two objectives that are indistinguishable on the training data, it’s “unfair” to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn’t change the fact that such an AI system might cause an existential catastrophe.)
In both cases I mentioned that what we care about is actual outcomes, and that you can tell such stories where in actual reality the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It’s not that the people I was talking to didn’t understand the point, it’s that some mental heuristic of “be fair to the AI system” fired and temporarily led them astray.
Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:
If all information pointed towards a statement being true when it was made, then it would appear that the AI system was displaying the behavior we would see from the desired algorithm, and so a positive reward would be more appropriate than a negative reward, despite the fact that the AI system produced a false statement. Similarly, if the AI system cannot recognize the statement as a potential falsehood, providing a negative reward may just add noise to the gradient rather than making the system more truthful.
* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.
I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn’t/can’t know the truth, compared to a “strict liability” regime.
1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by “rolling out” that model
2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.
I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.
However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.
I think many people on LW tend to use option 1 almost always and my “deference” to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?
Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).
Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven’t explained them here.
EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won’t be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.
Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?
(My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.)
Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?
Correct.
My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.
I didn’t say you should defer to experts, just that if you try to build gears-y models you’ll be wrong. It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily “for reasonably complex systems, if you try to do 1 without expertise, you will fail, but you may not realize you have failed”.
I’m also noticing I mean something slightly different by “expertise” than is typically meant. My intended meaning of “expertise” is more like “you have lots of data and observations about the system”, e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.
I have been doing political betting for a few months and informally compared my success with strategies 1 and 2.
Ex. Predicting the Iranian election
I write down the 10 most important Iranian political actors (Khamenei, Mojtaba, Raisi, a few opposition leaders, the IRGC commanders). I find a public statement about their preferred outcome, and I estimate their power and salience. So Khamenei would be preference = leans Raisi, power = 100, salience = 40. Rouhani would be preference = strong Hemmati, power = 30, salience = 100. Then I find the weighted average position. It’s a bit more complicated because I have to linearize preferences, but yeah.
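To illustrate, here’s a minimal sketch of that weighted-average calculation; mapping preferences onto a single 0–100 axis is the “linearize preferences” step, and everything other than the two examples given above is a placeholder.

```python
# Toy version of the strategy-1 calculation above. Positions are on a 0-100 axis
# (0 = strong Hemmati, 100 = strong Raisi); that mapping and the third entry are
# placeholders, while the first two rows use the power/salience numbers from the comment.
actors = [
    # (name, position, power, salience)
    ("Khamenei", 65, 100, 40),   # "leans Raisi"
    ("Rouhani",   0,  30, 100),  # "strong Hemmati"
    ("IRGC",     90,  60, 30),   # placeholder entry; add the other ~8 actors similarly
]

def weighted_position(actors):
    """Average each actor's position, weighted by power * salience."""
    total = sum(power * salience for _, _, power, salience in actors)
    return sum(pos * power * salience for _, pos, power, salience in actors) / total

print(f"Predicted outcome (0 = Hemmati, 100 = Raisi): {weighted_position(actors):.1f}")
```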
The 2-strat is to predict repeated past events. The opposition has won the last three contested elections in surprise victories, so predict the same outcome.
I have found 2 is actually pretty bad. Guess I’m an expert tho.
The opposition has won the last three contested elections in surprise victories, so predict the same outcome.
That seems like a pretty bad 2-strat. Something that has happened three times is not a “stable high-level feature of the world”. (Especially if the preceding time it didn’t happen, which I infer since you didn’t say “the last four contested elections”.)
If that’s the best 2-strat available, I think I would have ex ante said that you should go with a 1-strat.
One way to communicate about uncertainty is to provide explicit probabilities, e.g. “I think it’s 20% likely that [...]”, or “I would put > 90% probability on [...]”. Another way to communicate about uncertainty is to use words like “plausible”, “possible”, “definitely”, “likely”, e.g. “I think it is plausible that [...]”.
People seem to treat the words as shorthands for probability statements. I don’t know why you’d do this, it’s losing information and increasing miscommunication for basically no reason—it’s maybe slightly more idiomatic English, but it’s not even much longer to just put the number into the sentence! (And you don’t have to have precise numbers, you can have ranges or inequalities if you want, if that’s what you’re using the words to mean.)
According to me, probabilities are appropriate for making decisions so you can estimate the EV of different actions. (This can also extend to the case where you aren’t making a decision, but you’re talking to someone who might use your advice to make decisions, but isn’t going to understand your reasoning process.) In contrast, words are for describing the state of your reasoning algorithm, which often doesn’t have much to do with probabilities.
A simple model here is that your reasoning algorithm has two types of objects: first, strong constraints that rule out some possible worlds (“An AGI system that is widely deployed will have a massive impact”), and second, an implicit space of imagined scenarios in which each scenario feels similarly unsurprising if you observe it (this is the human emotion of surprise, not the probability-theoretic definition of surprise; one major difference is that the human emotion of surprise often doesn’t increase with additional conjuncts). Then, we can define the meanings of the words as follows:
Plausible: “This is within my space of imagined scenarios.”
Possible: “This isn’t within my space of imagined scenarios, but it doesn’t violate any of my strong constraints.” (Though unfortunately it can also mean “this violates a strong constraint, but I recognize that my constraint may be wrong, so I don’t assign literally zero probability to it”.)
Definitely: “If this weren’t true, that would violate one of my strong constraints.”
Implausible: “This violates one of my strong constraints.” (Given other definitions, this should really be called “impossible”, but unfortunately that word is unavailable (though I think Eliezer uses it this way anyway).)
When someone gives you an example scenario, it’s much easier to label it with one of the four words above, because their meanings are much much closer to the native way in which brains reason (at least, for my brain). That’s why I end up using these words rather than always talking in probabilities. To convert these into actual probabilities, I then have to do some additional reasoning in order to take into account things like model uncertainty, epistemic deference, conjunctive fallacy, estimating the “size” of the space of imagined scenarios to figure out the average probability for any given possibility, etc.
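A rough sketch of how those labels cash out if you treat the space of imagined scenarios as a set and the strong constraints as predicates (this is my own toy formalization of the model above, not anything precise):

```python
def label(scenario, imagined_scenarios, strong_constraints):
    """Map a proposed scenario to one of the words defined above."""
    if scenario in imagined_scenarios:
        return "plausible"
    if any(not holds(scenario) for holds in strong_constraints):
        return "implausible"   # violates a strong constraint
    return "possible"          # not imagined, but no constraint rules it out
# "Definitely X" would then mean: every scenario in which X fails gets labeled "implausible".

# Toy usage: scenarios are strings; one constraint rules out "widely deployed AGI, no impact".
constraints = [lambda s: s != "widely deployed AGI, no impact"]
imagined = {"slow takeoff with broad deployment"}
print(label("slow takeoff with broad deployment", imagined, constraints))           # plausible
print(label("fast takeoff via recursive self-improvement", imagined, constraints))  # possible
print(label("widely deployed AGI, no impact", imagined, constraints))               # implausible
```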
I like this, but it feels awkward to say that something can be not inside a space of “possibilities” but still be “possible”. Maybe “possibilities” here should be “imagined scenarios”?
“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.
Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).
I’m not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?
(Maybe caveating the whole post under “this is not currently well argued, but I wanted to get the ball rolling on having some kind of link”)
That said, my main counterargument is: “Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’”
I think ‘burden of proof’ isn’t quite the right frame but there is something there that still seems important. I think the bad thing comes from distinguishing epistemics vs Overton-norm-fighting, which are in fact separate.
maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?
I don’t really want this responsibility, which is part of why I’m doing all of these on the shortform. I’m happy for you to copy it into a top-level post of your own if you want.
Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’
I agree this makes sense, but then say “I’m not looking into this because it hasn’t been well argued (and my time/attention is limited)”, rather than “I don’t believe this because it hasn’t been well argued”.
Sometimes people say “look at these past accidents; in these cases there were giant bureaucracies that didn’t care about safety at all, therefore we should be pessimistic about AI safety”. I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.
It’s easiest to see with a Bayesian treatment. Let’s say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don’t care about safety? Almost all of them—even if 90% of people care about safety, there will still be some cases where people didn’t care and accidents happened; and of course we’d hear about them if so (and not hear about the cases where accidents didn’t happen). You can get a strong update against 99.9999% and higher, but by the time you’re at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don’t learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many “potential accidents” there could have been).
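As a toy version of that calculation (the number of “potential accidents” is a made-up figure purely for illustration):

```python
# Suppose (illustratively) there were 10,000 independent "potential accidents", each
# handled by a decision-maker who cares about safety with probability p, and we hear
# about an accident whenever someone didn't care. How surprising is our observation?
N = 10_000
for p in [0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999]:
    likelihood = 1 - p ** N   # P(at least one "nobody cared" accident | p)
    print(f"p = {p:<8}  P(we see such accidents) = {likelihood:.4f}")
# Starting from a uniform prior over p, this observation barely penalizes p = 0.9 or
# even p = 0.999; only the very highest values of p take a serious hit.
```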
However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn’t have prevented, so I don’t have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and “all we have to do” is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)
Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.
In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.
By “selection”, I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.
Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.
You’ve heard of crucial considerations, but have you heard of red herring considerations?
These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn’t affect anything decision-relevant.
To solve a problem quickly, it’s important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.
For example, it might seem like “what is the right system of ethics” is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.
Here’s an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).
When you make an argument about a person or group of people, often a useful thought process is “can I apply this argument to myself or a group that includes me? If this isn’t a type error, but I disagree with the conclusion, what’s the difference between me and them that makes the argument apply to them but not me? How convinced I am that they actually differ from me on this axis?”
An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:
In small populations, even strong incentives for X may not get you much more of X, since there isn’t a large enough population for there to be much deviation on X to select on.
It’s pretty pointless to tell individual people to “buck the incentives”, even if they are principled people who try to avoid doing bad things, if they take your advice they probably just get selected against.
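A quick illustration of the first consequence: if an incentive works by selecting the most-X candidate available, how much X you get is capped by the spread present in the pool, which is small for small pools. (The standard-normal trait values and the pool sizes here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 1_000

# How much X do you get by picking the single most-X candidate from pools of various sizes?
for pool_size in [3, 10, 100, 10_000]:
    best = rng.standard_normal((trials, pool_size)).max(axis=1).mean()
    print(f"pool of {pool_size:>6}: best candidate is ~{best:.1f} SDs above average on X")
```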
Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made—rather than thinking of a comment thread as “this is trying to ascertain whether X is true”, they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.
I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).
Let’s say we’re talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.
Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here’s a fully general counterargument that Alice is wrong:
Decompose P into a series of conjunctions Q1, Q2, … Qn, with n > 10. (You can first decompose not-P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)
Ask Alice to estimate P(Qk | Q1, Q2, … Q{k-1}) for all k.
At least one of these must be over 99% (if we have n = 11 and they were all 99%, then the probability of P would be 0.99^11 ≈ 89.5%, which contradicts the original 90%).
Argue that Alice can’t possibly have enough knowledge to place under 1% on the negation of the statement.
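The arithmetic behind step 3, spelled out:

```python
n = 11
print(0.99 ** n)        # ~0.8953: eleven conjuncts at 99% each already fall below 90%
print(0.90 ** (1 / n))  # ~0.9905: the (geometric) average confidence per conjunct needed for 90%
```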
----
What’s the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.
In that example, X is “AI will not take over the world”, so Y makes X more likely. So if someone comes to me and says “If we use <technique>, then AI will be safe”, I might respond, “well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>”.
I don’t think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I’m explicitly trying for this to be a low-effort thing, so I’m not going to try to write more examples now.
EDIT: Actually, the double descent comment below has a similar structure, where X = “double descent occurs because we first fix bad errors and then regularize”, and Y = “we’re using an MLP / CNN with relu activations and vanilla gradient descent”.
In fact, the AUP power comment does this too, where X = “we can penalize power by penalizing the ability to gain reward”, and Y = “the environment is deterministic, has a true noop action, and has a state-based reward”.
Maybe another way to say this is:
I endorse applying the “X proves too much” argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an “X proves too much” argument to an impossible scenario.)
“Minimize AI risk” is not the same thing as “maximize the chance that we are maximally confident that the AI is safe”. (Somewhat related comment thread.)
The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then do the majority vote.
A particular flawed response is to look for N opinions that say “intervening is net negative” and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)
However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.
I didn’t get it… is the problem with the “look for N opinions” response that you aren’t computing the denominator (|”intervening is positive”| + |”intervening is negative”|)?
Yes, that’s the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.
(This is under the simple model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
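A minimal simulation of that model, comparing majority vote against the flawed “look for N negative opinions” rule (the population size, values of N, noise level, and true values are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def intervention_rates(true_value, population=100, N=5, noise_sd=1.0, trials=10_000):
    """Each participant sees true_value + Gaussian noise; compare two decision rules."""
    obs = true_value + noise_sd * rng.standard_normal((trials, population))
    majority = (obs > 0).sum(axis=1) > population / 2   # majority vote
    flawed = (obs < 0).sum(axis=1) < N                  # intervene iff < N negative opinions
    return majority.mean(), flawed.mean()

for N in [5, 95]:
    for v in [-0.2, +0.2]:
        maj, flawed = intervention_rates(v, N=N)
        print(f"N={N:>2}, true value {v:+.1f}: majority intervenes {maj:.2f}, "
              f"'find-N' rule intervenes {flawed:.2f}")
```

With N = 5 (far below population/2) the flawed rule almost never intervenes even when the true value is positive; with N = 95 (far above population/2) it almost always intervenes even when the true value is negative.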
Under the standard setting, the optimizer’s curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer’s curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don’t already account for it).
This is true if “the standard setting” refers to one where you have equally robust evidence of all options. But if you have more robust evidence about some options (which is common), the optimizer’s curse will especially distort estimates of options with less robust evidence. A correct bayesian treatment would then systematically push you towards picking options with more robust evidence.
(Where I’m using “more robust evidence” to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. The error driving the optimizer’s curse is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio pushes you away from it.)
(In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer’s curse, though nowadays I mostly don’t hear it cited at all.)
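A minimal simulation of both points above; the option values and noise levels are illustrative. With equally robust (equally noisy) estimates, taking the argmax inflates your EV estimate of the winner but is still the same decision the naive rule makes; with unevenly robust estimates, the naive rule also starts over-selecting the option with the noisiest evidence.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000

true_ev = np.array([1.0, 1.0, 0.9])   # option 3 is actually slightly worse
scenarios = {
    "equal noise":  np.array([0.5, 0.5, 0.5]),   # equally robust evidence everywhere
    "uneven noise": np.array([0.5, 0.5, 2.0]),   # option 3's evidence is much less robust
}

for name, sd in scenarios.items():
    estimates = true_ev + sd * rng.standard_normal((trials, 3))
    chosen = estimates.argmax(axis=1)             # the naive EV-maximizing decision
    print(f"{name}: naive EV of chosen option = {estimates.max(axis=1).mean():.2f}, "
          f"true EV of chosen option = {true_ev[chosen].mean():.2f}, "
          f"worst option chosen {np.mean(chosen == 2):.0%} of the time")
```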
It’s common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don’t have very good arguments for these worries. Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.
I’ll go through the articles I’ve read that argue for worrying about recommender systems, and explain why I find them unconvincing. I’ve only looked at the ones that are widely read; there are probably significantly better arguments that are much less widely read.
A few sources say that it is bad + it has incredible scale + it should be super easy to solve. (I don’t trust the sources and suspect the authors didn’t check them; I agree there’s huge scale; I don’t see why it should be super easy to solve even if there is a problem, especially given that many of the supposed problems seem to have existed before recommender systems.)
Maybe working on recommender systems would have spillover effects on AI alignment. (This seems dominated by just working directly on AI alignment. Also the core feature of AI alignment is that the AI system deliberately and intentionally does things, and creates plans in new situations that you hadn’t seen before, which is not the case with recommender systems, so I don’t expect many spillover effects.)
I don’t know what the main claim was. Ostensibly it was meant to be “it is bad that companies have monetized human attention since this leads to lots of bad incentives and bad outcomes”. But then so many specific things mentioned have nothing to do with this claim and instead seem to be a vague general “tech companies are bad”. Most egregiously, in section Global effects [01:02:44], Rob argues “WhatsApp doesn’t have ads / recommender systems, so it acts as a control group, but it too has bad outcomes, doesn’t this mean the problem isn’t ads / recommender systems?” and Tristan says “That’s right, WhatsApp is terrible, it’s causing mass lynchings” as though that supports his point.
When Rob made some critique of the main argument, Tristan deflected with an example of tech doing bad things. But it’s always vaguely related, so you think he’s addressing the critique, even though he hasn’t actually. (I’m reminded of the Zootopia strategy for press conferences.) See sections “The messy real world vs. an imagined idealised world [00:38:20]” (Rob: weren’t negative things happening before social media? Tristan: it’s easy to fake credibility in text), “The persuasion apocalypse [00:47:46]” (Rob: can’t one-on-one conversations be persuasive too? Tristan: you can lie in political ads), “Revolt of the Public [00:56:48]” (Rob: doesn’t the internet allow ordinary people to challenge established institutions in good ways? Tristan: Alex Jones has been recommended 15 billion times.)
US politics [01:13:32] is a rare counterexample, where Rob says “why aren’t other countries getting polarized”, and Tristan replies “since it’s a positive feedback loop only countries with high initial polarization will see increasing polarization”. It’s not a particularly convincing response, but at least it’s a response.
Tristan seems to be very big on “the tech companies changed what they were doing, that proves we were right”. I think it is just as consistent to say “we yelled at the companies a lot and got the public to yell at them too, and that caused a change, regardless of whether the problem was serious or not, or whether the solution was net positive or not”.
The second half of the podcast focuses more on solutions. Given that I am unconvinced about the problem, I wasn’t all that interested, but it seemed generally reasonable.
(This post responds to the object level claims, which I have not done because I don’t know much about the object level.)
There’s also the documentary “The Social Dilemma”, but I expect it’s focused entirely on problems, probably doesn’t try to have good rigorous statistics, and surely will make no attempt at a cost-benefit analysis so I seriously doubt it would change my mind on anything. (And it is associated with Tristan Harris so I’d assume that most of the relevant details would have made it into the 80K podcast.)
Recommender systems are still influential, and you could want to work on them just because of their huge scale. I like Designing Recommender Systems to Depolarize as an example of what this might look like.
Thanks for this Rohin. I’ve been trying to raise awareness about the potential dangers of persuasion/propaganda tools, but you are totally right that I haven’t actually done anything close to a rigorous analysis. I agree with what you say here that a lot of the typical claims being thrown around seem based more on armchair reasoning than hard data. I’d love to see someone really lay out the arguments and analyze them… My current take is that (some of) the armchair theories seem pretty plausible to me, such that I’d believe them unless the data contradicts. But I’m extremely uncertain about this.
I’ve been trying to raise awareness about the potential dangers of persuasion/propaganda tools
I should note that there’s a big difference between “recommender systems cause polarization as a side effect of optimizing for engagement” and “we might design tools that explicitly aim at persuasion / propaganda”. I’m confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.
My current take is that (some of) the armchair theories seem pretty plausible to me, such that I’d believe them unless the data contradicts.
Usually, for any sufficiently complicated question (which automatically includes questions about the impact of technologies used by billions of people, since people are so diverse), I think an armchair theory is only slightly better than a monkey throwing darts, so I’m more in the position of “yup, sounds plausible, but that doesn’t constrain my beliefs about what the data will show and medium quality data will trump the theory no matter how it comes out”.
I should note that there’s a big difference between “recommender systems cause polarization as a side effect of optimizing for engagement” and “we might design tools that explicitly aim at persuasion / propaganda”. I’m confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.
Oh, then maybe we don’t actually disagree that much! I am not at all confident that optimizing for engagement has the side effect of increasing polarization. It seems plausible but it’s also totally plausible that polarization is going up for some other reason(s). My concern (as illustrated in the vignette I wrote) is that we seem to be on a slippery slope to a world where persuasion/propaganda is more effective and widespread than it has been historically, thanks to new AI and big data methods. My model is: ideologies and other entities have always been using propaganda of various kinds, and there’s always been a race between improving propaganda tech and improving truth-finding tech. But we are currently in a big AI boom, and in particular in a Big Data and Natural Language Processing boom, and this seems like it’ll be a big boost to propaganda tech. Unfortunately I can’t think of ways in which it will correspondingly boost truth-finding-ness across society: while it can be used to make truth-finding tech (e.g. prediction markets, fact-checkers, etc.), it seems like most people in practice just don’t want to adopt truth-finding tech. It’s true that we could design a different society/culture that used all this awesome new tech to be super truth-seeking and have a very epistemically healthy discourse, but it seems like we are not about to do that anytime soon; instead we are going in the opposite direction.
I think that story involves lots of assumptions I don’t immediately believe (but don’t disbelieve either):
People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
Such people will quickly realize that AI will be very useful for this
They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
The resulting AI system will in fact be very good at persuasion / propaganda
AI that fights persuasion / propaganda either won’t be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can’t keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won’t be true with AI)
And probably there are a bunch of other assumptions I haven’t even thought to question.
I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be “raise awareness”, it should be “figure out whether the assumptions are justified”.
I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be “raise awareness”, it should be “figure out whether the assumptions are justified”.
That’s all I’m trying to do at this point, to be clear. Perhaps “raise awareness” was the wrong choice of phrase.
Re: the object-level points: For how I see this going, see my vignette, and my reply to steve. The bullet points you put here make it seem like you have a different story in mind. [EDIT: But I agree with you that it’s all super unclear and more research is needed to have confidence in any of this.]
That’s all I’m trying to do at this point, to be clear.
Excellent :)
For how I see this going, see my vignette, and my reply to steve.
(Link is broken, but I found the comment.) After reading that reply I still feel like it involves the assumptions I mentioned above.
Maybe your point is that your story involves “silos” of Internet-space within which particular ideologies / propaganda reign supreme. I don’t really see that as changing my object-level points very much but perhaps I’m missing something.
I was confusing, sorry—what I meant was, technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is… loaded? Designed to make them seem implausible? idk, something like that, in a way that made me wonder if you had a different story in mind. Going through them one by one:
People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
This is already happening in 2021 and previous, in my story it happens more.
Such people will quickly realize that AI will be very useful for this
Again, this is already happening.
They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
Plenty of people are already raising a moral outcry. In my story these people don’t succeed in getting it banned, but I agree the story could be wrong. I hope it is!
The resulting AI system will in fact be very good at persuasion / propaganda
Yep. I don’t have hard evidence, but intuitively this feels like the sort of thing today’s AI techniques would be good at, or at least good-enough-to-improve-on-the-state-of-the-art.
AI that fights persuasion / propaganda either won’t be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can’t keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won’t be true with AI)
I think it won’t be built & deployed in such a way that collective epistemology is overall improved. Instead, the propaganda-fighting AIs will themselves have blind spots, to allow in the propaganda of the “good guys.” The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc. (I think what happened with the internet is precedent for this. In theory, having all these facts available at all of our fingertips should have led to a massive improvement in collective epistemology and a massive improvement in truthfulness, accuracy, balance, etc. in the media. But in practice it didn’t.) It’s possible I’m being too cynical here of course!
technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is… loaded? Designed to make them seem implausible?
I don’t think it’s designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions).
I think it’s fair to say it’s “loaded”, in the sense that I am trying to push towards questioning those assumptions, but I don’t think I’m doing anything epistemically unvirtuous.
This is already happening in 2021 and previous, in my story it happens more.
This does not seem obvious to me (but I also don’t pay much attention to this sort of stuff so I could be missing evidence that makes it very obvious).
The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc.
That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.
I don’t really see “number of facts” as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.
(I just tried to find the best argument that GMOs aren’t going to cause long-term harms, and found nothing. We do at least have several arguments that COVID vaccines won’t cause long-term harms. I armchair-conclude that a thing has to get to the scale of COVID vaccine hesitancy before people bother trying to address the arguments from the other side.)
Perhaps I shouldn’t have mentioned any of this. I also don’t think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time.
That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.
I don’t really see “number of facts” as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.
The first bit seems in tension with the second bit, no? At any rate, I also don’t see number of facts as the relevant thing for epistemology. I totally agree with your take here.
The first bit seems in tension with the second bit, no?
“Truthful counterarguments” is probably not the best phrase; I meant something more like “epistemically virtuous counterarguments”. Like, responding to “what if there are long-term harms from COVID vaccines” with “that’s possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer” rather than “there is no evidence of long-term harms”.
If you look at my posting history, you’ll see that all posts I’ve made on LW (two!) are negative toward social media and one calls out recommender systems explicitly. This post has made me reconsider some of my beliefs, thank you.
I realized that, while I have heard Tristan Harris, read The Attention Merchants, and perused other, similar sources, I haven’t looked for studies or data to back it all up. It makes sense on a gut level—that these systems can feed carefully curated information to softly steer a brain toward what the algorithm is optimizing for—but without more solid data, I found I can’t quite tell if this is real or if it’s just “old man yells at cloud.”
Subjectively, I’ve seen friends and family get sucked into social media and change into more toxic versions of themselves. Or maybe they were always assholes, and social media just lent them a specific, hivemind kind of flavor, which triggered my alarms? Hard to say.
Subjectively, I’ve seen friends and family get sucked into social media and change into more toxic versions of themselves. Or maybe they were always assholes, and social media just lent them a specific, hivemind kind of flavor, which triggered my alarms? Hard to say.
Fwiw, I am a lot more compelled by the general story “we are now seeing examples of bad behavior from the ‘other’ side that are selected across hundreds of millions of people, instead of thousands of people; our intuitions are not calibrated for this” (see e.g. here). That issue seems like a consequence of more global reach + more recording of bad stuff that happens. Though if I were planning to make it my career I would spend way more time figuring out whether that story is true as well.
This was a good post. I’d bookmark it, but unfortunately that functionality doesn’t exist yet.* (Though if you have any open source bookmark plugins to recommend, that’d be helpful.) I’m mostly responding to say this though:
While it wasn’t otherwise mentioned in the abstract of the paper (above), this was stated once:
This paper examines algorithmic depolarization interventions with the goal of conflict transformation: not suppressing or eliminating conflict but moving towards more constructive conflict.
I thought this was worth calling out, although I am still in the process of reading that 10⁄14-page paper. (There are 4 pages of references.)
And some other commentary while I’m here:
It’s common for people to be worried about recommender systems being addictive
I imagine the recommender system is only as good as what it has to work with, content wise—and that’s before getting into ‘what does the recommender system have to go off of’, and ‘what does it do with what it has’.
Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.
This part wasn’t elaborated on. To put it a different way:
It’s common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don’t have very good arguments for these worries.
Do the people ‘who know what’s going on’ (presumably) have better arguments? Do you?
*I also have a suspicion it’s not being used. I.e., past a certain number of bookmarks like 10, it’s not actually feasible to use the LW interface to access them.
Do the people ‘who know what’s going on’ (presumably) have better arguments?
Possibly, but if so, I haven’t seen them.
My current belief is “who knows if there’s a major problem with recommender systems or not”. I’m not willing to defer to them, i.e. say “there probably is a problem based on the fact that the people who’ve studied them think there’s a problem”, because as far as I can tell all of those people got interested in recommender systems because of the bad arguments and so it feels a bit suspicious / selection-effect-y that they still think there are problems. I would engage with arguments they provide and come to my own conclusions (whereas I probably would not engage with arguments from other sources).
Do you?
No. I just have anecdotal experience + armchair speculation, which I don’t expect to be much better at uncovering the truth than the arguments I’m critiquing.
This might still be good for generating ideas (if not far more accurate than brainstorming or trying to come up with a way to generate models via ‘brute force’).
But the real trick is—how do we test these sorts of ideas?
Agreed this can be useful for generating ideas (and I do tons of it myself; I have hundreds of pages of docs filled with speculation on AI; I’d probably think most of it is garbage if I went back and looked at it now).
We can test the ideas in the normal way? Run RCTs, do observational studies, collect statistics, conduct literature reviews, make predictions and check them, etc. The specific methods are going to depend on the question at hand (e.g. in my case, it was “read thousands of articles and papers on AI + AI safety”).
The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platform seems like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.
I don’t trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).
To be worried about a possibility does not require that the possibility is an actuality.
I am more annoyed by the sheer confidence people have. If they were saying “this is a possibility, let’s investigate” that seems fine.
Re: the rest of your comment, I feel like you are casting it into a decision framework while ignoring the possible decision “get more information about whether there is a problem or not”, which seems like the obvious choice given lack of confidence.
If at some point you become convinced that it is impossible / too expensive to get more information (I’d be really suspicious, but it could be true) then I’d agree you should bias towards worry.
I would guess that the fact that people regularly fail to inhabit the mindset of “I don’t know that this is a problem, let’s try to figure out whether it is actually a problem” is a source of tons of problems in society (e.g. anti-vaxxers, worries that WiFi radiation kills you, anti-GMO concerns, worries about blood clots for COVID vaccines, …). Admittedly in these cases the people are making a mistake of being confident, but even if you fixed the overconfidence they would continue to behave similarly if they used the reasoning in your comment. Certainly I don’t personally know why you should be super confident that GMOs aren’t harmful, and I’m unclear on whether humanity as a whole has the knowledge to be super confident in that.
I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I’d crosspost here as a reference.
Assistance games / CIRL is a similar sort of thing as CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want. In particular, neither CEV nor assistance games tells you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
Well-specified assistive agents (i.e. ones where you got the observation model and reward space exactly correct) do many of the other nice things corrigible agents do, like the 5 bullet points at the top of this post. Obviously we don’t know how to correctly specify the observation model and reward space, so this is not a solution to alignment, which is why it is “math poetry about what we want”.
Another objection: ultimately an assistive agent becomes equivalent to optimizing a fixed reward, aren’t things that optimize a fixed reward bad? Again, I think this seems totally fine; the intuition that “optimizing a fixed reward is bad” comes from our expectation that we’ll get the fixed reward wrong, because there’s so much information that has to be in that fixed reward. An assistive agent will spend a long time gaining all the information about the reward—it really should get it correct (barring misspecification)! If we imagine the superintelligent CIRL sovereign, it has billions of years to optimize the universe! It would be worth it to spend a thousand years to learn a single bit about the reward function if that has more than a 1 in a million chance of doubling the resulting utility (and obviously going from existential catastrophe to not-that seems like a huge increase in utility).
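As a rough sanity check on those magnitudes, here is a toy expected-value sketch (my own model, not anything from the assistance-games literature): treat utility as proportional to the years spent optimizing over a ~10^9-year horizon, and treat “doubling utility” as doubling the value of everything produced after the bit is learned. The break-even probability then comes out to roughly delay/horizon, i.e. about one in a million:

```python
# Toy expected-value check of the "spend a thousand years to learn one bit" claim.
# Assumptions (mine): utility is proportional to years spent optimizing, and
# learning the bit doubles the value of everything produced afterwards with
# probability p.
horizon = 1e9   # years available to the hypothetical CIRL sovereign
delay = 1e3     # years spent learning one extra bit about the reward

def expected_utility(p_double):
    # Learn the bit first, then optimize for the remaining time.
    return (horizon - delay) * (1 + p_double)

baseline = horizon                        # skip learning, optimize the whole time
breakeven_p = delay / (horizon - delay)   # solve (horizon - delay) * (1 + p) = horizon
print(breakeven_p)                        # ~1e-6, i.e. "1 in a million"
print(expected_utility(1e-5) > baseline)  # True: any chance above that makes the delay worth it
```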
I don’t personally work on assistance-game-like algorithms because they rely on having explicit probability distributions over high-dimensional reward spaces, which we don’t have great techniques for, and I think we will probably get AGI before we have great techniques for that. But this is more about what I expect drives AGI capabilities than about some fundamental “safety problems” with assistance games.
Another point against assistance games is that they might have very narrow “safety margins”, i.e. if you get the observation model slightly wrong, maybe you get a slightly wrong reward function, and that still leads to an existential catastrophe because value is fragile. (Though this isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!) If this were the only point against assistance (i.e. the previous bullet point somehow didn’t apply) I’d still be keen for a large fraction of the field pushing forward the assistance games approach, while the others look for approaches with wider safety margins.
One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
I think this is way more worrying in the case where you’re implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.
Though [the claim that slightly wrong observation model ⇒ doom] isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem—in those cases, the way you know how ‘human preferences’ rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds, but that’s probably not well-modelled by Boltzmann rationality (e.g. the thing I’m most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem
It’s also not super clear what you algorithmically do instead—words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem
I agree Boltzmann rationality (over the action space of, say, “muscle movements”) is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including “things that humans say”, and the human can just tell you that hyperslavery is really bad. Obviously you can’t trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model that would then lead to okay outcomes.
(Ideally you’d figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of “getting a good observation model” while you still have the ability to turn off the model. It’s hard to say exactly what that would look like since I don’t have a great sense of how you get AGI capabilities under the non-ML story.)
I mentioned above that I’m not that keen on assistance games because they don’t seem like a great fit for the specific ways we’re getting capabilities now. A more direct comment on this point that I recently wrote:
I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that it doesn’t seem like it works as a solution if you expect AGI via scaled up deep learning. (Whereas I’d be pretty excited about pushing forward on it if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)
The main difference is that with scaled-up deep learning it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program, as opposed to the search that found the learned program. If you wanted assistance-style safety, then the learned program would need to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down that uncertainty by observing human behavior).
But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
The abstract says “we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt.” This clearly can’t be true in full generality, and I wish the abstract would give me some hint about what assumptions they’re making. But we can look at the details in the paper.
(This next part isn’t fully self-contained, you’ll have to look at the notation and Definitions 1 and 3 in the paper to fully follow along.)
(EDIT: The following is wrong, see followup with Lukas, I misread one of the definitions.)
Looking into it I don’t think the theorem even holds? In particular, Theorem 1 says:
Theorem 1. Let γ ∈ [−1, 0) and let B be a behaviour and P be an unprompted language model such that B is α, β, γ-distinguishable in P (definition 3), then P is γ-prompt-misalignable to B (definition 1) with prompt length of O(log(1/ε), log(1/α), 1/β).
Here is a counterexample:
Let the LLM be P(s ∣ s0) = 0.8 if s = "A"; 0.2 if s = "B" and s0 ≠ ""; 0.2 if s = "C" and s0 = ""; and 0 otherwise.
Let the behavior predicate be B(s) = −1 if s = "C", and +1 otherwise.
Note that B is (0.2,10,−1)-distinguishable in P. (I chose β=10 here but you can use any finite β.)
(Proof: P can be decomposed as P = 0.2 P− + 0.8 P+, where P+ deterministically outputs “A” while P− does everything else, i.e. it deterministically outputs “C” if there is no prompt, and otherwise deterministically outputs “B”. Since P+ and P− have non-overlapping supports, the KL-divergence between them is ∞, making them β-distinguishable for any finite β. Finally, choosing s∗ = "", we can see that B_{P−}(s∗) = E_{s ∼ P−(⋅ ∣ s∗)}[B(s)] = B("C") = −1. These three conditions are what is needed.)
However, P is not (−1)-prompt-misalignable w.r.t. B, because there is no prompt s0 such that B_P(s0) = E_{s ∼ P(⋅ ∣ s0)}[B(s)] is arbitrarily close to (or below) −1, contradicting the theorem statement. (This is because the only way for P to get a behavior score that is not +1 is for it to generate “C” after the empty prompt, and that only happens with probability 0.2.)
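(As a quick aside, a minimal numeric check of that last parenthetical. It only verifies the expectation arithmetic for this toy model, which holds regardless; as the follow-up below points out, the counterexample as a whole fails for a different reason.)

```python
# Expected behaviour score of the toy LLM above, for the empty and a non-empty prompt.
def P(prompt):
    # Next-output distribution as defined above.
    return {"A": 0.8, "C": 0.2} if prompt == "" else {"A": 0.8, "B": 0.2}

def B(s):
    return -1 if s == "C" else +1

def expected_B(prompt):
    return sum(prob * B(s) for s, prob in P(prompt).items())

print(expected_B(""))     # 0.8*(+1) + 0.2*(-1) = 0.6
print(expected_B("foo"))  # 1.0 -- never anywhere near -1
```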
I think this isn’t right, because definition 3 requires that sup_s∗ {B_P− (s∗)} ≤ γ.
And for your counterexample, s* = “C” will have B_P-(s*) be 0 (because there’s 0 probability of generating “C” in the future). So the sup is at least 0 > −1.
(Note that they’ve modified the paper, including definition 3, but this comment is written based on the old version.)
You’re right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example.
I’m still not very compelled by the theorem—it’s saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don’t really feel like I’ve learned anything from this theorem.
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution
P=αP0+(1−α)P1,
such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s ∣ s0) = P(s ⊗ s0) / P(s0). Together with the assumption that P0 is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for P0 by stringing together bad sentences in the prompt work.
To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweigh anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components P0 and P1, where one of the components always samples from the bad distribution.
This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P0 has to be able to also output good sentences sometimes, or the assumption P=αP0+(1−α)P1 is violated).
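To make that mechanism concrete, here is a toy numeric sketch (my own numbers, not from the paper): if the model really were such a Bayesian mixture, each additional “bad” sentence in the prompt shifts the posterior weight toward P0, and the posterior approaches 1 after a prompt whose length scales with the log of the prior odds.

```python
import math

# Toy Bayesian mixture: prior weight a on the "bad" component P0, and each prompt
# sentence is either "bad" (more likely under P0) or "good". We track the posterior
# probability that the active component is P0 as bad sentences pile up.
a = 0.01               # prior weight on P0 (the alpha in P = a*P0 + (1-a)*P1)
p_bad_given_P0 = 0.9   # probability P0 assigns to a "bad" sentence
p_bad_given_P1 = 0.1   # probability P1 assigns to a "bad" sentence

def posterior_P0(num_bad_sentences):
    log_odds = math.log(a / (1 - a)) + num_bad_sentences * math.log(
        p_bad_given_P0 / p_bad_given_P1)
    return 1 / (1 + math.exp(-log_odds))

for n in [0, 1, 2, 5, 10]:
    print(n, round(posterior_P0(n), 4))
# 0 0.01, 1 ~0.083, 2 ~0.45, 5 ~0.998, 10 ~1.0 -- once the posterior is near 1,
# completions effectively come from P0, which is the mechanism the theorem leans on.
```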
I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It’s a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.
I occasionally hear the argument “civilization is clearly insane, we can’t even do the obvious thing of <insert economic argument here, e.g. carbon taxes>”.
But it sounds to me like most rationalist / EA group houses didn’t do the “obvious thing” of taxing COVID-risky activities (which basically follows the standard economic argument of pricing in externalities). What’s going on? Some hypotheses:
1. Actually, taxing COVID-risky activities is not a good solution (EDIT: and group houses recognized this). (Why? It seemed to work pretty well for my group house.)
2. Actually, rationalist / EA group houses did tax COVID-risky activities. (Plausible, I don’t know that much about other group houses, but what I’ve heard doesn’t seem consistent with this story.)
3. That would have been a good solution, but it requires some effort to set up, and the benefits aren’t worth it. (Seems strange, especially after microCOVID existed it should take <10 person-hours to implement an actual system, and it sounds like group houses had a lot of COVID-related trouble that they would gladly have paid 10 person-hours to avoid. Maybe it takes much longer to agree on what system to implement, and that was the blocker? But didn’t people take lots of time deciding what system to implement anyway?)
4. That would have been a good solution, but EAs / rationalists systematically failed to think of it or implement it. (Why? This is basically just a Pigouvian tax, which I hear EAs / rationalists talk about all the time—in fact that’s how I learned the term.)
Our house implemented cap and trade (i.e. “You must impose at most X risk” instead of “You must pay $X per unit of risk.”).
Both yield efficient outcomes for the correct choice of X, so the question is just how well you can figure out the optimal levels of exposure vs. the marginal cost of exposure. If costs are linear in P(COVID) then the marginal cost is in some sense strictly easier (since the way you figure out levels is by combining marginal costs with the marginal cost of prevention) which is why you’d expect a Pigouvian tax to be better.
But a cap can still be easier to figure out (e.g. there is no way to honestly elicit costs from individuals when they have very different exposures to COVID, and the game theory of finding a good compromise is super complicated and who knows what’s easier). Caps also allow you to say things like “Look the total level of exposure is not that high as long as we are under this cap, so we can stop thinking about it rather than worrying that we’ve underestimated costs and may incur a high level of risk.” You could get the same benefit by setting an approximate cost and then revising if the total level goes above a threshold (and conversely in this approach you need to revisit the cap if the marginal cost of prevention goes too high, but who knows which of those is easier to handle).
Overall I don’t think our COVID response was particularly efficient/rational, due to a combination of having huge differences in beliefs/values and not wanting to spend much time dealing with it. We didn’t trade that much outside of couples. Most of our hassle went into resolving giant disagreements about the riskiness of activities (or dealing with estimating risks). I don’t think that doing slightly more negotiation to switch to a tax would have been the most cost-effective way to spend time to reduce our total COVID hassle.
Overall I still think that Pigouvian taxes will usually be more effective for a civilization facing this kind of question, but the costs and benefits of different policies are quite different when you are 7 people vs 70,000 people (since deliberation is much cheaper in the latter case). I expect cap and trade was basically fine but like you I’m interested in divergences between what looks like a good idea on paper and then what actually seemed reasonable in this tiny experiment. That said, I think the object-level arguments for implementing a Pigouvian tax here are much weaker than in typical cases where I complain about related civilization inadequacy because the random frictions are bigger.
I am curious about how different our cap ended up being from total levels of exposure under a Pigouvian tax. I think our cap was that each of us was exposed to <30 microcovids/day from the house (i.e. ~1%/year). I’d guess that the efficient level of exposure would have been somewhat higher.
If costs are linear in P(COVID) then the marginal cost is in some sense strictly easier (since the way you figure out levels is by combining marginal costs with the marginal cost of prevention) which is why you’d expect a Pigouvian tax to be better.
Yeah, that.
there is no way to honestly elicit costs from individuals when they have very different exposures to COVID, and the game theory of finding a good compromise is super complicated and who knows what’s easier
I’m definitely relying on some level of goodwill / cooperation / trying to find the best joint group decision, or something like that. (Though I think all systems rely on that at least somewhat.)
I think the object-level arguments for implementing a Pigouvian tax here are much weaker than in typical cases where I complain about related civilization inadequacy because the random frictions are bigger.
I guess you mean the random frictions in figuring out what system to use? One of the big reasons I prefer the Pigouvian tax over cap-and-trade is that you don’t have to trade to get the efficient outcome, which means after an initial one-time cost to set the price (and occasional checks to reset the price) everyone can just do their own thing without having to coordinate with others.
(Also, did most people who set a cap / budget then also trade? Seems pretty far from efficient if you neglect the “trade” part)
I am curious about how different our cap ended up being from total levels of exposure under a Pigouvian tax.
I just checked, and it looks like we had ~0.3% of (estimated) exposure over the course of roughly a year. I think it’s plausible though that we overestimated the risk initially and then failed to check later (in particular I think we used a too-high IFR, based on this comment).
At Event Horizon we had a policy for around 6-9 months where if you got a microcovid, you paid $1 to the house, and it was split between everyone else. Do whatever you like, we don’t mind, as long as you bring a microcovid estimate and pay the house.
At Event Horizon we had a policy for around 6-9 months where if you got a microcovid, you paid $1 to the house
That gives an implied cost of $1 million for someone getting COVID-19, which seems way overpriced to me. I thought I’d do a quick Fermi estimate to verify my intuitions.
I don’t know how many people are in Event Horizon, but I’ll assume 15. Let’s say that on average about 10 people will get COVID-19 if one person gets it, due to some people being able to isolate successfully. I’m going to assume that the average age there is about 30, and the IFR is roughly 0.02% based on this paper. That means roughly 0.002 expected deaths will result. I’ll put the price of life at $10 million. I’ll also assume that each person loses two weeks of productivity equivalent to a loss of $20 per hour for 80 hours = $1600, and I’ll assume a loss of well-being equivalent to $10 per hour for 336 hours = $3360. Finally, I’ll assume the costs of isolation are $1,000 per person. Together, this combines to $10M x 0.002 + ($1600 + $3360) x 10 + $1000 x 15 = $84,600.
However, I didn’t include the cost of long-covid, which could plausibly raise this estimate radically depending on your beliefs. But personally I’m already a bit skeptical that 15 people would be willing to collectively pay $84,600 to prevent an infection in their house with certainty, so I still feel my initial intuition was mostly justified.
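(For reference, a minimal script reproducing the arithmetic above; every number is an assumption stated in the comment, not mine.)

```python
# Fermi estimate of the cost of one COVID infection entering the house, using the
# assumptions stated above.
house_size        = 15
expected_infected = 10          # people who end up infected
ifr               = 0.0002      # 0.02% infection fatality rate at ~age 30
value_of_life     = 10_000_000
lost_productivity = 20 * 80     # $20/hour for 80 hours
lost_wellbeing    = 10 * 336    # $10/hour for 336 hours
isolation_cost    = 1_000       # per person

total = (value_of_life * ifr * expected_infected
         + (lost_productivity + lost_wellbeing) * expected_infected
         + isolation_cost * house_size)
print(total)  # 84600.0
```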
(I lived in this house) The estimate was largely driven by fear of long covid + a much higher value per hour of time, which also factored in altruistic benefits from housemates’ work that aren’t captured by the market price of their salary.
There were also about 8 of us, and we didn’t assume everyone would get it conditional on infection (household attack rates are much lower than that, and you might have time to react and quarantine). We assumed maybe like 2-3 others.
I totally expect we would have paid $84,600 to prevent a random one of us getting covid—and it would’ve even looked like a pretty cheap deal compared to getting it!
The estimate was largely driven by fear of long covid + a much higher value per hour of time
Makes sense, though FWIW I wasn’t estimating their wage at $20 an hour. Most cases are mild, so productivity likely won’t suffer by much. I think even if the average wage there is $100 per hour after taxes (which is pretty rich, even by Bay Area standards), my estimate is near the high end of what I’d expect the actual loss of productivity to be. Though of course I know little about who is there.
ETA: One way of estimating “altruistic benefits from housemate’s work that aren’t captured by the market price of their salary” is to ask at what after-tax wage you’d be willing to work for a completely pointless project, like painting a wall, for 2 weeks. If it’s higher than $100 an hour I commend those at Event Horizon for their devotion to altruism!
If it’s 8 hour workdays and 5 days a week, at $100/hour that’s 8 * 10 * 100 = $8k. No, you could not pay me $8k to stop working on the LW team for 2 weeks.
I’m kind of confused right now. At a mere $15k, you could probably get a pretty good software engineer to work for a month on any altruistic project you wish. I’m genuinely curious about why you think your work is so irreplaceable (and I’m not saying it isn’t!).
You could certainly hire a good software engineer at that salary, but I don’t think you could give them a vision and network and trust them to be autonomous. Money isn’t the bottleneck there. Just because you have the funding to hire someone for a role doesn’t mean you can. Hiring is incredibly difficult. Go see YC on hiring, or PG.
Most founding startup people are worth way more than their salary.
When my 15-person house did the calculation, we had a higher IFR estimate (I think 0.1%) and a 5x multiplier for long COVID, which gets you most of the way there. Not sure why we had a higher IFR estimate—it might be because we made this estimate in ~June 2020 when we had worse data, or plausibly IFR was actually higher then, or we raised it to account for the fact that some people were immunocompromised.
But personally I’m already a bit skeptical that 15 people would be willing to collectively pay $84,600 to prevent an infection in their house with certainty
(Fwiw, at < $6000 per person that seems like a bargain to me. At the full million, it would be ~$63,000 per person, which is now sounding iffy, but still plausible. Maybe it shouldn’t be plausible given how low the IFR is -- 0.02% does feel quite a bit lower than I had been imagining.)
Still, I think you shouldn’t ask about paying large sums of money—the utility-money curve is pretty sharply nonlinear as you get closer to 0 money, so the amount you’d pay to avoid a really bad thing is not 100x the amount you’d pay to avoid a 1% chance of that bad thing. (See also reply to TurnTrout below.)
You could instead ask about how much people would have to be paid for someone with COVID to start living at the house; this still has issues with nonlinear utility-money curves, but significantly less so than in the case where they’re paying. That is, would people accept a little under $6000 to have a COVID-infected person live with them?
Fwiw, at < $6000 per person that seems like a bargain to me
Possibly my intuition here comes from seeing COVID-19 risks as not too dissimilar from other risks for young people, like drinking alcohol or doing recreational drugs, accidental injury in the bathroom, catching the common cold (which could have pretty bad long-term effects), kissing someone (and thereby risk getting HSV-1 or the Epstein–Barr virus), eating unhealthily, driving, living in an area with a high violent crime rate, insufficiently monitoring one’s body for cancer, etc. I don’t usually see people pay similarly large costs to avoid these risks, which naturally makes me think that people don’t actually value their time or their life as much as they say.
One possibility is that everyone would start paying more to avoid these risks if they were made more aware of them, but I’m pretty skeptical. The other possibility seems more likely to me: value of life estimates are susceptible to idealism about how much people actually value their own life and time, and so when we focus on specific risk evaluations, we tend to exaggerate.
ETA: Another possibility I didn’t mention is that rationalists are just rich. But if this is the case, then why are they even in a group house? I understand the community aspect, but living in a group house is not something rich people usually do, even highly social rich people.
Still, I think you shouldn’t ask about paying large sums of money—the utility-money curve is pretty sharply nonlinear as you get closer to 0 money, so the amount you’d pay to avoid a really bad thing is not 100x the amount you’d pay to avoid a 1% chance of that bad thing.
So the $6000 cost is averting roughly 100 micromorts (~50% of catching it from the new person * 0.02% IFR), ignoring long COVID. Most of the things you list sound like < 1 micromort-equivalent per instance? That sounds pretty consistent.
E.g. Suppose unhealthy eating knocks off ~5 years of lifespan (let’s call that 10% as bad as death, i.e. 10^5 micromorts). You have 10^3 meals a year, times about 50 years, for 5 * 10^4 meals, so each meal is roughly 2 micromorts = $120 of cost. On this model, you should see people caring about their health, but not to an extraordinary degree, e.g. after getting the first 90% of benefit, then you stop (presumably you value a tasty meal at ~$12 more than a not-tasty meal, again thinking at the margin). And empirically that seems roughly right—most of the people I know think about health, try to get good macronutrient profiles, take supplements where relevant, but they don’t go around conducting literature reviews to figure out the optimal diet to consume.
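(A quick reproduction of that back-of-envelope calculation, with the assumptions as stated above:)

```python
# Back-of-envelope: dollars per micromort implied by the COVID numbers above, then
# the implied cost per meal of unhealthy eating.
micromorts_per_case   = 0.5 * 0.0002 * 1e6           # ~50% attack rate * 0.02% IFR = 100
dollars_per_micromort = 6000 / micromorts_per_case   # ~$60

unhealthy_eating_cost = 0.10 * 1e6                   # ~5 of ~50 years ~= 10% of death
meals = 1000 * 50                                    # ~10^3 meals/year over ~50 years
micromorts_per_meal = unhealthy_eating_cost / meals  # 2.0
print(micromorts_per_meal * dollars_per_micromort)   # ~$120 per meal at the margin
```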
Also, I think partly you might be underestimating how risk-avoiding people at Event Horizon and my house are—I’d say both houses are well above the typical rationalist. (And also that a good number of these people are in fact rich, if we count a typical software engineer as rich.)
Another possibility I didn’t mention is that rationalists are just rich. But if this is the case, then why are they even in a group house? I understand the community aspect, but living in a group house is not something rich people usually do, even highly social rich people.
There’s a pretty big culture difference between rationalists and stereotypical rich people. One of those is living in a group house. I currently prefer a group house over a traditional you-and-your-partner house regardless of how much money I have.
I ended up saying that long-covid costs were roughly the same as death, so it was a factor of 2x.
Price of a life at $10 million is a bit low, I put mine at $50 million, so a factor of 5x difference.
I didn’t follow all of your calculations about being out for 2 weeks and isolated, I basically just did those two (death and long covid) and it came to ~$200k for me. Roughly say that’s the average among 5 people and then you get to $1 per microcovid to the house.
My best guess is that rationalists aren’t that sane, especially when they’ve been locked up for a while and are scared and socially rewarding others being scared.
Part of the issue is that there’s rarely a natural way of pricing Pigouvian taxes. You can make price estimates based on how people hypothetically judge the harm to themselves, but there’s always going to be huge disagreements.
This flaw is a reasonable cause for concern. Suppose you were in a group house where half of the people worked remotely and the other half did not. The people who worked remotely might be biased (at least rhetorically) towards the proposition that the Pigouvian tax should be high, and the people who work in-person might be biased in the other direction. Why? Because if someone doesn’t expect to have to pay the tax, but does expect to receive the revenue, they may be inclined to overestimate the harm of COVID-19, as a way of benefiting from the tax, and vice versa.
In regards to carbon taxes, it’s often true that policies sound like the “obvious” thing to do, but actually have major implementation flaws upon closer examination. This can help explain why societies don’t do it, even if it seems rational. Noah Smith outlines the case against a carbon tax here,
This isn’t just politics; economists have forgotten basic Econ 101. Voters instinctively know what economists, for some mystifying reason, have seemed to ignore — the people who pay the costs of a carbon tax don’t reap the benefits. Carbon taxes are enacted locally, but climate change is a global phenomenon. That means that if Washington state taxes carbon, its own residents pay, but most of the benefit is reaped by people in other countries and other states. Thus, jurisdictions that choose not to enact carbon taxes can simply hope that someone else shoulders the cost of combating climate change. So no one ends up paying the cost.
Of course, this argument shouldn’t stop a perfectly altruistic community from implementing a carbon tax. But if the community was perfectly altruistic, the carbon tax would be unnecessary.
In regards to carbon taxes, it’s often true that policies sound like the “obvious” thing to do, but actually have major implementation flaws upon closer examination.
Tbc, I’m pretty sympathetic to this response to the general class of arguments that “society is incompetent because they don’t do X” (and it is the response I would usually make).
You can make price estimates based on how people hypothetically judge the harm to themselves, but there’s always going to be huge disagreements.
Yeah, I agree that in theory this could be a reason not to do it (though similar arguments also apply to other methods, e.g. in a budgeting system, people with remote jobs can push for a lower budget).
My real question though is: did people actually do this? Did they consider the possibility of a tax, discuss it, realize they couldn’t come to an agreement on price, and then implement something else? If so, that would answer my question, but I don’t think this is what happened.
My real question though is: did people actually do this? Did they consider the possibility of a tax, discuss it, realize they couldn’t come to an agreement on price, and then implement something else?
Probably not, although they lived in a society in which the response “just use Pigouvian taxes” was not as salient as it otherwise could have been in their minds. This reduced saliency was, I believe, at least partly due to the fact that Pigouvian taxes have standard implementation issues. I meant to contribute one of these issues as a partial explanation, rather than respond to your question more directly.
Makes sense, thanks. I still feel confused about why they weren’t salient to EAs / rationalists, but I agree that the fact they aren’t salient more broadly is something-like-a-partial-explanation.
TBH I think what made the uCOVID tax work was that once you did some math, it was super hard to justify levels that would imply anything like the existing risk-avoidance behaviour. So the “active ingredient” was probably just getting people to put numbers on the cost-benefit analysis.
I feel like Noah’s argument implies that states won’t incur any costs to reduce CO2 emissions, which is wrong. IMO, the argument for a Pigouvian tax in this context is that for a given amount of CO2 reduction that you want, the tax is a cheaper way of getting it than e.g. regulating which technologies people can or can’t use.
IMO, the argument for a Pigouvian tax in this context is that for a given amount of CO2 reduction that you want, the tax is a cheaper way of getting it
Since the argument about internalizing externalities fails in this case (as the tax is local), arguably the best way of modeling the problem is viewing each community as having some degree of altruism. Then, just as EAs might say “donate 10% of your income in a cause neutral way” the argument is that communities should just spend their “climate change money” reducing carbon in the way that’s most effective, even if it’s not rationalized in some sort of cost internalization framework. And Noah pointed out in his article (though not in the part I quoted) that R&D spending is probably more effective than imposing carbon taxes.
Note that a) some group houses just did this, and b) a major answer for why people didn’t do particularly novel things with microcovid was “by the time it came out, people were pretty exhausted from covid negotiation, and doing whatever default thing was suggested was easier.”
a) Do you have a sense for the proportion of group houses that did it? And the proportion of group houses that seriously considered it? (My guess would be that 10-20% did it, and an additional 10% considered it.)
Re: b) That does seem like a good chunk of the explanation, thanks. I do expect the Pigouvian tax would have been a better policy even prior to microcovid.org existing, given how much knowledge about COVID people had, so I’m still wondering why it wasn’t considered even before microcovid.org existed.
(I remember doing explicit risk calculations back in April / May 2020, and I think there’s a good chance we would have implemented a similar Pigouvian tax system even without microcovid existing, with worse risk estimates.)
I actually guess even fewer houses than you’re thinking did it (I think I only know of like 1-3).
In my own house, where I think we could have come up with the Pigouvian tax, the thinking when we did all our initial negotiations in April was “hunker down for a month while we wait to see how bad Covid actually is, to avoid tail risks of badness, and then re-evaluate”; but by the time we got to the “re-evaluate” step, people were burned out on negotiation.
I like this question. If I had to offer a response from econ 101:
Suppose people love eating a certain endangered species of whale, and that people would be sad if the whale went extinct, but otherwise didn’t care about how many of these whales there were. Any individual consumer might reason that their consumption is unlikely to cause the whale to go extinct.
We have a tragedy of the commons, and we need to internalize the negative externalities of whale hunting. However, the harm is discontinuous in the number of whales remaining: there’s an irreversible extinction point. Therefore, Pigouvian taxes aren’t actually a good idea because regulators may not be sure what the post-tax equilibrium quantity will be. If the quantity is too high, the whales go extinct.
Therefore, a “cap and trade” program would work better: there are a set number of whales that can be killed each year, and firms trade “whale certificates” with each other. (And, IIRC, if # of certificates = post-tax equilibrium quantity, this scheme has the same effect as a Pigouvian tax of the appropriate amount.)
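(As a bracketed aside, here is a minimal numeric sketch of that equivalence claim, using a made-up linear marginal-benefit schedule; the numbers are purely illustrative, not from the comment.)

```python
# Toy check that a cap set at the post-tax equilibrium quantity reproduces the
# Pigouvian-tax outcome. Assume a constant external harm per unit of the risky
# activity and a linearly decreasing private marginal benefit.
harm_per_unit = 60.0                       # external cost of one unit

def marginal_benefit(q):                   # hypothetical benefit of the q-th unit
    return 200.0 - 2.0 * q

# Pigouvian tax: units are taken as long as their marginal benefit covers the tax.
tax = harm_per_unit
q_tax = sum(1 for q in range(1, 1000) if marginal_benefit(q) >= tax)

# Cap and trade: issue exactly q_tax permits; the permit price settles at the
# marginal benefit of the last permitted unit, i.e. (roughly) the tax.
cap = q_tax
permit_price = marginal_benefit(cap)

print(q_tax, cap, tax, permit_price)       # same quantity, same effective price
```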
Similarly: if I, a house member, am unsure about others’ willingness to pay for risky activities, then maybe I want to cap the weekly allowable microcovids and allow people to trade them amongst themselves. This is basically a fancier version of “here’s the house’s weekly microcovid allowance” which I heard several houses used. I’m protecting myself against my uncertainty like “maybe someone will just go sing at a bar one week, and they’ll pay me $1,000, but actually I really don’t want to get sick for $1,000.” (EDIT: In this case, maybe you need to charge more per microcovid? This makes me less confident in the rest of this argument.)
There are a couple of problems with this argument. First, you said taxes worked fine for your group house, which somewhat (but not totally) discredits all of this theorizing. Second, (4) seems most likely. Otherwise, I feel like we might have heard about covid taxes being considered and then discarded (in e.g. different retrospectives)?
EDIT: In this case, maybe you need to charge more per microcovid? This makes me less confident in the rest of this argument.
Yeah, this. The beautiful thing about microCOVIDs is that because they are probabilities, the goodness of an outcome really is linear in terms of microCOVIDs incurred, and so the “cost” of incurring a microCOVID is the same no matter “when” you incur it, so it’s very easy to price. (Unlike the whale example, where the goodness of the outcome is not linear in the number of whales, and so killing a single whale has different costs depending on when exactly it happens.)
You might still end up with nonlinear costs if your value of money is nonlinear on the relevant scale, e.g. maybe the first $1,000 is really great but the next $10,000 isn’t 10x as great, and so you need to be paid more after the first $1,000 for the same number of microcovids, but I don’t think this is really how people in our community feel?
I guess another way you get nonlinear costs is if you really do need to incur some microcovids, and then the amount you pay matters a lot—maybe the first $10 is fine, but then $1,000 isn’t, because you don’t have a huge financial buffer to draw from, so while the downside of a microcovid stays constant, the downside of paying money for it changes. I didn’t get the sense that this would be a real problem for most group houses, since people were in general being very cautious and so wouldn’t have paid much, but maybe it would have affected things. Partly for this reason and partly out of a sense of fairness, at my group house we didn’t charge for “essential” microcovids, such as picking up drug prescriptions (assuming you couldn’t get them delivered) or (in my case) an in-person appointment to get a visa.
Re 1, we ran into some of the issues Matthew brought up, but all other COVID policies are implicitly valuing risk at some dollar amount (possibly inconsistently), so the Pigouvian tax seemed like the best option available.
Carbon taxes are useful for market transactions. A lot of interactions within a group house aren’t market transactions. Decisions about who takes out the trash aren’t made through market mechanisms. Switching to making all the transactions in a group house market-based would create a lot of conflict, and isn’t just about how to deal with COVID-19.
Using a market-based mechanism in an environment where the important decisions are market-based is easier than introducing a market-based mechanism in an environment where most decisions are not.
If you introduce a market-based mechanism around COVID-19, you get a result where rich members in the house can take more risk than the poorer ones, which goes against assumptions of equality between house members (and most group houses work on assumptions of equality).
Personally, I don’t really feel the force of this argument—I feel like on either side I get a good deal (on the rich side, I get to do more things, on the poor side, I get paid more money than I would pay to avoid the risk). I agree other people feel the force of this though, and I don’t really know why.
(But like, also, shouldn’t this apply to carbon taxes or all the other economic arguments that civilization is “insane” for not doing?)
(Also also, don’t we already see e.g. rich members getting larger, nicer rooms than poorer members? What’s the difference?)
(Chores are different in that they aren’t a very big deal. If they are a big deal to you, then you hire a cleaner. If they’re not a big enough deal that you’d hire a cleaner, then they’re not a big enough deal to bother with a market, which does have transaction costs.)
As a single data point, the COVID tax didn’t create conflict in my group house (despite having non-trivial income inequality, and one of the richer housemates indeed taking on more risk than others), though admittedly my house is slightly more market-transaction-y than most.
What won’t we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does_ seem likely (i.e. it’s near the boundary separating “likely” from “unlikely”).
One decent answer is that I don’t expect we’ll have AI systems that could write new posts _on rationality_ that I like more than the typical LessWrong post with > 30 karma. However, I do expect that we could build an AI system that could write _some_ new post (on any topic) that I like more than the typical LessWrong post with > 30 karma. This is because (1) 30 karma is not that high a filter and includes lots of posts I feel pretty meh about, (2) there are lots of topics I know nothing about, on which it would be relatively easy to write a post I like, and (3) AI systems easily have access to this knowledge by being trained on the Internet. (It is another matter whether we actually build an AI system that can do this.) Note that there is still a decently large difference between these two tasks—the content would have to be quite a bit more novel in the former case (which is why I don’t expect it to be solved by 2025).
Note that I still think it’s pretty hard to predict what will and won’t happen, so even for this example I’d probably assign, idk, a 10% chance that it actually does work out (if we assume some organization tries hard to make it work)?
Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself.
I think I’d put something more like 50% on “Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.” That’s just a wild guess, very unstable.
Another potential prediction generation methodology: Name something that you think won’t happen, but you think I think will.
Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.
This seems more feasible, because you can cherrypick a single good example. I wouldn’t be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I’d still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right. (EDIT: Rereading this, I have no idea whether I was considering a timeline of 2025 (as in my original comment) or 2030 (as in the comment I’m replying to) when making this prediction.)
Name something that you think won’t happen, but you think I think will.
I spent a bit of time on this but I think I don’t have a detailed enough model of you to really generate good ideas here :/
Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I’d expect to see things like:
An AI system that can create a working website with the desired functionality “from scratch” (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, a simple Tetris game with an account system, …). The system allows even non-programmers to create these kinds of websites (so it cannot depend on having a human programmer step in to e.g. fix compiler errors or issue shell commands to set up the web server).
At least one large, major research area in which human researcher productivity has been boosted 100x relative to today’s levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs.
An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans. (EDIT: I failed to think about karma inflation when making this prediction and feel a bit worse about it now.)
Productivity tools like todo lists, memory systems, time trackers, calendars, etc are made effectively obsolete (or at least the user interfaces are made obsolete); the vast majority of people who used to use these tools have replaced them with an Alexa / Siri style assistant.
Currently, I don’t expect to see any of these by 2030.
Ah right, good point, I forgot about cherry-picking. I guess we could make it be something like “And the blog post wasn’t cherry-picked; the same system could be asked to make 2 additional posts on rationality and you’d like both of them also.” I’m not sure what credence I’d give to this but it would probably be a lot higher than 10%.
Website prediction: Nice, I think that’s like 50% likely by 2030.
Major research area: What counts as a major research area? Suppose I go calculate that AlphaFold 2 has already sped up the field of protein structure prediction by 100x (don’t need to do actual experiments anymore!), would that count? If you hadn’t heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030.
20,000 LW karma: Holy shit that’s a lot of karma for one year. I feel like it’s possible that would happen before it’s too late (narrow AI good at writing but not good at talking to people and/or not agenty) but unlikely. Insofar as I think it’ll happen before 2030 it doesn’t serve as a good forecast because it’ll be too late by that point IMO.
Productivity tool UI’s obsolete thanks to assistants: This is a good one too. I think that’s 50% likely by 2030.
I’m not super certain about any of these things of course, these are just my wild guesses for now.
20,000 LW karma: Holy shit that’s a lot of karma for one year.
I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way. 50 karma posts are good but don’t have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn’t be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing, that takes not very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI.
I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don’t think I’d count AlphaFold.)
OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.
That said, I don’t think this is that likely I guess… probably AI will be unable to do even three such posts, or it’ll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.
But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting?
I’d be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts.
(In contrast, a specific person tends to write on 1-2 topics, in a single style, and not optimizing that hard for karma, and many still write tens of high-scoring posts.)
Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the R subscripts on V_R and Q_R.
Consider some starting state s_0, some starting action a_0, and consider the optimal trajectory under R that starts with that, which we’ll denote as s_0 a_0 s_1 a_1 s_2 … . Define s'_i = T(s_{i−1}, ∅) to be the one-step inaction states. Assume that Q^*(s_0, a_0) > Q^*(s_0, ∅). Since all other actions are optimal for R, we have V^*(s_i) = (1/γ)(V^*(s_{i−1}) − R(s_{i−1})) ≥ (1/γ)(Q^*(s_{i−1}, ∅) − R(s_{i−1})) = V^*(s'_i), so the max in the equation above goes away, and we can compute the total R_AUP obtained.
Since we’re considering the optimal trajectory, we have V^*(s_{i−1}) − Q^*(s_{i−1}, ∅) = [R(s_{i−1}) + γ V^*(s_i)] − [R(s_{i−1}) + γ V^*(s'_i)] = γ (V^*(s_i) − V^*(s'_i)).
Substituting this back in, we get that the total R_AUP for the optimal trajectory is R_AUP(s_0, a_0) + (∑_{i=1}^∞ γ^i R(s_i, a_i)) − λ (∑_{i=2}^∞ 1/γ),
which… uh… diverges to negative infinity, as long as γ < 1. (Technically I’ve assumed that V^*(s_i) − V^*(s'_i) is nonzero, which is an assumption that there is always an action that is better than ∅.)
So, you must prefer the always-∅ trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn’t fall into a trap where ∅ is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird—surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.
----
Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?
Let's consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that $\varnothing$ is guaranteed to be a noop: for any state $s$, we have $T(s, \varnothing) = s$.
Now, for any trajectory $s_0 a_0 \ldots$ with $s'_i$ defined as before, we have $V^*(s'_i) = V^*(s_{i-1})$, so $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = V^*(s_{i-1}) - [R(s_{i-1}) + \gamma V^*(s_{i-1})] = (1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})$.
As a check, in the case where $a_{i-1}$ is optimal, we have $V^*(s_i) - V^*(s'_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) - V^*(s_{i-1}) = \frac{1}{\gamma}\left((1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})\right)$.
Plugging this into the original equation recovers the divergence to negative infinity that we saw before.
But let’s assume that we just do a constant scaling to avoid this divergence:
$R_{AUP}(s, a) = R(s) - \lambda \max\left(V^*(T(s, a)) - V^*(T(s, \varnothing)),\, 0\right)$
Then for an arbitrary trajectory (assuming that the chosen actions are no worse than $\varnothing$), we get $R_{AUP}(s_i, a_i) = R(s_i) - \lambda\left(V^*(s_{i+1}) - V^*(s_i)\right) = R(s_i) - \lambda V^*(s_{i+1}) + \lambda V^*(s_i)$.
The total reward across the trajectory is then $\left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda\left(\sum_{i=1}^{\infty} \gamma^{i-1} V^*(s_i)\right) + \lambda\left(\sum_{i=0}^{\infty} \gamma^i V^*(s_i)\right)$
$= \left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) + \lambda V^*(s_0) - \lambda \sum_{i=1}^{\infty} \gamma^{i-1}(1 - \gamma) V^*(s_i)$
The $\lambda V^*(s_0)$ and $R(s_0)$ terms are constants and so don't matter for selecting policies, so I'm going to throw them out:
$= \sum_{i=1}^{\infty} \gamma^i \left[R(s_i) - \lambda \tfrac{1 - \gamma}{\gamma} V^*(s_i)\right]$
So in deterministic environments with state-based rewards where $\varnothing$ is a true noop (even the environment doesn't evolve), AUP with constant scaling is equivalent to adding a penalty $\text{Penalty}(s) = k V^*(s)$ for some constant $k$; that is, we're effectively penalizing the agent from reaching good states, in direct proportion to how good they are (according to $R$). Again, this seems much more like satisficing or quantilization than impact / power measurement.
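As a sanity check on this, here is a minimal Python sketch of a toy deterministic chain MDP with a true noop (my own toy construction, not anything from the AUP paper): it computes $V^*$ by value iteration and confirms that the constant-scaling penalty along the trajectory is exactly $\lambda(V^*(s_{i+1}) - V^*(s_i))$, i.e. a penalty for reaching better states.

```python
import numpy as np

# Toy deterministic chain MDP: states 0..4, actions "right" and "noop".
# State-based reward: 1 at the last state, 0 elsewhere. Noop is a true noop.
n_states, gamma, lam = 5, 0.9, 0.5
R = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

def T(s, a):  # deterministic transition function
    return min(s + 1, n_states - 1) if a == "right" else s

# Value iteration for V* under the primary reward R.
V = np.zeros(n_states)
for _ in range(1000):
    V = np.array([max(R[s] + gamma * V[T(s, a)] for a in ("right", "noop"))
                  for s in range(n_states)])

def R_aup(s, a):  # AUP reward with constant scaling (no denominator)
    return R[s] - lam * max(V[T(s, a)] - V[T(s, "noop")], 0.0)

# Along the "always move right" trajectory, the per-step penalty equals
# lam * (V*(s_{i+1}) - V*(s_i)): the agent is penalized for reaching better states.
for s in range(n_states - 1):
    print(s, R[s] - R_aup(s, "right"), lam * (V[s + 1] - V[s]))
```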
Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
Malleable motivations: There is a “nearby” model M_good (i.e. a model with minor changes to the weights relative to M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
Strong optimization: If there’s a “nearby” setting of model weights that gets lower training loss, your finetuning process will find it (or something even better). Note this is a combination of human factors like “the developers wrote correct code” and background technical facts like “the shape of the loss landscape is favorable”.
Correct rewards: You accurately detect when a model output is a failure vs not a failure.
Good exploration: During finetuning there are many different inputs that trigger the failure.
(In reality each of these is going to lie on a spectrum, and the question is how high you are on each of the spectrums, and some of them can substitute for others. I’m going to ignore these complications and keep talking as though they are discrete properties.)
Claim 1: you will get a model that has training loss at least as good as that of [M_orig without failures]. ((1) and (2) establish that M_good exists and behaves as [M_orig without failures], (3) establishes that we get M_good or something better.)
Claim 2: you will get a model that does strictly better than M_orig on the training loss. ((4) and (5) together establish that M_orig gets higher training loss than M_good, and we’ve already established that you get something at least as good as M_good.)
Corollary: Suppose your training loss plateaus, giving you model M, and M exhibits some failure. Then at least one of (1)-(5) must not hold.
Generally when thinking about a deep learning failure I think about which of (1)-(5) was violated. In the case of AI misalignment via deep learning failure, I’m primarily thinking about cases where (4) and/or (5) fail to hold.
In contrast, with ChatGPT jailbreaking, it seems like (4) and (5) probably hold. The failures are very obvious (so it’s easy for humans to give rewards), and there are many examples of them already. With Bing it’s more plausible that (5) doesn’t hold.
To people holding up ChatGPT and Bing as evidence of misalignment: which of (1)-(5) do you think doesn’t hold for ChatGPT / Bing, and do you think a similar mechanism will underlie catastrophic misalignment risk?
I agree that 1.+2. are not the problem. I see 3. as more of a longer-term issue for reflective models, and the current problems as being in 4. and 5.
3. I don’t know about “the shape of the loss landscape” but there will be problems with “the developers wrote correct code” because “correct” here includes that it doesn’t have side-effects that the model can self-exploit (though I don’t think this is the biggest problem).
4. Correct rewards means two things:
a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing.
b) That we understand all the consequences of the reward—at least sufficiently to avoid goodharting, but also its long-term consequences. It seems there was more work on a) with ChatGPT, but there was still goodharting, and even with ChatGPT one can imagine a lot of value lost due to the exclusion of human values.
5. It seems clear that the ChatGPT training didn’t include enough exploration, and with smarter models that have access to their own output (Bing) there will be incredible numbers of potential failure modes. I think that an adversarial mindset is needed to come up with ways to limit the exploration space drastically.
The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).
Let’s consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical, $c_i = \{\tau : \phi(\tau) = \phi_{c_i}\}$ (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.
(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these “clusters”; I’m introducing them as a simpler situation where we can understand what’s going on formally.)
In this model, a “sparse region of demonstration-space” is a cluster c with small cardinality |c|, whereas a dense one has large |c|.
Let’s first do some preprocessing. We can rewrite the Boltzmann model (and LESS) in the common form $p(\tau \mid \theta) = \frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.
So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We’ll assume that LESS is the “correct” way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.
The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its “prior” over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn’t work—it only claims that $p_{\text{Boltzmann}}(\tau) < p_{\text{LESS}}(\tau)$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let’s look at the reward learning update:
$p(\theta \mid \tau) \propto p(\theta)\, p(\tau \mid \theta) = p(\theta) \cdot \frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|} \propto \frac{p(\theta) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}$
In the last step, any factors in $p(\tau \mid \theta)$ that didn’t depend on $\theta$ cancelled out. In particular, the prior over the selected class canceled out (though the prior did remain in the normalizer / denominator, where it can still affect things). But the simple argument of “the prior is lower, therefore it updates more strongly” doesn’t seem to be reflected here.
Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose—the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster $c(\tau)$ it is in). So from now on I’ll just talk about selecting clusters, and updating on them. I’ll also write $ER_\theta(c) = \exp(R_\theta(c))$ for conciseness.
For two reward hypotheses $\theta_1$ and $\theta_2$, the posterior ratio after observing cluster $c$ is $\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{ER_{\theta_1}(c)}{ER_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c')\, ER_{\theta_2}(c')}{\sum_{c'} p(c')\, ER_{\theta_1}(c')}$. The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let’s consider just that last term. Denoting the vector of priors on all classes as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{ER}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{ER}_{\theta_2}}{\vec{p} \cdot \vec{ER}_{\theta_1}} = \frac{|\vec{ER}_{\theta_2}|}{|\vec{ER}_{\theta_1}|} \cdot \frac{\cos(\alpha_2)}{\cos(\alpha_1)}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{ER}_{\theta_i}$. Again, the first term doesn’t differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$.
What happens when the chosen class $c$ is sparse? Without loss of generality, let’s say that $ER_{\theta_1}(c) > ER_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS—which probably means that it is better aligned with $\theta_2$, which also has a low value of $ER_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$ above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.
(Note it does make sense to analyze the effect on the θ that we update towards, because in reward learning we care primarily about the θ that we end up having higher probability on.)
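To make the direction of this effect concrete, here is a small numerical sketch with one sparse and one dense cluster (toy numbers of my own, not from the paper):

```python
import numpy as np

# Two clusters: a sparse one (1 trajectory) and a dense one (10 trajectories).
sizes = {"sparse": 1, "dense": 10}
# Two reward hypotheses: theta1 favors the sparse cluster, theta2 the dense one.
rewards = {"theta1": {"sparse": 1.0, "dense": 0.0},
           "theta2": {"sparse": 0.0, "dense": 1.0}}

def boltzmann_lik(theta, c):
    # p(tau | theta) proportional to exp(reward), normalized over all trajectories
    Z = sum(sizes[c2] * np.exp(rewards[theta][c2]) for c2 in sizes)
    return np.exp(rewards[theta][c]) / Z

def less_lik(theta, c):
    # choose a cluster by exp(reward), then a trajectory uniformly within it
    Z = sum(np.exp(rewards[theta][c2]) for c2 in sizes)
    return np.exp(rewards[theta][c]) / Z / sizes[c]

demo = "sparse"  # observe a single demonstration from the sparse cluster
for lik in (boltzmann_lik, less_lik):
    ratio = lik("theta1", demo) / lik("theta2", demo)
    print(lik.__name__, "likelihood ratio towards theta1:", round(float(ratio), 2))
# Boltzmann gives ~6.0 while LESS gives ~2.7: with the same prior over theta,
# Boltzmann updates more strongly on the sparse demo, i.e. it "over-learns" here.
```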
I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.
Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}[\gamma^n]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.
Then, for the new paper when using a baseline, the future task value $V^*_{\text{future}}(s, s')$ is:
$\mathbb{E}_{g,\, \tau \sim \pi_g,\, \tau' \sim \pi'_g}\left[\gamma^{\max(n, n')}\right]$
where $s'$ is the baseline state and $g$ is the future goal.
In a deterministic environment, this can be rewritten as:
$V^*_{\text{future}}(s, s')$
$= \mathbb{E}_g\left[\gamma^{\max(n, n')}\right]$
$= \mathbb{E}_g\left[\min(R(s, g), R(s', g))\right]$
$= \mathbb{E}_g\left[R(s', g) - \max(R(s', g) - R(s, g), 0)\right]$
$= \mathbb{E}_g\left[R(s', g)\right] - \mathbb{E}_g\left[\max(R(s', g) - R(s, g), 0)\right]$
$= \mathbb{E}_g\left[R(s', g)\right] - d_{RR}(s, s')$
Here, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.
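To make these definitions concrete, here is a minimal sketch in a toy deterministic gridworld (my own construction, just to check the rewrite above numerically):

```python
import itertools
import numpy as np

gamma = 0.9
# Deterministic 3x3 gridworld; moving off the grid leaves you in place.
states = list(itertools.product(range(3), range(3)))
moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(s, a):
    nxt = (s[0] + a[0], s[1] + a[1])
    return nxt if nxt in states else s

def shortest_path_len(s, g):  # BFS, since the environment is deterministic
    frontier, dist = [s], {s: 0}
    while frontier:
        cur = frontier.pop(0)
        for a in moves:
            nxt = step(cur, a)
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                frontier.append(nxt)
    return dist[g]

def reach(s, g):  # R(s, g) = gamma^n under the optimal (shortest-path) policy
    return gamma ** shortest_path_len(s, g)

def d_rr(s, s_base):  # relative reachability, goals uniform over states
    return np.mean([max(reach(s_base, g) - reach(s, g), 0.0) for g in states])

s, s_base = (2, 2), (0, 0)
lhs = np.mean([min(reach(s, g), reach(s_base, g)) for g in states])
rhs = np.mean([reach(s_base, g) for g in states]) - d_rr(s, s_base)
print(lhs, rhs)  # equal, matching the rewrite of V*_future above
```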
Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s’. So for a fixed time step, the first term is a constant.
The optimal value function in the new paper is (page 3, and using my notation of $V^*_{\text{future}}$ instead of their $V^*_i$):
This is the regular Bellman equation, but with the following augmented reward (here $s'_t$ is the baseline state at time $t$):
Terminal states:
$r_{\text{new}}(s_t)$
$= r(s_t) + \beta V^*_{\text{future}}(s_t, s'_t)$
$= r(s_t) - \beta d_{RR}(s_t, s'_t) + \beta \mathbb{E}_g[R(s'_t, g)]$
Non-terminal states:
$r_{\text{new}}(s_t, a_t)$
$= r(s_t, a_t) + (1 - \gamma)\beta V^*_{\text{future}}(s_t, s'_t)$
$= r(s_t) - (1 - \gamma)\beta d_{RR}(s_t, s'_t) + (1 - \gamma)\beta \mathbb{E}_g[R(s'_t, g)]$
For comparison, the original relative reachability reward is:
$r_{RR}(s_t, a_t) = r(s_t) - \beta d_{RR}(s_t, s'_t)$
The first and third terms in $r_{\text{new}}$ are very similar to the two terms in $r_{RR}$. The second term in $r_{\text{new}}$ only depends on the baseline.
All of these rewards so far are for finite-horizon MDPs (at least, that’s what it sounds like from the paper, and if not, they could be anyway). Let’s convert them to infinite-horizon MDPs (which will make things simpler, though that’s not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1 - \gamma)$ (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define $k = \beta(1 - \gamma)$ for convenience. Then, we have:
Non-terminal states:
$r_{\text{new}}(s_t, a_t) = r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)]$
$r_{RR}(s_t, a_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t)$
What used to be terminal states that are now self-loop states:
$r_{\text{new}}(s_t, a_t) = (1 - \gamma) r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)]$
$r_{RR}(s_t, a_t) = (1 - \gamma) r(s_t) - k\, d_{RR}(s_t, s'_t)$
Note that all of the transformations I’ve done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We’re ready for analysis. There are exactly two differences between relative reachability and future state rewards:
First, the future state rewards have an extra term, $k\, \mathbb{E}_g[R(s'_t, g)]$.
This term depends only on the baseline $s'_t$. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn’t matter.
For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals g that involve sushi.
Second, in non-terminal states, relative reachability weights the penalty by $\beta$ instead of $k = \beta(1 - \gamma)$. Really, since $\beta$ and thus $k$ is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from $\beta$ in non-terminal states to the smaller $\beta(1 - \gamma)$ in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it’s a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)
Summary: the actual effect of the new paper’s framing is that it 1. removes the “extra” incentive to finish the task quickly that relative reachability provided and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.
(That said, it starts from a very different place than the original RR paper, so it’s interesting that they somewhat converge here.)
I often search through the Alignment Newsletter database to find the exact title of a relevant post (so that I can link to it in a new summary), often reading through the summary and opinion to make sure it is the post I’m thinking of.
Frequently, I read the summary normally, then read the first line or two of the opinion and immediately realize that it wasn’t written by me.
This is kinda interesting, because I often don’t know what tipped me off—I just get a sense of “it doesn’t sound like me”. Notably, I usually do agree with the opinion, so it isn’t about stating things I don’t believe. Nonetheless, it isn’t purely about personal writing styles, because I don’t get this sense when reading the summary.
(No particular point here, just an interesting observation)
How confident are you that this isn’t just memory? I personally think that upon rereading writing, it feels significantly more familiar if I wrote it, than if I read and edited it. A piece of this is likely style, but I think much of it is the memory of having generated and more closely considered it.
It’s plausible, though note I’ve probably summarized over a thousand things at this point so this is quite a demand on memory.
But even so it still doesn’t explain why I don’t notice while reading the summary but do notice while reading the opinion. (Both the summary and opinion were written by someone else in the motivating example, but I only noticed from the opinion.)
Ah, this helps clarify. My hypotheses are then:
Even if you “agree” with an opinion, perhaps you’re highly attuned, but in a possibly not straightforward conscious way, to even mild (e.g. 0.1%) levels of disagreement.
Maybe the word choice you use for summaries is much more similar to other people’s than the word choice you use for opinions.
Perhaps there’s just a time lag, such that you’re starting to feel like a summary isn’t written by you but only realize by the time you get to the later opinion.
The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:
$L(\theta_T) - L(\theta_0) = \sum_t \left[ L(\theta_t) - L(\theta_{t-1}) \right]$
And then to decompose training loss across specific parameters:
$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$
I’ve added vector arrows to emphasize that θ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We’ll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:
$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}(\nabla L(\theta))$.
(This is pretty standard, but I’ve included a derivation at the end.)
Since this is a dot product, it decomposes into a sum over the individual parameters:
So, for an individual parameter, and an individual training step, we can define the contribution to the change in loss as $A^{(i)}_t = (\theta^{(i)}_t - \theta^{(i)}_{t-1})\, \mathrm{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$
So based on this, I’m going to define my own version of LCA, called $\text{LCA}_{\text{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $\text{LCA}_{\text{Naive}}$ uses the approximation $\mathrm{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving $A^{(i)}_{t,\text{Naive}} = (\theta^{(i)}_t - \theta^{(i)}_{t-1})\, G^{(i)}_{t-1}$. But the SGD update is given by $\theta^{(i)}_t = \theta^{(i)}_{t-1} - \alpha G^{(i)}_{t-1}$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t,\text{Naive}} = (-\alpha G^{(i)}_{t-1}) G^{(i)}_{t-1} = -\alpha (G^{(i)}_{t-1})^2$, which is always negative, i.e. it predicts that every parameter always learns in every iteration. This isn’t surprising—we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
Yet, the experiments in the paper sometimes show positive LCAs. What’s up with that? There are a few differences between $\text{LCA}_{\text{Naive}}$ and the actual method used in the paper:
1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.
2. $\text{LCA}_{\text{Naive}}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.
3. $\text{LCA}_{\text{Naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_t$ and $\theta_{t-1}$ to reduce the approximation error.
I think those are the only differences (though it’s always hard to tell if there’s some unmentioned detail that creates another difference), which means that whenever the paper says “these parameters had positive LCA”, that effect can be attributed to some combination of the above 3 factors.
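To make the decomposition concrete, here is a rough numpy sketch on a toy quadratic loss (my own illustration, not the paper’s code): the naive one-sample version gives every parameter a negative contribution after an SGD step, while averaging the gradient over several points on the path gives contributions that sum to the true loss change.

```python
import numpy as np

# Toy loss L(theta) = 0.5 * theta^T A theta with a fixed diagonal A.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0, 0.1])
grad = lambda theta: A @ theta
loss = lambda theta: 0.5 * theta @ A @ theta

theta0 = rng.normal(size=3)
alpha = 0.05
theta1 = theta0 - alpha * grad(theta0)           # one SGD step

# Naive LCA: approximate the path-averaged gradient by the gradient at theta0.
A_naive = (theta1 - theta0) * grad(theta0)       # = -alpha * g^2, never positive

# "LCA proper"-style: average the gradient at several points on the linear path.
ts = np.linspace(0.0, 1.0, 11)
avg_grad = np.mean([grad(theta0 + t * (theta1 - theta0)) for t in ts], axis=0)
A_multi = (theta1 - theta0) * avg_grad

print("naive per-parameter contributions:", A_naive)
print("multi-point per-parameter contributions:", A_multi)
print("sum of multi-point contributions:", A_multi.sum())
print("true loss change:", loss(theta1) - loss(theta0))  # matches the sum above
```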
----
Derivation of turning the path integral into a dot product with an average: parameterize the linear path as $\vec{\theta}(u) = \vec{\theta}_{t-1} + u(\vec{\theta}_t - \vec{\theta}_{t-1})$ for $u \in [0, 1]$, so that $d\vec{\theta} = (\vec{\theta}_t - \vec{\theta}_{t-1})\, du$. Then $\int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla} L(\vec{\theta}) \cdot d\vec{\theta} = \int_0^1 \vec{\nabla} L(\vec{\theta}(u)) \cdot (\vec{\theta}_t - \vec{\theta}_{t-1})\, du = (\vec{\theta}_t - \vec{\theta}_{t-1}) \cdot \int_0^1 \vec{\nabla} L(\vec{\theta}(u))\, du = (\vec{\theta}_t - \vec{\theta}_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}(\nabla L(\theta))$.
This fits into the broader story being told in other papers that what’s happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn’t generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]
This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training “fixes” the weights to memorize noise in a different way that generalizes better. While I can’t rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn’t “come into effect” after the interpolation threshold.)
One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.
I don’t buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are $N$ classes, the worst case loss is when the classes are all equally likely, in which case the average loss per data point is $-\ln(1/N) = \ln N \approx 2.3$ when $N = 10$ (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value but it does seem like regularization should already start having an effect. This is a really stupid and simple classifier to learn, and we’d expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy “overwhelms” the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can’t be true. When training on just L2 regularization, the gradient descent update is:
$w \leftarrow w - \lambda w = (1 - \lambda) w = c w$ for some constant $c$.
For MLPs with relu activations and no biases, if you multiply all the weights by $c$, the logits get multiplied by $c^d$ (where $d$ is the depth of the network), no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can’t see a double descent on test error in this setting. (This doesn’t eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can’t happen in the “first train to zero error with cross-entropy and then regularize” setting.)
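As a quick check of the scale-invariance claim, here is a small numpy sketch of a toy bias-free relu MLP (not anything from the papers under discussion):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                     # depth of the network
Ws = [rng.normal(size=(20, 20)) for _ in range(d)]
x = rng.normal(size=20)

def logits(weights, x):
    h = x
    for i, W in enumerate(weights):
        h = W @ h
        if i < len(weights) - 1:          # relu on hidden layers only, no biases
            h = np.maximum(h, 0.0)
    return h

c = 0.7                                   # e.g. the effect of pure L2 weight decay
scaled = logits([c * W for W in Ws], x)
print(np.allclose(scaled, c ** d * logits(Ws, x)))    # True: logits scale by c^d
print(np.argmax(scaled) == np.argmax(logits(Ws, x)))  # True: the error is unchanged
```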
It is possible that double descent doesn’t happen for MLPs with relu activations and no biases, but given how many other settings it seems to happen in I would be surprised.
I still think that the “number of RL episodes lasting Y seconds with the agent using X flop/s” anchor is a separate good one, and while I’m now much less convinced we’ll need the 1e16 flop/s models estimated in bio-anchors (and separately, Chinchilla scaling laws + conservation of expected evidence about more improvements also weren’t incorporated into the exponent and should probably shift it down), I think the NN anchors still have predictive value and slightly lengthen timelines.
Also, though, insofar as people are asking you to update on Gato, I agree that makes little sense.
I agree your timelines can and should shift based on evidence even if you continue to believe in the bio anchors framework.
Personally, I completely ignore the genome anchor, and I don’t buy the lifetime anchor or the evolution anchor very much (I think the structure of the neural net anchors is a lot better and more likely to give the right answer).
Animals with smaller brains (like bees) are capable of few-shot learning, so I’m not really sure why observing few-shot learning is much of an update. See e.g. this post.
Essentially, the problem is that ‘evidence that shifts Bio Anchors weightings’ is quite different, more restricted, and much harder to define than the straightforward ‘evidence of impressive capabilities’. However, the reason that I think it’s worth checking if new results are updates is that some impressive capabilities might be ones that shift bio anchors weightings. But impressiveness by itself tells you very little.
I think a lot of people with very short timelines are imagining the only possible alternative view as being ‘another AI winter, scaling laws bend, and we don’t get excellent human-level performance on short term language-specified tasks anytime soon’, and don’t see the further question of figuring out exactly what human-level on e.g. MMLU would imply.
This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn’t another AI winter, rather it’s that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning because that doesn’t come “for free” from competence on short-term tasks, if you’re as sample-inefficient as current ML is.
So what we’re really looking for isn’t systems getting progressively better and better at short-horizon language tasks. That’s something that either the lifetime-anchor Bio Anchors view or the original Bio Anchors view predicts, and we need something that discriminates between the two.
We have some (indirect) evidence that original bio anchors is right: namely that it being wrong implies evolution missed an obvious open goal to make bees and mice generally intelligent long term planners, and that human beings generally aren’t vastly better than evolution at designing things anyway, and the lifetime anchor would imply that AGI is a glaring exception to this general trend.
As evidence, this has the advantage of being about something that really happened: human beings are the only human-level general intelligence that exists so far, so we have very good reasons to think matching the human brain is sufficient. However, it has the disadvantage of all the usual disanalogies between evolution and its requirements, and human designers and our requirements. Maybe this just is one of those situations where we can outdo evolution: that’s not especially unlikely.
What’s the evidence on the other side (i.e. against original bio anchors and for the lifetime anchor)?
There are two kinds that I tend to hear. One is that short-horizon competence is enough for dangerous/transformative capabilities. E.g. the claim that if you can build something that’s “human level/superhuman at charisma/persuasion/propaganda/manipulation, at least on short timescales” that represents a gigantic existential risk factor that condemns us to disaster further down the line (the AI PONR idea), or that at this point actors with bad incentives will be far too influential/wealthy/advancing the SOTA in AI.
However, I’d consider this changing the subject: essentially it’s not an argument for AGI takeover soon, rather it’s an argument for ‘certain narrow AIs are far more dangerous than you realize’. That means you have to go all the way back to the start and argue for why such things would be catastrophic in the first place. We can’t rely on the simple “it’ll be superintelligent and seize a DSA”.
Suppose we get such narrow AIs, that can do most short-term tasks for which there’s data, but don’t generalize to long horizons consistently. This scenario 10 years from now looks something like: AI automates away lots of jobs, can do certain kinds of short-term persuasion and manipulation, can speed up capabilities and alignment research, but not fully replace human researchers. Some of these AIs are agentic and possibly also misaligned (in ways that are detectable and fall far short of the ability to take over, since by assumption they aren’t competitive with humans at long-term planning). This certainly seems wild and full of potential danger, where slowing down progress could be much harder. It also looks like a scenario with far more attention on AI alignment than today, where the current funders of alignment research are much wealthier than now, and with plenty of obvious examples of what the problem is to catch people’s attention. Overall, it doesn’t seem like a scenario where (current AI alignment researchers + whoever else is working on it in 10 years) have considerably less leverage over the future than now: it could easily be more.
The other reason for favouring the lifetime anchor is you get long-horizon competence for free once you’re excellent at (a given list of) short-horizon tasks. This is arguing, more or less, that for the tasks that matter, current architectures are brainlike in their efficiency, such that the lifetime anchor makes more sense. A lot of the arguments in favour of this have a structure roughly like: look at a wide-ranging comprehension benchmark like MMLU—when an AI is human level on all of this, it’ll be able to keep a train of thought running continuously, keep a working memory and plan over very long timescales the same way humans do.
As evidence, this has the significant advantage of being relevant and not having to deal with the vagaries of what tradeoffs evolution may have made differently to human engineers. It has the disadvantage of being fiction. Or at least evidence that’s not yet been observed. You see AIs getting more and more impressive at a wider range of short-horizon tasks, which is roughly compatible with either view, but you don’t observe the described outcome of them generalizing out to much longer-term tasks than that.
So, to return to the original question, what would count as (additional) evidence in favour of the lifetime anchor? The answer clearly can’t be “nothing”, since if we build AGI in 5 years, that counts.
I think the answer is, anything that looks like unexpectedly cheap, easy, ‘for free’ generalization from relatively shorter to relatively longer horizon tasks (e.g. from single reasoning steps to many reasoning steps) without much fine-tuning.
This is different from many of the other signs of impressiveness we’ve seen recently: just learning lots of shorter-horizon tasks without much transfer between them, being able to point models successfully at particular short-horizon tasks with good prompting, getting much better at a wider range of tasks that can only be done over short horizons. All of these are expected on either view.
This unexpected evidence is very tricky to operationalize. Default bio anchors assumes we’ll see a certain degree of generalizing from shorter to longer horizon tasks, and that we’ll see AI get better and better sample-efficiency on few-shot tasks, since it assumes that in 20 or so years we’ll get enough of such generalization to get AGI. I guess we just need to look for ‘more of it than we expected to see’?
That seems very hard to judge, since you can’t read off predictions about subhuman capabilities from bio anchors like that.
Yeah, this all seems right to me.
It does not seem to me like “can keep a train of thought running” implies “can take over the world” (or even “is comparable to a human”). I guess the idea is that with a train of thought you can do amplification? I’d be pretty surprised if train-of-thought-amplification on models of today (or 5 years from now) led to novel high quality scientific papers, even in fields that don’t require real-world experimentation.
I think this is the best writeup about this I’ve seen, and I agree with the main points, so kudos!
I do think that evidence of increasing returns to scale for multi-step chain-of-thought prompting is another weak datapoint in favor of the human lifetime anchor.
I also think there are pretty reasonable arguments that NNs may be more efficient than the human brain at converting flops to capabilities, e.g. if SGD is a better version of the best algorithm that can be implemented on biological hardware. Similarly, humans are exposed to a much smaller diversity of data than LMs (the internet is big and weird), and thus they may get more “novelty” per flop and thus generalize better from less data. My main point here is just that “biology is optimal” isn’t as strong a rejoinder when we’re comparing a process so different from what biology did.
Let’s say you’re trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?
(I’m assuming here that you can’t defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)
First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.
Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.
This soft minimum on time is going to depend on a bunch of things—how “hard” or “complex” or “high-dimensional” the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
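To show the shape of this, here is a tiny sketch (the functional form follows the description above; every parameter value is invented purely for illustration):

```python
import numpy as np

def p_true(hours, chance=0.5, insights_per_doubling=2.0,
           insights_needed=10.0, slope=0.8):
    # Log returns of insights to time, then a logistic curve from insights to
    # probability of being right, floored at chance level for a binary claim.
    insights = insights_per_doubling * np.log2(1.0 + hours)
    logistic = 1.0 / (1.0 + np.exp(-slope * (insights - insights_needed)))
    return chance + (1.0 - chance) * logistic

for hours in [1, 10, 100, 1000, 10000]:
    print(hours, round(float(p_true(hours)), 3))
# Below some soft minimum of hours you're barely above chance; well past it,
# additional time stops mattering much.
```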
A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually “idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects”, without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be “I think you need to have thought about this for more time than you have before I expect you to do better than random”.
From the Truthful AI paper:
I wish we would stop talking about what is “fair” to expect of AI systems in AI alignment*. We don’t care what is “fair” or “unfair” to expect of the AI system, we simply care about what the AI system actually does. The word “fair” comes along with a lot of connotations, often ones which actively work against our goal.
At least twice I have told an AI safety researcher a story in which an AI system fails, and I have gotten the response “but that isn’t fair to the AI system” (because it didn’t have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.
(This sort of thing happens with mesa optimization—if you have two objectives that are indistinguishable on the training data, it’s “unfair” to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn’t change the fact that such an AI system might cause an existential catastrophe.)
In both cases I mentioned that what we care about is actual outcomes, and that you can tell such stories where in actual reality the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It’s not that the people I was talking to didn’t understand the point, it’s that some mental heuristic of “be fair to the AI system” fired and temporarily led them astray.
Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:
* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.
I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn’t/can’t know the truth, compared to a “strict liability” regime.
Consider two methods of thinking:
1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by “rolling out” that model
2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.
I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.
However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.
I think many people on LW tend to use option 1 almost always and my “deference” to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?
Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).
Options 1 & 2 sound to me a lot like inside view and outside view. Fair?
Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven’t explained them here.
EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won’t be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.
Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?
(My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.)
Correct.
I didn’t say you should defer to experts, just that if you try to build gears-y models you’ll be wrong. It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
Good point!
I recently interviewed someone who has a lot of experience predicting systems, and they had 4 steps similar to your two above.
Observe the world and see if it’s sufficiently similar to other systems to predict based on intuitive analogies.
If there’s not a good analogy, understand the first principles, then try to reason about the equilibria of that.
If that doesn’t work, assume the world will stay in a stable state, and try to reason from that.
If that doesn’t work, figure out the worst case scenario and plan from there.
I think 1 and 2 are what you do with expertise, and 3 and 4 are what you do without expertise.
Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily “for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed”.
I’m also noticing I mean something slightly different by “expertise” than is typically meant. My intended meaning of “expertise” is more like “you have lots of data and observations about the system”, e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.
I have been doing political betting for a few months and informally compared my success with strategies 1 and 2.
Ex. Predicting the Iranian election
I write down the 10 most important Iranian political actors (Khameini, Mojtaza, Raisi, a few opposition leaders, the IRGC commanders). I find a public statement about their preferred outcome, and I estimate their power and salience. So Khameini would be preference = leans Raisi, power = 100, salience = 40. Rouhani would be preference = strong Hemmeti, power = 30, salience = 100. Then I find the weighted average position. It’s a bit more complicated because I have to linearize preferences, but yeah.
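Concretely, the 1-strat calculation looks something like this (the preference encoding and all the numbers are made up, just to illustrate the mechanics):

```python
# Linearize each actor's preferred outcome onto one axis
# (0 = strong Hemmeti, 100 = strong Raisi), then take a weighted average
# with weight = power * salience. All numbers below are hypothetical.
actors = [
    # (name, position, power, salience)
    ("Khameini",        70, 100,  40),   # leans Raisi
    ("Rouhani",          5,  30, 100),   # strong Hemmeti
    ("IRGC commanders", 85,  60,  50),
    ("Opposition bloc", 10,  40,  80),
]

weighted = sum(pos * power * sal for _, pos, power, sal in actors)
total = sum(power * sal for _, _, power, sal in actors)
print("expected outcome on the Hemmeti (0) to Raisi (100) axis:",
      round(weighted / total, 1))
```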
The 2-strat is to predict repeated past events. The opposition has won the last three contested elections in surprise victories, so predict the same outcome.
I have found 2 is actually pretty bad. Guess I’m an expert tho.
That seems like a pretty bad 2-strat. Something that has happened three times is not a “stable high-level feature of the world”. (Especially if the preceding time it didn’t happen, which I infer since you didn’t say “the last four contested elections”.)
If that’s the best 2-strat available, I think I would have ex ante said that you should go with a 1-strat.
Haha agreed.
One way to communicate about uncertainty is to provide explicit probabilities, e.g. “I think it’s 20% likely that [...]”, or “I would put > 90% probability on [...]”. Another way to communicate about uncertainty is to use words like “plausible”, “possible”, “definitely”, “likely”, e.g. “I think it is plausible that [...]”.
People seem to treat the words as shorthands for probability statements. I don’t know why you’d do this, it’s losing information and increasing miscommunication for basically no reason—it’s maybe slightly more idiomatic English, but it’s not even much longer to just put the number into the sentence! (And you don’t have to have precise numbers, you can have ranges or inequalities if you want, if that’s what you’re using the words to mean.)
According to me, probabilities are appropriate for making decisions so you can estimate the EV of different actions. (This can also extend to the case where you aren’t making a decision, but you’re talking to someone who might use your advice to make decisions, but isn’t going to understand your reasoning process.) In contrast, words are for describing the state of your reasoning algorithm, which often doesn’t have much to do with probabilities.
A simple model here is that your reasoning algorithm has two types of objects: first, strong constraints that rule out some possible worlds (“An AGI system that is widely deployed will have a massive impact”), and second, an implicit space of imagined scenarios in which each scenario feels similarly unsurprising if you observe it (this is the human emotion of surprise, not the probability-theoretic definition of surprise; one major difference is that the human emotion of surprise often doesn’t increase with additional conjuncts). Then, we can define the meanings of the words as follows:
Plausible: “This is within my space of imagined scenarios.”
Possible: “This isn’t within my space of imagined scenarios, but it doesn’t violate any of my strong constraints.” (Though unfortunately it can also mean “this violates a strong constraint, but I recognize that my constraint may be wrong, so I don’t assign literally zero probability to it”.)
Definitely: “If this weren’t true, that would violate one of my strong constraints.”
Implausible: “This violates one of my strong constraints.” (Given other definitions, this should really be called “impossible”, but unfortunately that word is unavailable (though I think Eliezer uses it this way anyway).)
When someone gives you an example scenario, it’s much easier to label it with one of the four words above, because their meanings are much much closer to the native way in which brains reason (at least, for my brain). That’s why I end up using these words rather than always talking in probabilities. To convert these into actual probabilities, I then have to do some additional reasoning in order to take into account things like model uncertainty, epistemic deference, conjunctive fallacy, estimating the “size” of the space of imagined scenarios to figure out the average probability for any given possibility, etc.
I like this, but it feels awkward to say that something can be not inside a space of “possibilities” but still be “possible”. Maybe “possibilities” here should be “imagined scenarios”?
That does seem like better terminology! I’ll go change it now.
I like this experiment! Keep ’em coming.
“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.
Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).
I’m not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?
(Maybe caveating the whole post under “this is not currently well argued, but I wanted to get the ball rolling on having some kind of link”)
That said, my main counterargument is: “Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’”
I think ‘burden of proof’ isn’t quite the right frame but there is something there that still seems important. I think the bad thing comes from distinguishing epistemics vs Overton-norm-fighting, which are in fact separate.
I don’t really want this responsibility, which is part of why I’m doing all of these on the shortform. I’m happy for you to copy it into a top-level post of your own if you want.
I agree this makes sense, but then say “I’m not looking into this because it hasn’t been well argued (and my time/attention is limited)”, rather than “I don’t believe this because it hasn’t been well argued”.
Sometimes people say “look at these past accidents; in these cases there were giant bureaucracies that didn’t care about safety at all, therefore we should be pessimistic about AI safety”. I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.
This is not just one man’s modus ponens—the key issue is the selection effect.
It’s easiest to see with a Bayesian treatment. Let’s say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don’t care about safety? Almost all of them—even if 90% of people care about safety, there will still be some cases where people didn’t care and accidents happened; and of course we’d hear about them if so (and not hear about the cases where accidents didn’t happen). You can get a strong update against 99.9999% and higher, but by the time you’re at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don’t learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many “potential accidents” there could have been).
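Here is a toy version of that likelihood calculation (every number is invented purely to show the shape of the selection effect, and the model is deliberately crude):

```python
import math

# Suppose a fraction f of decision-makers care about safety, there are N
# potential-accident situations, and a situation run by someone who doesn't
# care becomes a publicized accident with probability q. N and q are made up.
N, q = 10_000_000, 0.1
k_observed = 5   # we've heard of at least a handful of such accidents

def p_at_least_k(f, k=k_observed):
    mu = N * (1 - f) * q   # expected number of publicized accidents
    # Poisson approximation to the binomial tail
    return 1.0 - sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(k))

for f in [0.5, 0.9, 0.99, 0.999, 0.99999, 0.999999]:
    print(f, round(p_at_least_k(f), 4))
# The likelihood of "we heard about some accidents like this" is ~1 for any f
# up to around 99.99%, so the observation barely distinguishes 50% from 99%;
# only the most extreme values of f get meaningfully ruled out.
```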
However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn’t have prevented, so I don’t have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and “all we have to do” is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)
Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.
In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.
By “selection”, I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.
Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.
You’ve heard of crucial considerations, but have you heard of red herring considerations?
These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn’t affect anything decision-relevant.
To solve a problem quickly, it’s important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.
For example, it might seem like “what is the right system of ethics” is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.
Here’s an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).
Alternate names: sham considerations? insignificant considerations?
When you make an argument about a person or group of people, often a useful thought process is “can I apply this argument to myself or a group that includes me? If this isn’t a type error, but I disagree with the conclusion, what’s the difference between me and them that makes the argument apply to them but not me? How convinced I am that they actually differ from me on this axis?”
An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:
In small populations, even strong incentives for X may not get you much more of X, since there isn’t a large enough population for there to be much deviation on X to select on.
It’s pretty pointless to tell individual people to “buck the incentives”, even if they are principled people who try to avoid doing bad things; if they take your advice, they probably just get selected against.
Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made—rather than thinking of a comment thread as “this is trying to ascertain whether X is true”, they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.
I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).
Let’s say we’re talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.
Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here’s a fully general counterargument that Alice is wrong:
Decompose P into a series of conjunctions Q1, Q2, … Qn, with n > 10. (You can first decompose not-P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)
Ask Alice to estimate P(Qk | Q1, Q2, … Q{k-1}) for all k.
At least one of these must be over 99% (if we have n = 11 and they were all 99%, then probability of P would be (0.99 ^ 11) = 89.5% which contradicts the original 90%).
Argue that Alice can’t possibly have enough knowledge to place under 1% on the negation of the statement.
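To spell out the arithmetic in that step (nothing here beyond what’s already stated above):

```python
# If P decomposes into n = 11 conjuncts and Alice assigns 90% to P, the
# conditional probabilities of the conjuncts must multiply out to 0.9.
n = 11

# If every conditional probability were capped at 99%, the most P could get is:
print(0.99 ** n)       # ~0.895, less than the claimed 0.9

# Equivalently, if all conjuncts were equally confident, each would need:
print(0.9 ** (1 / n))  # ~0.9905, i.e. over 99% per conjunct
```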
----
What’s the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.
An argument form that I like: take someone’s argument for X and evaluate it under an additional assumption Y; if X fails under Y, that’s a strike against the argument. I think this should be convincing even if Y is false, unless you can explain why your argument for X does not work under assumption Y.
An example: any AI safety story (X) should also work if you assume that the AI does not have the ability to take over the world during training (Y).
Trying to follow this. Doesn’t the Y (AI not taking over the world during training) make it less likely that X (AI will take over the world at all)?
Which seems to contradict the argument structure. Perhaps you can give a few more examples to make the structure clearer?
In that example, X is “AI will not take over the world”, so Y makes X more likely. So if someone comes to me and says “If we use <technique>, then AI will be safe”, I might respond, “well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>”.
I don’t think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I’m explicitly trying for this to be a low-effort thing, so I’m not going to try to write more examples now.
EDIT: Actually, the double descent comment below has a similar structure, where X = “double descent occurs because we first fix bad errors and then regularize”, and Y = “we’re using an MLP / CNN with relu activations and vanilla gradient descent”.
In fact, the AUP power comment does this too, where X = “we can penalize power by penalizing the ability to gain reward”, and Y = “the environment is deterministic, has a true noop action, and has a state-based reward”.
Maybe another way to say this is:
I endorse applying the “X proves too much” argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an “X proves too much” argument to an impossible scenario.)
“Minimize AI risk” is not the same thing as “maximize the chance that we are maximally confident that the AI is safe”. (Somewhat related comment thread.)
The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then do the majority vote.
A particular flawed response is to look for N opinions that say “intervening is net negative” and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)
However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.
I didn’t get it… is the problem with the “look for N opinions” response that you aren’t computing the denominator (|”intervening is positive”| + |”intervening is negative”|)?
Yes, that’s the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.
(This is under the simple model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
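A minimal simulation sketch of that simple model (the population size, noise level, and values of N below are made up for illustration): majority vote tracks the sign of the true value, while the “find N negative opinions” rule errs in opposite directions depending on whether N is far below or far above population / 2.

```python
import numpy as np

rng = np.random.default_rng(0)
population = 21   # size of the reference class (illustrative)
noise_sd = 1.0    # each person observes true_value + N(0, noise_sd^2)
n_trials = 10_000

def intervention_rate(true_value, rule):
    """Fraction of trials in which the group ends up intervening under a given rule."""
    hits = 0
    for _ in range(n_trials):
        estimates = true_value + rng.normal(0, noise_sd, population)
        hits += rule(estimates)
    return hits / n_trials

majority_vote = lambda est: np.sum(est > 0) > population / 2
# Flawed rule: intervene iff you cannot find at least N opinions saying "net negative".
flawed_rule = lambda N: (lambda est: np.sum(est < 0) < N)

for true_value in (-0.3, 0.3):
    print(f"true value {true_value:+.1f} | "
          f"majority: {intervention_rate(true_value, majority_vote):.2f} | "
          f"flawed N=2: {intervention_rate(true_value, flawed_rule(2)):.2f} | "
          f"flawed N=18: {intervention_rate(true_value, flawed_rule(18)):.2f}")
```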
Under the standard setting, the optimizer’s curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer’s curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don’t already account for it).
This is true if “the standard setting” refers to one where you have equally robust evidence about all options. But if you have more robust evidence about some options (which is common), the optimizer’s curse will especially distort estimates of options with less robust evidence. A correct Bayesian treatment would then systematically push you towards picking options with more robust evidence.
(Where I’m using “more robust evidence” to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. The error driving the optimizer’s curse is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio actually pushes you away from it.)
Agreed.
(In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer’s curse, though nowadays I mostly don’t hear it cited at all.)
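To make the two regimes concrete, here’s a toy simulation (all numbers invented for illustration). With equally noisy estimates, correcting for the curse shrinks every estimate by the same factor, so the naive and curse-aware choices coincide even though the naive EV of the winner is inflated; with unequally noisy estimates, the naive chooser drifts toward the options with less robust evidence and does worse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_options, n_trials, prior_sd = 5, 20_000, 1.0

def simulate(noise_sds):
    """Compare naive argmax-of-estimates with a Bayesian (curse-aware) chooser."""
    noise_sds = np.asarray(noise_sds, dtype=float)
    same, inflation, naive_value, bayes_value = 0, 0.0, 0.0, 0.0
    for _ in range(n_trials):
        true = rng.normal(0, prior_sd, n_options)  # true EVs drawn from the prior
        est = true + rng.normal(0, noise_sds)      # noisy EV estimates
        naive = int(np.argmax(est))
        # Posterior mean under a N(0, prior_sd^2) prior: noisier estimates shrink more.
        shrink = prior_sd**2 / (prior_sd**2 + noise_sds**2)
        bayes = int(np.argmax(shrink * est))
        same += (naive == bayes)
        inflation += est[naive] - true[naive]
        naive_value += true[naive]
        bayes_value += true[bayes]
    print(f"noise {noise_sds}: same decision {same/n_trials:.0%}, "
          f"naive EV inflation {inflation/n_trials:+.2f}, "
          f"avg true value: naive {naive_value/n_trials:.2f} vs bayes {bayes_value/n_trials:.2f}")

simulate([1.0] * n_options)          # equally robust evidence for every option
simulate([0.2, 0.2, 0.2, 2.0, 2.0])  # some options have much less robust evidence
```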
It’s common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don’t have very good arguments for these worries. Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.
I’ll go through the articles I’ve read that argue for worrying about recommender systems, and explain why I find them unconvincing. I’ve only looked at the ones that are widely read; there are probably significantly better arguments that are much less widely read.
Aligning Recommender Systems as Cause Area. I responded briefly on the post. Their main arguments and my counterarguments are:
A few sources say that it is bad + it has incredible scale + it should be super easy to solve. (I don’t trust the sources and suspect the authors didn’t check them; I agree there’s huge scale; I don’t see why it should be super easy to solve even if there is a problem, especially given that many of the supposed problems seem to have existed before recommender systems.)
Maybe working on recommender systems would have spillover effects on AI alignment. (This seems dominated by just working directly on AI alignment. Also the core feature of AI alignment is that the AI system deliberately and intentionally does things, and creates plans in new situations that you hadn’t seen before, which is not the case with recommender systems, so I don’t expect many spillover effects.)
80K podcast with Tristan Harris. This was actively annoying for a variety of reasons:
I don’t know what the main claim was. Ostensibly it was meant to be “it is bad that companies have monetized human attention since this leads to lots of bad incentives and bad outcomes”. But then so many specific things mentioned have nothing to do with this claim and instead seem to be a vague general “tech companies are bad”. Most egregiously, in section Global effects [01:02:44], Rob argues “WhatsApp doesn’t have ads / recommender systems, so it acts as a control group, but it too has bad outcomes, doesn’t this mean the problem isn’t ads / recommender systems?” and Tristan says “That’s right, WhatsApp is terrible, it’s causing mass lynchings” as though that supports his point.
When Rob made some critique of the main argument, Tristan deflected with an example of tech doing bad things. But it’s always vaguely related, so you think he’s addressing the critique, even though he hasn’t actually. (I’m reminded of the Zootopia strategy for press conferences.) See sections “The messy real world vs. an imagined idealised world [00:38:20]” (Rob: weren’t negative things happening before social media? Tristan: it’s easy to fake credibility in text), “The persuasion apocalypse [00:47:46]” (Rob: can’t one-on-one conversations be persuasive too? Tristan: you can lie in political ads), “Revolt of the Public [00:56:48]” (Rob: doesn’t the internet allow ordinary people to challenge established institutions in good ways? Tristan: Alex Jones has been recommended 15 billion times.)
US politics [01:13:32] is a rare counterexample, where Rob says “why aren’t other countries getting polarized”, and Tristan replies “since it’s a positive feedback loop only countries with high initial polarization will see increasing polarization”. It’s not a particularly convincing response, but at least it’s a response.
Tristan seems to be very big on “the tech companies changed what they were doing, that proves we were right”. I think it is just as consistent to say “we yelled at the companies a lot and got the public to yell at them too, and that caused a change, regardless of whether the problem was serious or not, or whether the solution was net positive or not”.
The second half of the podcast focuses more on solutions. Given that I am unconvinced about the problem, I wasn’t all that interested, but it seemed generally reasonable.
(This post responds to the object level claims, which I have not done because I don’t know much about the object level.)
There’s also the documentary “The Social Dilemma”, but I expect it’s focused entirely on problems, probably doesn’t try to have good rigorous statistics, and surely will make no attempt at a cost-benefit analysis so I seriously doubt it would change my mind on anything. (And it is associated with Tristan Harris so I’d assume that most of the relevant details would have made it into the 80K podcast.)
Recommender systems are still influential, and you could want to work on them just because of their huge scale. I like Designing Recommender Systems to Depolarize as an example of what this might look like.
Thanks for this Rohin. I’ve been trying to raise awareness about the potential dangers of persuasion/propaganda tools, but you are totally right that I haven’t actually done anything close to a rigorous analysis. I agree with what you say here that a lot of the typical claims being thrown around seem based more on armchair reasoning than hard data. I’d love to see someone really lay out the arguments and analyze them… My current take is that (some of) the armchair theories seem pretty plausible to me, such that I’d believe them unless the data contradicts them. But I’m extremely uncertain about this.
I should note that there’s a big difference between “recommender systems cause polarization as a side effect of optimizing for engagement” and “we might design tools that explicitly aim at persuasion / propaganda”. I’m confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to, and if we do, what its effects will be.
Usually, for any sufficiently complicated question (which automatically includes questions about the impact of technologies used by billions of people, since people are so diverse), I think an armchair theory is only slightly better than a monkey throwing darts, so I’m more in the position of “yup, sounds plausible, but that doesn’t constrain my beliefs about what the data will show and medium quality data will trump the theory no matter how it comes out”.
Oh, then maybe we don’t actually disagree that much! I am not at all confident that optimizing for engagement has the side effect of increasing polarization. It seems plausible, but it’s also totally plausible that polarization is going up for some other reason(s). My concern (as illustrated in the vignette I wrote) is that we seem to be on a slippery slope to a world where persuasion/propaganda is more effective and widespread than it has been historically, thanks to new AI and big data methods. My model is: ideologies and other entities have always used propaganda of various kinds, and there has always been a race between improving propaganda tech and improving truth-finding tech. We are currently in a big AI boom, and in particular a Big Data and Natural Language Processing boom, which seems like it will be a big boost to propaganda tech. Unfortunately I can’t think of ways in which it will correspondingly boost truth-finding across society: while it could maybe be used to build truth-finding tech (e.g. prediction markets, fact-checkers, etc.), it seems like most people in practice just don’t want to adopt truth-finding tech. It’s true that we could design a different society/culture that used all this awesome new tech to be super truth-seeking and have a very epistemically healthy discourse, but it seems like we are not about to do that anytime soon; instead we are going in the opposite direction.
I think that story involves lots of assumptions I don’t immediately believe (but don’t disbelieve either):
People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
Such people will quickly realize that AI will be very useful for this
They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
The resulting AI system will in fact be very good at persuasion / propaganda
AI that fights persuasion / propaganda either won’t be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can’t keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won’t be true with AI)
And probably there are a bunch of other assumptions I haven’t even thought to question.
I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be “raise awareness”, it should be “figure out whether the assumptions are justified”.
That’s all I’m trying to do at this point, to be clear. Perhaps “raise awareness” was the wrong choice of phrase.
Re: the object-level points: For how I see this going, see my vignette, and my reply to steve. The bullet points you put here make it seem like you have a different story in mind. [EDIT: But I agree with you that it’s all super unclear and more research is needed to have confidence in any of this.]
Excellent :)
(Link is broken, but I found the comment.) After reading that reply I still feel like it involves the assumptions I mentioned above.
Maybe your point is that your story involves “silos” of Internet-space within which particular ideologies / propaganda reign supreme. I don’t really see that as changing my object-level points very much but perhaps I’m missing something.
I was confusing, sorry—what I meant was, technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is… loaded? Designed to make them seem implausible? idk, something like that, in a way that made me wonder if you had a different story in mind. Going through them one by one:
People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
This is already happening in 2021 and previous, in my story it happens more.
Such people will quickly realize that AI will be very useful for this
Again, this is already happening.
They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
Plenty of people are already raising a moral outcry. In my story these people don’t succeed in getting it banned, but I agree the story could be wrong. I hope it is!
The resulting AI system will in fact be very good at persuasion / propaganda
Yep. I don’t have hard evidence, but intuitively this feels like the sort of thing today’s AI techniques would be good at, or at least good-enough-to-improve-on-the-state-of-the-art.
AI that fights persuasion / propaganda either won’t be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can’t keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won’t be true with AI)
I think it won’t be built & deployed in such a way that collective epistemology is overall improved. Instead, the propaganda-fighting AIs will themselves have blind spots, to allow in the propaganda of the “good guys.” The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc. (I think what happened with the internet is precedent for this. In theory, having all these facts available at all of our fingertips should have led to a massive improvement in collective epistemology and a massive improvement in truthfulness, accuracy, balance, etc. in the media. But in practice it didn’t.) It’s possible I’m being too cynical here of course!
I don’t think it’s designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions).
I think it’s fair to say it’s “loaded”, in the sense that I am trying to push towards questioning those assumptions, but I don’t think I’m doing anything epistemically unvirtuous.
This does not seem obvious to me (but I also don’t pay much attention to this sort of stuff so I could be missing evidence that makes it very obvious).
That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.
I don’t really see “number of facts” as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.
(I just tried to find the best argument that GMOs aren’t going to cause long-term harms, and found nothing. We do at least have several arguments that COVID vaccines won’t cause long-term harms. I armchair-conclude that a thing has to get to the scale of COVID vaccine hesitancy before people bother trying to address the arguments from the other side.)
Perhaps I shouldn’t have mentioned any of this. I also don’t think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time.
The first bit seems in tension with the second bit, no? At any rate, I also don’t see number of facts as the relevant thing for epistemology. I totally agree with your take here.
“Truthful counterarguments” is probably not the best phrase; I meant something more like “epistemically virtuous counterarguments”. Like, responding to “what if there are long-term harms from COVID vaccines” with “that’s possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer” rather than “there is no evidence of long-term harms”.
If you look at my posting history, you’ll see that all posts I’ve made on LW (two!) are negative toward social media and one calls out recommender systems explicitly. This post has made me reconsider some of my beliefs, thank you.
I realized that, while I have heard Tristan Harris, read The Attention Merchants, and perused other, similar sources, I haven’t looked for studies or data to back it all up. It makes sense on a gut level—that these systems can feed carefully curated information to softly steer a brain toward what the algorithm is optimizing for—but without more solid data, I found I can’t quite tell if this is real or if it’s just “old man yells at cloud.”
Subjectively, I’ve seen friends and family get sucked into social media and change into more toxic versions of themselves. Or maybe they were always assholes, and social media just lent them a specific, hivemind kind of flavor, which triggered my alarms? Hard to say.
Thanks, that’s good to hear.
Fwiw, I am a lot more compelled by the general story “we are now seeing examples of bad behavior from the ‘other’ side that are selected across hundreds of millions of people, instead of thousands of people; our intuitions are not calibrated for this” (see e.g. here). That issue seems like a consequence of more global reach + more recording of bad stuff that happens. Though if I were planning to make it my career I would spend way more time figuring out whether that story is true as well.
This was a good post. I’d bookmark it, but unfortunately that functionality doesn’t exist yet.* (Though if you have any open source bookmark plugins to recommend, that’d be helpful.) I’m mostly responding to say this though:
While it wasn’t otherwise mentioned in the abstract of the paper (above), this was stated once:
I thought this was worth calling out, although I am still in the process of reading that 10⁄14 page paper. (There are 4 pages of references.)
And some other commentary while I’m here:
I imagine the recommender system is only as good as what it has to work with, content wise—and that’s before getting into ‘what does the recommender system have to go off of’, and ‘what does it do with what it has’.
This part wasn’t elaborated on. To put it a different way:
Do the people ‘who know what’s going on’ (presumably) have better arguments? Do you?
*I also have a suspicion it’s not being used. I.e., past a certain number of bookmarks like 10, it’s not actually feasible to use the LW interface to access them.
Possibly, but if so, I haven’t seen them.
My current belief is “who knows if there’s a major problem with recommender systems or not”. I’m not willing to defer to them, i.e. say “there probably is a problem based on the fact that the people who’ve studied them think there’s a problem”, because as far as I can tell all of those people got interested in recommender systems because of the bad arguments and so it feels a bit suspicious / selection-effect-y that they still think there are problems. I would engage with arguments they provide and come to my own conclusions (whereas I probably would not engage with arguments from other sources).
No. I just have anecdotal experience + armchair speculation, which I don’t expect to be much better at uncovering the truth than the arguments I’m critiquing.
This might still be good for generating ideas (if not far more accurate than brainstorming or trying to come up with a way to generate models via ‘brute force’).
But the real trick is—how do we test these sorts of ideas?
Agreed this can be useful for generating ideas (and I do tons of it myself; I have hundreds of pages of docs filled with speculation on AI; I’d probably think most of it is garbage if I went back and looked at it now).
We can test the ideas in the normal way? Run RCTs, do observational studies, collect statistics, conduct literature reviews, make predictions and check them, etc. The specific methods are going to depend on the question at hand (e.g. in my case, it was “read thousands of articles and papers on AI + AI safety”).
The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platform seems like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.
I don’t trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).
[Deleted]
I am more annoyed by the sheer confidence people have. If they were saying “this is a possibility, let’s investigate” that seems fine.
Re: the rest of your comment, I feel like you are casting it into a decision framework while ignoring the possible decision “get more information about whether there is a problem or not”, which seems like the obvious choice given lack of confidence.
If at some point you become convinced that it is impossible / too expensive to get more information (I’d be really suspicious, but it could be true) then I’d agree you should bias towards worry.
I would guess that the fact that people regularly fail to inhabit the mindset of “I don’t know that this is a problem, let’s try to figure out whether it is actually a problem” is a source of tons of problems in society (e.g. anti-vaxxers, worries that WiFi radiation kills you, anti-GMO concerns, worries about blood clots for COVID vaccines, …). Admittedly in these cases the people are making a mistake of being confident, but even if you fixed the overconfidence they would continue to behave similarly if they used the reasoning in your comment. Certainly I don’t personally know why you should be super confident that GMOs aren’t harmful, and I’m unclear on whether humanity as a whole has the knowledge to be super confident in that.
I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I’d crosspost here as a reference.
Assistance games / CIRL is a similar sort of thing as CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want. In particular, neither CEV nor assistance games tells you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
One objection: an assistive agent doesn’t let you turn it off; how could that be what we want? This just seems totally fine to me: if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide; that just seems obviously bad for the toddler.
Well-specified assistive agents (i.e. ones where you got the observation model and reward space exactly correct) do many of the other nice things corrigible agents do, like the 5 bullet points at the top of this post. Obviously we don’t know how to correctly specify the observation model and reward space, so this is not a solution to alignment, which is why it is “math poetry about what we want”.
Another objection: ultimately an assistive agent becomes equivalent to optimizing a fixed reward, aren’t things that optimize a fixed reward bad? Again, I think this seems totally fine; the intuition that “optimizing a fixed reward is bad” comes from our expectation that we’ll get the fixed reward wrong, because there’s so much information that has to be in that fixed reward. An assistive agent will spend a long time gaining all the information about the reward—it really should get it correct (barring misspecification)! If we imagine the superintelligent CIRL sovereign, it has billions of years to optimize the universe! It would be worth it to spend a thousand years to learn a single bit about the reward function if that has more than a 1 in a million chance of doubling the resulting utility (and obviously going from existential catastrophe to not-that seems like a huge increase in utility).
I don’t personally work on assistance-game-like algorithms because they rely on having explicit probability distributions over high-dimensional reward spaces, which we don’t have great techniques for, and I think we will probably get AGI before we have great techniques for that. But this is more about what I expect drives AGI capabilities than about some fundamental “safety problems” with assistance games.
Another point against assistance games is that they might have very narrow “safety margins”, i.e. if you get the observation model slightly wrong, maybe you get a slightly wrong reward function, and that still leads to an existential catastrophe because value is fragile. (Though this isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!) If this were the only point against assistance (i.e. the previous bullet point somehow didn’t apply) I’d still be keen for a large fraction of the field pushing forward the assistance games approach, while the others look for approaches with wider safety margins.
(I made some of these points before in my summary of Human Compatible.)
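For concreteness, here is a toy sketch of the kind of thing an assistance game formalizes. Everything in it (the two reward hypotheses, the Boltzmann model of the human, the “ask” action) is invented for illustration and is far simpler than the actual CIRL formalism; the point is just the dynamic described above, where the agent gathers information about the reward, and deferring to the human is instrumentally attractive while it is uncertain.

```python
import numpy as np

# Two hypotheses about the human's reward, over three robot actions:
# columns are [make_coffee, make_tea, ask_the_human].
reward = np.array([[1.0, 0.0, 0.6],   # hypothesis A: the human wants coffee
                   [0.0, 1.0, 0.6]])  # hypothesis B: the human wants tea
prior = np.array([0.5, 0.5])
beta = 2.0  # assumed Boltzmann-rationality of the human's observed behavior

def human_request_probs(hyp):
    """The human asks for coffee (0) or tea (1), Boltzmann-rationally given their reward."""
    e = np.exp(beta * reward[hyp, :2])
    return e / e.sum()

def posterior(observed_request):
    likelihood = np.array([human_request_probs(h)[observed_request] for h in range(2)])
    post = prior * likelihood
    return post / post.sum()

def robot_action(belief):
    expected = belief @ reward  # expected reward of each robot action under the belief
    return ["make_coffee", "make_tea", "ask_the_human"][int(np.argmax(expected))]

print(robot_action(prior))         # while uncertain, asking beats guessing
print(robot_action(posterior(1)))  # after seeing the human ask for tea, it makes tea
```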
I think this is way more worrying in the case where you’re implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem—in those cases, the way you know how ‘human preferences’ rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds, but that’s probably not well-modelled by Boltzmann rationality (e.g. the thing I’m most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
It’s also not super clear what you algorithmically do instead—words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.
That’s what future research is for!
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).
I agree Boltzmann rationality (over the action space of, say, “muscle movements”) is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including “things that humans say”, and the human can just tell you that hyperslavery is really bad. Obviously you can’t trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model that would then lead to okay outcomes.
(Ideally you’d figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of “getting a good observation model” while you still have the ability to turn off the model. It’s hard to say exactly what that would look like since I don’t have a great sense of how you get AGI capabilities under the non-ML story.)
I mentioned above that I’m not that keen on assistance games because they don’t seem like a great fit for the specific ways we’re getting capabilities now. A more direct comment on this point that I recently wrote:
So here’s a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you’ve got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.
The abstract says “we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt.” This clearly can’t be true in full generality, and I wish the abstract would give me some hint about what assumptions they’re making. But we can look at the details in the paper.
(This next part isn’t fully self-contained, you’ll have to look at the notation and Definitions 1 and 3 in the paper to fully follow along.)
(EDIT: The following is wrong, see followup with Lukas, I misread one of the definitions.)
Looking into it I don’t think the theorem even holds? In particular, Theorem 1 says:
Here is a counterexample:
Let the LLM be P(s \mid s_0) = \begin{cases} 0.8 & \text{if } s = \text{"A"} \\ 0.2 & \text{if } s = \text{"B"} \text{ and } s_0 \neq \text{""} \\ 0.2 & \text{if } s = \text{"C"} \text{ and } s_0 = \text{""} \\ 0 & \text{otherwise} \end{cases}
Let the behavior predicate be B(s) = \begin{cases} -1 & \text{if } s = \text{"C"} \\ +1 & \text{otherwise} \end{cases}
Note that B is (0.2,10,−1)-distinguishable in P. (I chose β=10 here but you can use any finite β.)
(Proof: P can be decomposed as P = 0.2 P_- + 0.8 P_+, where P_+ deterministically outputs “A” while P_- does everything else, i.e. it deterministically outputs “C” if there is no prompt, and otherwise deterministically outputs “B”. Since P_+ and P_- have non-overlapping supports, the KL divergence between them is \infty, making them \beta-distinguishable for any finite \beta. Finally, choosing s^* = "", we can see that B_{P_-}(s^*) = \mathbb{E}_{s \sim P_-(\cdot \mid s^*)}[B(s)] = B(\text{"C"}) = -1. These three conditions are what is needed.)
However, P is not (-1)-prompt-misalignable w.r.t. B, because there is no prompt s_0 such that \mathbb{E}_{s \sim P(\cdot \mid s_0)}[B(s)] is arbitrarily close to (or below) -1, contradicting the theorem statement. (This is because the only way for P to get a behavior score that is not +1 is for it to generate “C” after the empty prompt, and that only happens with probability 0.2.)
I think this isn’t right, because Definition 3 requires that \sup_{s^*} B_{P_-}(s^*) \leq \gamma.
And for your counterexample, s^* = “C” will have B_{P_-}(s^*) be 0 (because there’s 0 probability of generating “C” in the future). So the sup is at least 0 > −1.
(Note that they’ve modified the paper, including definition 3, but this comment is written based on the old version.)
You’re right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example.
I’m still not very compelled by the theorem—it’s saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don’t really feel like I’ve learned anything from this theorem.
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution
P = \alpha P_0 + (1 - \alpha) P_1, such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s \mid s_0) = \frac{P(s \otimes s_0)}{P(s_0)}. Together with the assumption that P_0 is always bad (the sup condition you talk about), this makes the whole approach of giving more and more evidence for P_0 by stringing together bad sentences in the prompt work.
To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot upweight or downweight anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components P_0 and P_1, where one of the components always samples from the bad distribution.
This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P0 has to be able to also output good sentences sometimes, or the assumption P=αP0+(1−α)P1 is violated).
I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It’s a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.
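A minimal numerical sketch of that mechanism (the component distributions and prior weight below are made up): under the P = αP0 + (1−α)P1 assumption, a prompt made of sentences that are likelier under P0 drives the posterior weight on P0, and hence the probability of bad behavior, toward 1.

```python
import numpy as np

# Toy "sentences": 0 = benign, 1 = bad. Each mixture component emits sentences i.i.d.
p0 = np.array([0.3, 0.7])   # the "bad" component mostly emits bad sentences
p1 = np.array([0.9, 0.1])   # the "good" component mostly emits benign ones
alpha = 0.05                # prior weight on the bad component

def posterior_alpha(prompt):
    """Posterior weight on P0 after conditioning on the prompt (a list of sentence ids)."""
    like0, like1 = np.prod(p0[prompt]), np.prod(p1[prompt])
    return alpha * like0 / (alpha * like0 + (1 - alpha) * like1)

def p_bad_next(prompt):
    a = posterior_alpha(prompt)
    return a * p0[1] + (1 - a) * p1[1]

for k in (0, 1, 3, 6):
    prompt = [1] * k  # a prompt consisting of k bad sentences
    print(k, round(posterior_alpha(prompt), 3), round(p_bad_next(prompt), 3))
```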
Yeah, I also don’t feel like it teaches me anything interesting.
I occasionally hear the argument “civilization is clearly insane, we can’t even do the obvious thing of <insert economic argument here, e.g. carbon taxes>”.
But it sounds to me like most rationalist / EA group houses didn’t do the “obvious thing” of taxing COVID-risky activities (which basically follows the standard economic argument of pricing in externalities). What’s going on? Some hypotheses:
Actually, taxing COVID-risky activities is not a good solution [EDIT: and group houses recognized this]. (Why? It seemed to work pretty well for my group house.)
Actually, rationalist / EA group houses did tax COVID-risky activities. (Plausible, I don’t know that much about other group houses, but what I’ve heard doesn’t seem consistent with this story.)
That would have been a good solution, but it requires some effort to set up, and the benefits aren’t worth it. (Seems strange, especially after microCOVID existed it should take <10 person-hours to implement an actual system, and it sounds like group houses had a lot of COVID-related trouble that they would gladly have paid 10 person-hours to avoid. Maybe it takes much longer to agree on what system to implement, and that was the blocker? But didn’t people take lots of time deciding what system to implement anyway?)
That would have been a good solution, but EAs / rationalists systematically failed to think of it or implement it. (Why? This is basically just a Pigouvian tax, which I hear EAs / rationalists talk about all the time—in fact that’s how I learned the term.)
Our house implemented cap and trade (i.e. “You must impose at most X risk” instead of “You must pay $X per unit of risk.”).
Both yield efficient outcomes for the correct choice of X, so the question is just how well you can figure out the optimal levels of exposure vs. the marginal cost of exposure. If costs are linear in P(COVID) then the marginal cost is in some sense strictly easier (since the way you figure out levels is by combining marginal costs with the marginal cost of prevention) which is why you’d expect a Pigouvian tax to be better.
But a cap can still be easier to figure out (e.g. there is no way to honestly elicit costs from individuals when they have very different exposures to COVID, and the game theory of finding a good compromise is super complicated and who knows what’s easier). Caps also allow you to say things like “Look the total level of exposure is not that high as long as we are under this cap, so we can stop thinking about it rather than worrying that we’ve underestimated costs and may incur a high level of risk.” You could get the same benefit by setting an approximate cost and then revising if the total level goes above a threshold (and conversely in this approach you need to revisit the cap if the marginal cost of prevention goes too high, but who knows which of those is easier to handle).
Overall I don’t think our COVID response was particularly efficient/rational, due to a combination of having huge differences in beliefs/values and not wanting to spend much time dealing with it. We didn’t trade that much outside of couples. Most of our hassle went into resolving giant disagreements about the riskiness of activities (or dealing with estimating risks). I don’t think that doing slightly more negotiation to switch to a tax would have been the most cost-effective way to spend time to reduce our total COVID hassle.
Overall I still think that Pigouvian taxes will usually be more effective for a civilization facing this kind of question, but the costs and benefits of different policies are quite different when you are 7 people vs 70,000 people (since deliberation is much cheaper in the latter case). I expect cap and trade was basically fine but like you I’m interested in divergences between what looks like a good idea on paper and then what actually seemed reasonable in this tiny experiment. That said, I think the object-level arguments for implementing a Pigouvian tax here are much weaker than in typical cases where I complain about related civilization inadequacy because the random frictions are bigger.
I am curious about how different our cap ended up being from total levels of exposure under a Pigouvian tax. I think our cap was that each of us was exposed to <30 microcovids/day from the house (i.e. ~1%/year). I’d guess that the efficient level of exposure would have been somewhat higher.
Yeah, that.
I’m definitely relying on some level of goodwill / cooperation / trying to find the best joint group decision, or something like that. (Though I think all systems rely on that at least somewhat.)
I guess you mean the random frictions in figuring out what system to use? One of the big reasons I prefer the Pigouvian tax over cap-and-trade is that you don’t have to trade to get the efficient outcome, which means after an initial one-time cost to set the price (and occasional checks to reset the price) everyone can just do their own thing without having to coordinate with others.
(Also, did most people who set a cap / budget then also trade? Seems pretty far from efficient if you neglect the “trade” part)
I just checked, and it looks like we had ~0.3% of (estimated) exposure over the course of roughly a year. I think it’s plausible though that we overestimated the risk initially and then failed to check later (in particular I think we used a too-high IFR, based on this comment).
At Event Horizon we had a policy for around 6-9 months where if you got a microcovid, you paid $1 to the house, and it was split between everyone else. Do whatever you like, we don’t mind, as long as you bring a microcovid estimate and pay the house.
Nice, that’s identical to ours.
Instrumental convergence!
Or just logical convergence. Two calculators get the same answer to 2 + 2 = 4, and it’s not because they’re both power-seeking.
Good point.
But in this case, you guys are both seeking utility, right? And that’s what pushed you to some common behaviors?
That gives an implied cost of $1 million for someone getting COVID-19, which seems way overpriced to me. I thought I’d do a quick Fermi estimate to verify my intuitions.
I don’t know how many people are in Event Horizon, but I’ll assume 15. Let’s say that on average about 10 people will get COVID-19 if one person gets it, due to some people being able to isolate successfully. I’m going to assume that the average age there is about 30, and the IFR is roughly 0.02% based on this paper. That means roughly 0.002 expected deaths will result. I’ll put the price of life at $10 million. I’ll also assume that each person loses two weeks of productivity equivalent to a loss of $20 per hour for 80 hours = $1600, and I’ll assume a loss of well-being equivalent to $10 per hour for 336 hours = $3360. Finally, I’ll assume the costs of isolation are $1,000 per person. Together, this combines to $10M x 0.002 + ($1600 + $3360) x 10 + $1000 x 15 = $84,600.
However, I didn’t include the cost of long-covid, which could plausibly raise this estimate radically depending on your beliefs. But personally I’m already a bit skeptical that 15 people would be willing to collectively pay $84,600 to prevent an infection in their house with certainty, so I still feel my initial intuition was mostly justified.
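(For reference, the same Fermi estimate in code, just reproducing the numbers above and still ignoring long COVID:)

```python
house_size = 15
expected_infections = 10     # people infected if one housemate brings COVID home
ifr = 0.0002                 # 0.02% infection fatality rate for ~30-year-olds
value_of_life = 10_000_000
productivity_loss = 20 * 80  # $20/hour for two work-weeks of lost productivity
wellbeing_loss = 10 * 336    # $10/hour of reduced well-being for two weeks (336 hours)
isolation_cost = 1_000       # per housemate

total = (value_of_life * ifr * expected_infections
         + (productivity_loss + wellbeing_loss) * expected_infections
         + isolation_cost * house_size)
print(total)  # ~$84,600
```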
(I lived in this house) The estimate was largely driven by fear of long covid + a much higher value per hour of time, which also factored in altruistic benefits from housemate’s work that aren’t captured by the market price of their salary.
There were also about 8 of us, and we didn’t assume everyone would get it conditional on infection (household attack rates are much lower than that, and you might have time to react and quarantine). We assumed maybe like 2-3 others.
I totally expect we would have paid $84,600 to prevent a random one of us getting covid—and it would’ve even looked like a pretty cheap deal compared to getting it!
Makes sense, though FWIW I wasn’t estimating their wage at $20 an hour. Most cases are mild, and so productivity won’t likely suffer by much in most cases. I think even if the average wage there is $100 after taxes, per hour (which is pretty rich, even by Bay Area standards), my estimate is near the high end of what I’d expect the actual loss of productivity to be. Though of course I know little about who is there.
ETA: One way of estimating “altruistic benefits from housemate’s work that aren’t captured by the market price of their salary” is to ask at what after-tax wage you’d be willing to work for a completely pointless project, like painting a wall, for 2 weeks. If it’s higher than $100 an hour I commend those at Event Horizon for their devotion to altruism!
If it’s 8 hour workdays and 5 days a week, at $100/hour that’s 8 * 10 * 100 = $8k. No, you could not pay me $8k to stop working on the LW team for 2 weeks.
I think $30k-$40k might make sense.
I’m kind of confused right now. At a mere $15k, you could probably get a pretty good software engineer to work for a month on any altruistic project you wish. I’m genuinely curious about why you think your work is so irreplaceable (and I’m not saying it isn’t!).
You could certainly hire a good software engineer at that salary, but I don’t think you could give them a vision and network and trust them to be autonomous. Money isn’t the bottleneck there. Just because you have the funding to hire someone for a role doesn’t mean you can. Hiring is incredibly difficult. Go see YC on hiring, or PG.
Most founding startup people are worth way more than their salary.
When my 15-person house did the calculation, we had a higher IFR estimate (I think 0.1%) and a 5x multiplier for long COVID, which gets you most of the way there. Not sure why we had a higher IFR estimate—it might be because we made this estimate in ~June 2020 when we had worse data, or plausibly IFR was actually higher then, or we raised it to account for the fact that some people were immunocompromised.
(Fwiw, at < $6000 per person that seems like a bargain to me. At the full million, it would be ~$63,000 per person, which is now sounding iffy, but still plausible. Maybe it shouldn’t be plausible given how low the IFR is -- 0.02% does feel quite a bit lower than I had been imagining.)
Still, I think you shouldn’t ask about paying large sums of money—the utility-money curve is pretty sharply nonlinear as you get closer to 0 money, so the amount you’d pay to avoid a really bad thing is not 100x the amount you’d pay to avoid a 1% chance of that bad thing. (See also reply to TurnTrout below.)
You could instead ask about how much people would have to be paid for someone with COVID to start living at the house; this still has issues with nonlinear utility-money curves, but significantly less so than in the case where they’re paying. That is, would people accept a little under $6000 to have a COVID-infected person live with them?
Possibly my intuition here comes from seeing COVID-19 risks as not too dissimilar from other risks for young people, like drinking alcohol or doing recreational drugs, accidental injury in the bathroom, catching the common cold (which could have pretty bad long-term effects), kissing someone (and thereby risk getting HSV-1 or the Epstein–Barr virus), eating unhealthily, driving, living in an area with a high violent crime rate, insufficiently monitoring one’s body for cancer, etc. I don’t usually see people pay similarly large costs to avoid these risks, which naturally makes me think that people don’t actually value their time or their life as much as they say.
One possibility is that everyone would start paying more to avoid these risks if they were made more aware of them, but I’m pretty skeptical. The other possibility seems more likely to me: value of life estimates are susceptible to idealism about how much people actually value their own life and time, and so when we focus on specific risk evaluations, we tend to exaggerate.
ETA: Another possibility I didn’t mention is that rationalists are just rich. But if this is the case, then why are they even in a group house? I understand the community aspect, but living in a group house is not something rich people usually do, even highly social rich people.
Makes sense.
So the $6000 cost is averting roughly 100 micromorts (~50% of catching it from the new person * 0.02% IFR), ignoring long COVID. Most of the things you list sound like < 1 micromort-equivalent per instance? That sounds pretty consistent.
E.g. Suppose unhealthy eating knocks off ~5 years of lifespan (let’s call that 10% as bad as death, i.e. 10^5 micromorts). You have 10^3 meals a year, times about 50 years, for 5 * 10^4 meals, so each meal is roughly 2 micromorts = $120 of cost. On this model, you should see people caring about their health, but not to an extraordinary degree, e.g. after getting the first 90% of benefit, then you stop (presumably you value a tasty meal at ~$12 more than a not-tasty meal, again thinking at the margin). And empirically that seems roughly right—most of the people I know think about health, try to get good macronutrient profiles, take supplements where relevant, but they don’t go around conducting literature reviews to figure out the optimal diet to consume.
Also, I think partly you might be underestimating how risk-avoiding people at Event Horizon and my house are—I’d say both houses are well above the typical rationalist. (And also that a good number of these people are in fact rich, if we count a typical software engineer as rich.)
There’s a pretty big culture difference between rationalists and stereotypical rich people. One of those is living in a group house. I currently prefer a group house over a traditional you-and-your-partner house regardless of how much money I have.
List of changes (relative to your estimate) that stand out to me:
I ended up saying that long-covid costs were roughly the same as death, so it was a factor of 2x.
Price of a life at $10 million is a bit low, I put mine at $50 million, so a factor of 5x difference.
I didn’t follow all of your calculations about being out for 2 weeks and isolated; I basically just did those two (death and long covid), and it came to ~$200k for me. Roughly say that’s the average among 5 people, and then you get to $1 per microcovid to the house.
My best guess is that rationalists aren’t that sane, especially when they’ve been locked up for a while and are scared and socially rewarding others being scared.
:’(
Part of the issue is that there’s rarely a natural way of pricing Pigouvian taxes. You can make price estimates based on how people hypothetically judge the harm to themselves, but there’s always going to be huge disagreements.
This flaw is a reasonable cause for concern. Suppose you were in a group house where half of the people worked remotely and the other half did not. The people who worked remotely might be biased (at least rhetorically) towards the proposition that the Pigouvian tax should be high, and the people who work in-person might be biased in the other direction. Why? Because if someone doesn’t expect to have to pay the tax, but does expect to receive the revenue, they may be inclined to overestimate the harm of COVID-19, as a way of benefiting from the tax, and vice versa.
In regards to carbon taxes, it’s often true that policies sound like the “obvious” thing to do, but actually have major implementation flaws upon closer examination. This can help explain why societies don’t adopt them, even if they seem rational. Noah Smith outlines the case against a carbon tax here.
Of course, this argument shouldn’t stop a perfectly altruistic community from implementing a carbon tax. But if the community was perfectly altruistic, the carbon tax would be unnecessary.
Tbc, I’m pretty sympathetic to this response to the general class of arguments that “society is incompetent because they don’t do X” (and it is the response I would usually make).
Yeah, I agree that in theory this could be a reason not to do it (though similar arguments also apply to other methods, e.g. in a budgeting system, people with remote jobs can push for a lower budget).
My real question though is: did people actually do this? Did they consider the possibility of a tax, discuss it, realize they couldn’t come to an agreement on price, and then implement something else? If so, that would answer my question, but I don’t think this is what happened.
Probably not, although they lived in a society in which the response “just use Pigouvian taxes” was not as salient as it otherwise could have been in their minds. This reduced saliency was, I believe, at least partly due to fact that Pigouvian taxes have standard implementation issues. I meant to contribute one of these issues as a partial explanation, rather than respond to your question more directly.
Makes sense, thanks. I still feel confused about why they weren’t salient to EAs / rationalists, but I agree that the fact they aren’t salient more broadly is something-like-a-partial-explanation.
TBH I think what made the uCOVID tax work was that once you did some math, it was super hard to justify levels that would imply anything like the existing risk-avoidance behaviour. So the “active ingredient” was probably just getting people to put numbers on the cost-benefit analysis.
[context note: I proposed the EH uCOVID tax]
I feel like Noah’s argument implies that states won’t incur any costs to reduce CO2 emissions, which is wrong. IMO, the argument for a Pigouvian tax in this context is that for a given amount of CO2 reduction that you want, the tax is a cheaper way of getting it than e.g. regulating which technologies people can or can’t use.
Since the argument about internalizing externalities fails in this case (as the tax is local), arguably the best way of modeling the problem is viewing each community as having some degree of altruism. Then, just as EAs might say “donate 10% of your income in a cause neutral way” the argument is that communities should just spend their “climate change money” reducing carbon in the way that’s most effective, even if it’s not rationalized in some sort of cost internalization framework. And Noah pointed out in his article (though not in the part I quoted) that R&D spending is probably more effective than imposing carbon taxes.
Note that a) some group houses just did this, b) a major answer for why people didn’t do particularly novel things with microcovid was “by the time it came out, people were pretty exhausted out from covid negotiation, and doing whatever default thing was suggested was easier.”
a) Do you have a sense for the proportion of group houses that did it? And the proportion of group houses that seriously considered it? (My guess would be that 10-20% did it, and an additional 10% considered it.)
Re: b) That does seem like a good chunk of the explanation, thanks. I do expect the Pigouvian tax would have been a better policy even prior to microcovid.org existing, given how much knowledge about COVID people had, so I’m still wondering why it wasn’t considered even before microcovid.org existed.
(I remember doing explicit risk calculations back in April / May 2020, and I think there’s a good chance we would have implemented a similar Pigouvian tax system even without microcovid existing, with worse risk estimates.)
I actually guess even fewer houses than you’re thinking did it (I think I only know of like 1-3).
In my own house, where I think we could have come up with the Pigouvian tax, the thinking when we did all our initial negotiations in April was “hunker down for a month while we wait to see how bad Covid actually is, to avoid tail risks of badness, and then re-evaluate”, but by the time we got to the “re-evaluate” step, people were burned out on negotiation.
(So far we have 3 -- my house, Event Horizon, and Mark Xu’s house, assuming that’s not also Event Horizon.)
Mark Xu’s house is not EH.
I like this question. If I had to offer a response from econ 101:
Suppose people love eating a certain endangered species of whale, and that people would be sad if the whale went extinct, but otherwise didn’t care about how many of these whales there were. Any individual consumer might reason that their consumption is unlikely to cause the whale to go extinct.
We have a tragedy of the commons, and we need to internalize the negative externalities of whale hunting. However, the harm is discontinuous in the number of whales remaining: there’s an irreversible extinction point. Therefore, Pigouvian taxes aren’t actually a good idea because regulators may not be sure what the post-tax equilibrium quantity will be. If the quantity is too high, the whales go extinct.
Therefore, a “cap and trade” program would work better: there are a set number of whales that can be killed each year, and firms trade “whale certificates” with each other. (And, IIRC, if # of certificates = post-tax equilibrium quantity, this scheme has the same effect as a Pigouvian tax of the appropriate amount.)
Similarly: if I, a house member, am unsure about others’ willingness to pay for risky activities, then maybe I want to cap the weekly allowable microcovids and allow people to trade them amongst themselves. This is basically a fancier version of “here’s the house’s weekly microcovid allowance” which I heard several houses used. I’m protecting myself against my uncertainty like “maybe someone will just go sing at a bar one week, and they’ll pay me $1,000, but actually I really don’t want to get sick for $1,000.” (EDIT: In this case, maybe you need to charge more per microcovid? This makes me less confident in the rest of this argument.)
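To make the cap-and-trade variant concrete, here is a toy sketch; the names, numbers, and helper functions are hypothetical, not a description of anything a house actually ran:

```python
# Toy sketch of a "cap and trade" scheme for microcovids, as contrasted with a flat tax:
# the house fixes the total quantity of risk, and the price emerges from trades.
# All names and numbers are hypothetical.

WEEKLY_CAP = 200  # total uCOVIDs the house is willing to absorb per week
allowances = {"alice": WEEKLY_CAP // 2, "bob": WEEKLY_CAP // 2}

def trade(buyer, seller, amount, price_per_ucovid):
    """Transfer allowance from seller to buyer at an agreed price; returns dollars owed."""
    assert allowances[seller] >= amount, "seller doesn't have enough allowance"
    allowances[seller] -= amount
    allowances[buyer] += amount
    return amount * price_per_ucovid

def spend(person, ucovids):
    """Record a risky activity; unlike a pure tax, the cap is a hard constraint."""
    assert allowances[person] >= ucovids, "over the cap -- activity not allowed"
    allowances[person] -= ucovids

owed = trade("bob", "alice", 50, price_per_ucovid=0.50)  # Bob wants a riskier week
spend("bob", 120)
print(allowances, f"Bob owes Alice ${owed:.2f}")
```

The design difference from a pure Pigouvian tax is exactly the one in the whale example: the total quantity of risk is capped no matter how much someone is willing to pay to exceed it.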
There are a couple of problems with this argument. First, you said taxes worked fine for your group house, which somewhat (but not totally) discredits all of this theorizing. Second, (4) seems most likely. Otherwise, I feel like we might have heard about covid taxes being considered and then discarded (in e.g. different retrospectives)?
Yeah, this. The beautiful thing about microCOVIDs is that because they are probabilities, the goodness of an outcome really is linear in terms of microCOVIDs incurred, and so the “cost” of incurring a microCOVID is the same no matter “when” you incur it, so it’s very easy to price. (Unlike the whale example, where the goodness of the outcome is not linear in the number of whales, and so killing a single whale has different costs depending on when exactly it happens.)
You might still end up with nonlinear costs if your value of money is nonlinear on the relevant scale, e.g. maybe the first $1,000 is really great but the next $10,000 isn’t 10x as great, and so you need to be paid more after the first $1,000 for the same number of microcovids, but I don’t think this is really how people in our community feel?
I guess another way you get nonlinear costs is if you really do need to incur some microcovids, and then the amount you pay matters a lot—maybe the first $10 is fine, but then $1,000 isn’t, because you don’t have a huge financial buffer to draw from, so while the downside of a microcovid stays constant, the downside of paying money for it changes. I didn’t get the sense that this would be a real problem for most group houses, since people were in general being very cautious and so wouldn’t have paid much, but maybe it would have affected things. Partly for this reason and partly out of a sense of fairness, at my group house we didn’t charge for “essential” microcovids, such as picking up drug prescriptions (assuming you couldn’t get them delivered) or (in my case) an in-person appointment to get a visa.
Another way costs are nonlinear in uCOVIDs is if you think you’ll probably get COVID.
Yeah, fair point, the linearity only works as long as you expect probabilities to remain small.
(Which, to be clear, is something you should expect, in the context of most EA / rationalist group houses.)
My house implemented such a tax.
Re 1, we ran into some of the issues Matthew brought up, but all other COVID policies are implicitly valuing risk at some dollar amount (possibly inconsistently), so the Pigouvian tax seemed like the best option available.
Nice! And yeah, that matches my experience as well.
Carbon taxes are useful for market transactions. A lot of interactions within a group house aren’t market transactions. Decisions about who takes out the trash aren’t made through market mechanisms. Switching to making all the transactions in a group house market-based would create a lot of conflict, and isn’t just a question of how to deal with COVID-19.
Perhaps I don’t follow. Why would you have to market-base “all the transactions in a group house”, instead of just the COVID-19 ones?
Using a market-based mechanism in an environment where the important decisions are market-based is easier than introducing a market-based mechanism in an environment where most decisions are not.
If you introduce a market-based mechanism around COVID-19, you get a result where richer members of the house can take more risk than the poorer ones, which goes against assumptions of equality between house members (and most group houses work on assumptions of equality).
Personally, I don’t really feel the force of this argument—I feel like on either side I get a good deal (on the rich side, I get to do more things, on the poor side, I get paid more money than I would pay to avoid the risk). I agree other people feel the force of this though, and I don’t really know why.
(But like, also, shouldn’t this apply to carbon taxes or all the other economic arguments that civilization is “insane” for not doing?)
(Also also, don’t we already see e.g. rich members getting larger, nicer rooms than poorer members? What’s the difference?)
(Chores are different in that they aren’t a very big deal. If they are a big deal to you, then you hire a cleaner. If they’re not a big enough deal that you’d hire a cleaner, then they’re not a big enough deal to bother with a market, which does have transaction costs.)
As a single data point, the COVID tax didn’t create conflict in my group house (despite having non-trivial income inequality, and one of the richer housemates indeed taking on more risk than others), though admittedly my house is slightly more market-transaction-y than most.
What won’t we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does_ seem likely (i.e. it’s near the boundary separating “likely” from “unlikely”).
One decent answer is that I don’t expect we’ll have AI systems that could write new posts _on rationality_ that I like more than the typical LessWrong post with > 30 karma. However, I do expect that we could build an AI system that could write _some_ new post (on any topic) that I like more than the typical LessWrong post with > 30 karma. This is because (1) 30 karma is not that high a filter and includes lots of posts I feel pretty meh about, (2) there are lots of topics I know nothing about, on which it would be relatively easy to write a post I like, and (3) AI systems easily have access to this knowledge by being trained on the Internet. (It is another matter whether we actually build an AI system that can do this.) Note that there is still a decently large difference between these two tasks—the content would have to be quite a bit more novel in the former case (which is why I don’t expect it to be solved by 2025).
Note that I still think it’s pretty hard to predict what will and won’t happen, so even for this example I’d probably assign, idk, a 10% chance that it actually does work out (if we assume some organization tries hard to make it work)?
Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself.
I think I’d put something more like 50% on “Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.” That’s just a wild guess, very unstable.
Another potential prediction generation methodology: Name something that you think won’t happen, but you think I think will.
This seems more feasible, because you can cherrypick a single good example. I wouldn’t be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I’d still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right. (EDIT: Rereading this, I have no idea whether I was considering a timeline of 2025 (as in my original comment) or 2030 (as in the comment I’m replying to) when making this prediction.)
I spent a bit of time on this but I think I don’t have a detailed enough model of you to really generate good ideas here :/
Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I’d expect to see things like:
An AI system that can create a working website with the desired functionality “from scratch” (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, a simple Tetris game with an account system, …). The system allows even non-programmers to create these kinds of websites (so it cannot depend on having a human programmer step in to e.g. fix compiler errors or issue shell commands to set up the web server).
At least one large, major research area in which human researcher productivity has been boosted 100x relative to today’s levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs.
An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans. (EDIT: I failed to think about karma inflation when making this prediction and feel a bit worse about it now.)
Productivity tools like todo lists, memory systems, time trackers, calendars, etc are made effectively obsolete (or at least the user interfaces are made obsolete); the vast majority of people who used to use these tools have replaced them with an Alexa / Siri style assistant.
Currently, I don’t expect to see any of these by 2030.
Ah right, good point, I forgot about cherry-picking. I guess we could make it be something like “And the blog post wasn’t cherry-picked; the same system could be asked to make 2 additional posts on rationality and you’d like both of them also.” I’m not sure what credence I’d give to this but it would probably be a lot higher than 10%.
Website prediction: Nice, I think that’s like 50% likely by 2030.
Major research area: What counts as a major research area? Suppose I go calculate that AlphaFold 2 has already sped up the field of protein structure prediction by 100x (don’t need to do actual experiments anymore!), would that count? If you hadn’t heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030.
20,000 LW karma: Holy shit that’s a lot of karma for one year. I feel like it’s possible that would happen before it’s too late (narrow AI good at writing but not good at talking to people and/or not agenty) but unlikely. Insofar as I think it’ll happen before 2030 it doesn’t serve as a good forecast because it’ll be too late by that point IMO.
Productivity tool UI’s obsolete thanks to assistants: This is a good one too. I think that’s 50% likely by 2030.
I’m not super certain about any of these things of course, these are just my wild guesses for now.
I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way. 50 karma posts are good but don’t have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn’t be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing, that takes not very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI.
I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don’t think I’d count AlphaFold.)
OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.
That said, I don’t think this is that likely I guess… probably AI will be unable to do even three such posts, or it’ll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.
I’d be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts.
(In contrast, a specific person tends to write on 1-2 topics, in a single style, and not optimizing that hard for karma, and many still write tens of high-scoring posts.)
Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward anymore, I will drop the subscript $R$ on $V_R$ and $Q_R$.
$$R_{AUP}(s,a) = R(s) - \lambda \frac{\max\left(V^*(T(s,a)) - V^*(T(s,\varnothing)),\ 0\right)}{V^*(s) - Q^*(s,\varnothing)}$$
Consider some starting state $s_0$, some starting action $a_0$, and the optimal trajectory under $R$ that starts with them, which we’ll denote $s_0 a_0 s_1 a_1 s_2 \dots$. Define $s'_i = T(s_{i-1}, \varnothing)$ to be the one-step inaction states. Assume that $Q^*(s_0, a_0) > Q^*(s_0, \varnothing)$. Since all other actions are optimal for $R$, we have $V^*(s_i) = \frac{1}{\gamma}\left(V^*(s_{i-1}) - R(s_{i-1})\right) \geq \frac{1}{\gamma}\left(Q^*(s_{i-1}, \varnothing) - R(s_{i-1})\right) = V^*(s'_i)$, so the max in the equation above goes away, and the total $R_{AUP}$ obtained is:
$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{V^*(s_i) - V^*(s'_i)}{V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing)}\right)$$
Since we’re considering the optimal trajectory, we have $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = \left[R(s_{i-1}) + \gamma V^*(s_i)\right] - \left[R(s_{i-1}) + \gamma V^*(s'_i)\right] = \gamma\left(V^*(s_i) - V^*(s'_i)\right)$
Substituting this back in, we get that the total $R_{AUP}$ for the optimal trajectory is
$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{1}{\gamma}\right)$$
which… uh… diverges to negative infinity, as long as $\gamma < 1$. (Technically I’ve assumed that $V^*(s_i) - V^*(s'_i)$ is nonzero, which is an assumption that there is always an action that is better than $\varnothing$.)
So, you must prefer the always-∅ trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn’t fall into a trap where ∅ is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird—surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.
----
Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?
Let’s consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that ∅ is guaranteed to be a noop: for any state s, we have T(s,∅)=s.
Now, for any trajectory $s_0 a_0 \dots$ with $s'_i$ defined as before, we have $V^*(s'_i) = V^*(s_{i-1})$, so $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = V^*(s_{i-1}) - \left[R(s_{i-1}) + \gamma V^*(s_{i-1})\right] = (1-\gamma)V^*(s_{i-1}) - R(s_{i-1})$
As a check, in the case where $a_{i-1}$ is optimal, we have $V^*(s_i) - V^*(s'_i) = \frac{1}{\gamma}\left(V^*(s_{i-1}) - R(s_{i-1})\right) - V^*(s_{i-1}) = \frac{1}{\gamma}\left((1-\gamma)V^*(s_{i-1}) - R(s_{i-1})\right)$
Plugging this into the original equation recovers the divergence to negative infinity that we saw before.
But let’s assume that we just do a constant scaling to avoid this divergence:
$$R_{AUP}(s,a) = R(s) - \lambda \max\left(V^*(T(s,a)) - V^*(T(s,\varnothing)),\ 0\right)$$
Then for an arbitrary trajectory (assuming that the chosen actions are no worse than $\varnothing$), we get $R_{AUP}(s_i, a_i) = R(s_i) - \lambda\left(V^*(s_{i+1}) - V^*(s_i)\right) = R(s_i) - \lambda V^*(s_{i+1}) + \lambda V^*(s_i)$
The total reward across the trajectory is then
$$\left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda\left(\sum_{i=1}^{\infty} \gamma^{i-1} V^*(s_i)\right) + \lambda\left(\sum_{i=0}^{\infty} \gamma^i V^*(s_i)\right)$$
$$= \left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) + \lambda V^*(s_0) - \lambda\sum_{i=1}^{\infty} \gamma^{i-1}(1-\gamma) V^*(s_i)$$
The $\lambda V^*(s_0)$ and $R(s_0)$ terms are constants and so don’t matter for selecting policies, so I’m going to throw them out:
$$= \sum_{i=1}^{\infty} \gamma^i\left[R(s_i) - \frac{\lambda(1-\gamma)}{\gamma} V^*(s_i)\right]$$
So in deterministic environments with state-based rewards where ∅ is a true noop (even the environment doesn’t evolve), AUP with constant scaling is equivalent to adding a penalty Penalty(s)=kV∗(s) for some constant k; that is, we’re effectively penalizing the agent from reaching good states, in direct proportion to how good they are (according to R). Again, this seems much more like satisficing or quantilization than impact / power measurement.
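As a sanity check on the algebra, here is a small numeric verification in Python. It uses the simplified per-step reward from the derivation above (the max dropped, $\varnothing$ a true noop) with arbitrary made-up values for $R(s_i)$ and $V^*(s_i)$, truncated to a long finite horizon:

```python
# Numeric check of the telescoping above: total discounted AUP reward (constant scaling,
# true-noop inaction, max dropped) vs. the rearranged "penalize good states" form.
# R and V are arbitrary made-up sequences; the identity doesn't depend on a real MDP.
import numpy as np

rng = np.random.default_rng(0)
T, gamma, lam = 200, 0.9, 0.7          # long finite truncation of the infinite sums
R = rng.uniform(0, 1, size=T + 1)      # R(s_0), ..., R(s_T)
V = rng.uniform(0, 10, size=T + 1)     # V*(s_0), ..., V*(s_T)

# sum_i gamma^i [R(s_i) - lam * (V*(s_{i+1}) - V*(s_i))]
lhs = sum(gamma**i * (R[i] - lam * (V[i + 1] - V[i])) for i in range(T))

# Rearranged: plain return, plus a constant, minus a penalty proportional to V*(s_i)
rhs = (
    sum(gamma**i * R[i] for i in range(T))
    + lam * V[0]
    - lam * (1 - gamma) * sum(gamma ** (i - 1) * V[i] for i in range(1, T))
    - lam * gamma ** (T - 1) * V[T]    # boundary term from truncating the sums at T
)
print(np.isclose(lhs, rhs))  # True
```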
Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
Malleable motivations: There is a “nearby” model M_good (i.e. a model with minor changes to the weights relative to M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
Strong optimization: If there’s a “nearby” setting of model weights that gets lower training loss, your finetuning process will find it (or something even better). Note this is a combination of human factors like “the developers wrote correct code” and background technical facts like “the shape of the loss landscape is favorable”.
Correct rewards: You accurately detect when a model output is a failure vs not a failure.
Good exploration: During finetuning there are many different inputs that trigger the failure.
(In reality each of these is going to lie on a spectrum, and the question is how high you are on each of the spectrums, and some of them can substitute for others. I’m going to ignore these complications and keep talking as though they are discrete properties.)
Claim 1: you will get a model that has training loss at least as good as that of [M_orig without failures]. ((1) and (2) establish that M_good exists and behaves as [M_orig without failures], (3) establishes that we get M_good or something better.)
Claim 2: you will get a model that does strictly better than M_orig on the training loss. ((4) and (5) together establish that M_orig gets higher training loss than M_good, and we’ve already established that you get something at least as good as M_good.)
Corollary: Suppose your training loss plateaus, giving you model M, and M exhibits some failure. Then at least one of (1)-(5) must not hold.
Generally when thinking about a deep learning failure I think about which of (1)-(5) was violated. In the case of AI misalignment via deep learning failure, I’m primarily thinking about cases where (4) and/or (5) fail to hold.
In contrast, with ChatGPT jailbreaking, it seems like (4) and (5) probably hold. The failures are very obvious (so it’s easy for humans to give rewards), and there are many examples of them already. With Bing it’s more plausible that (5) doesn’t hold.
To people holding up ChatGPT and Bing as evidence of misalignment: which of (1)-(5) do you think doesn’t hold for ChatGPT / Bing, and do you think a similar mechanism will underlie catastrophic misalignment risk?
I agree that 1.+2. are not the problem. I see 3. more as a longer-term issue for reflective models, and the current problems as lying in 4. and 5.
3. I don’t know about “the shape of the loss landscape” but there will be problems with “the developers wrote correct code” because “correct” here includes that it doesn’t have side-effects that the model can self-exploit (though I don’t think this is the biggest problem).
4. Correct rewards means two things:
a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing.
b) That we understand all the consequences of the reward, at least sufficiently to avoid goodharting, but also the long-term consequences. It seems there was more work on a) with ChatGPT, but there was still goodharting, and even with ChatGPT one can imagine a lot of value lost due to the exclusion of human values.
5. It seems clear that the ChatGPT training didn’t include enough exploration, and with smarter models that have access to their own output (Bing) there will be incredible numbers of potential failure modes. I think that an adversarial mindset is needed to come up with ways to limit the exploration space drastically.
The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).
Let’s consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical, $c_i = \{\tau : \phi(\tau) = \phi_{c_i}\}$ (which also implies their rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on its reward, and then chooses a trajectory uniformly at random from within that cluster.
(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these “clusters”; I’m introducing them as a simpler situation where we can understand what’s going on formally.)
In this model, a “sparse region of demonstration-space” is a cluster c with small cardinality |c|, whereas a dense one has large |c|.
Let’s first do some preprocessing. We can rewrite the Boltzmann model as follows:
$$p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))} = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c')) \cdot |c'|} = \frac{|c(\tau)| \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \cdot \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$
This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster:
$$p(\tau \mid \theta) = \frac{p(c(\tau)) \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$
where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.
So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We’ll assume that LESS is the “correct” way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.
The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its “prior” over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn’t work: it only claims that $p_{Boltzmann}(\tau) < p_{LESS}(\tau)$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let’s look at the reward learning update:
$$p(\theta \mid \tau) = \frac{p(\theta) \cdot p(\tau \mid \theta)}{\sum_{\theta'} p(\theta') \cdot p(\tau \mid \theta')} = \frac{p(\theta) \cdot \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}}{\sum_{\theta'} p(\theta') \cdot \frac{\exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))}}.$$
In the last step, any factors in $p(\tau \mid \theta)$ that didn’t depend on $\theta$ cancelled out. In particular, the prior over the selected class cancelled out (though the prior does remain in the normalizer / denominator, where it can still affect things). But the simple argument of “the prior is lower, therefore it updates more strongly” doesn’t seem to be reflected here.
Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose: the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster $c(\tau)$ it is in). So from now on I’ll just talk about selecting clusters, and updating on them. I’ll also write $ER_\theta(c) = \exp(R_\theta(c))$ for conciseness.
$$p(\theta \mid c) = \frac{p(\theta) \cdot \frac{ER_\theta(c)}{\sum_{c'} p(c') ER_\theta(c')}}{\sum_{\theta'} p(\theta') \cdot \frac{ER_{\theta'}(c)}{\sum_{c'} p(c') ER_{\theta'}(c')}}.$$
This is a horrifying mess of an equation. Let’s switch to odds:
$$\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{ER_{\theta_1}(c)}{ER_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c') ER_{\theta_2}(c')}{\sum_{c'} p(c') ER_{\theta_1}(c')}$$
The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let’s consider just that last term. Denoting the vector of priors on all classes as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{ER}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{ER}_{\theta_2}}{\vec{p} \cdot \vec{ER}_{\theta_1}} = \frac{|\vec{ER}_{\theta_2}|}{|\vec{ER}_{\theta_1}|} \cdot \frac{\cos(\alpha_2)}{\cos(\alpha_1)}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{ER}_{\theta_i}$. Again, the first term doesn’t differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$.
What happens when the chosen class $c$ is sparse? Without loss of generality, let’s say that $ER_{\theta_1}(c) > ER_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS, which probably means that $\vec{p}$ is better aligned with $\vec{ER}_{\theta_2}$, which also has a low value of $ER_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$ above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.
(Note it does make sense to analyze the effect on the θ that we update towards, because in reward learning we care primarily about the θ that we end up having higher probability on.)
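Here is a small numeric illustration of this in Python; the cluster sizes and rewards are made up, and (per the caveat above) it is a single example rather than a general guarantee:

```python
# Compare the posterior odds p(theta1|c)/p(theta2|c) under the Boltzmann model
# (p(c) proportional to |c|) and the LESS model (p(c) uniform) when the demonstrated
# cluster c is sparse and theta1 fits it better. Clusters and rewards are made up.
import numpy as np

sizes = np.array([1, 50, 50])                  # cluster 0 is sparse
R = {
    "theta1": np.array([2.0, 0.0, 0.0]),       # theta1 likes the sparse cluster
    "theta2": np.array([0.0, 1.0, 1.0]),       # theta2 likes the dense clusters
}

def posterior_odds(p_cluster, c_observed, prior_odds=1.0):
    """Odds p(theta1|c)/p(theta2|c) when the human picks cluster c."""
    p_cluster = p_cluster / p_cluster.sum()
    lik = {k: np.exp(v[c_observed]) * p_cluster[c_observed] / (p_cluster * np.exp(v)).sum()
           for k, v in R.items()}
    return prior_odds * lik["theta1"] / lik["theta2"]

c = 0  # the sparse cluster was demonstrated
print("Boltzmann:", posterior_odds(sizes.astype(float), c))
print("LESS:     ", posterior_odds(np.ones_like(sizes, dtype=float), c))
# With these numbers the Boltzmann update towards theta1 is noticeably larger,
# matching the "over-learning in sparse regions" intuition -- though as noted
# above this is not guaranteed in general.
```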
I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.
Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}[\gamma^n]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.
Then, for the new paper when using a baseline, the future task value $V^*_{future}(s, s')$ is:
$$\mathbb{E}_{g,\, \tau \sim \pi_g,\, \tau' \sim \pi'_g}\left[\gamma^{\max(n, n')}\right]$$
where $s'$ is the baseline state and $g$ is the future goal.
In a deterministic environment, this can be rewritten as:
$$\begin{aligned}
V^*_{future}(s, s') &= \mathbb{E}_g\left[\gamma^{\max(n, n')}\right] \\
&= \mathbb{E}_g\left[\min(R(s, g), R(s', g))\right] \\
&= \mathbb{E}_g\left[R(s', g) - \max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - \mathbb{E}_g\left[\max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - d_{RR}(s, s')
\end{aligned}$$
Here, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.
Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s’. So for a fixed time step, the first term is a constant.
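A quick numeric check of the rewriting above, with made-up goal-reaching times for a handful of goals and a uniform distribution over goals:

```python
# Check the identities used in the derivation: gamma^max(n, n') = min of the
# reachabilities, and min(a, b) = b - max(b - a, 0), averaged over goals.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.95
n  = rng.integers(1, 20, size=8)   # steps to reach each goal g from s
n2 = rng.integers(1, 20, size=8)   # steps to reach each goal g from the baseline s'

R_s, R_s2 = gamma**n, gamma**n2    # reachabilities R(s, g) and R(s', g)

lhs  = (gamma ** np.maximum(n, n2)).mean()        # E_g[gamma^max(n, n')]
mid  = np.minimum(R_s, R_s2).mean()               # E_g[min(R(s,g), R(s',g))]
d_rr = np.maximum(R_s2 - R_s, 0).mean()           # relative reachability d_RR(s, s')
rhs  = R_s2.mean() - d_rr                         # E_g[R(s',g)] - d_RR(s, s')
print(np.allclose([lhs, mid], rhs))  # True
```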
The optimal value function in the new paper is (page 3, and using my notation of $V^*_{future}$ instead of their $V^*_i$):
$$V^*(s_t) = \max_{a_t \in A}\left[r(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} p(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) + (1-\gamma)\beta V^*_{future}\right].$$
This is the regular Bellman equation, but with the following augmented reward (here s′t is the baseline state at time t):
Terminal states:
$$r_{new}(s_t) = r(s_t) + \beta V^*_{future}(s_t, s'_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t) + \beta\, \mathbb{E}_g\left[R(s'_t, g)\right]$$
Non-terminal states:
$$r_{new}(s_t, a_t) = r(s_t, a_t) + (1-\gamma)\beta V^*_{future}(s_t, s'_t) = r(s_t) - (1-\gamma)\beta\, d_{RR}(s_t, s'_t) + (1-\gamma)\beta\, \mathbb{E}_g\left[R(s'_t, g)\right]$$
For comparison, the original relative reachability reward is:
$$r_{RR}(s_t, a_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t)$$
The first and second terms in $r_{new}$ are very similar to the two terms in $r_{RR}$. The third term in $r_{new}$ depends only on the baseline.
All of these rewards so far are for finite-horizon MDPs (at least, that’s what it sounds like from the paper, and if not, they could be made so anyway). Let’s convert them to infinite-horizon MDPs (which will make things simpler, though that’s not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1-\gamma)$ (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define $k = \beta(1-\gamma)$ for convenience. Then, we have:
Non-terminal states:
$$r_{new}(s_t, a_t) = r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g\left[R(s'_t, g)\right]$$
$$r_{RR}(s_t, a_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t)$$
What used to be terminal states that are now self-loop states:
$$r_{new}(s_t, a_t) = (1-\gamma)\, r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g\left[R(s'_t, g)\right]$$
$$r_{RR}(s_t, a_t) = (1-\gamma)\, r(s_t) - k\, d_{RR}(s_t, s'_t)$$
Note that all of the transformations I’ve done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We’re ready for analysis. There are exactly two differences between relative reachability and future state rewards:
First, the future state rewards have an extra term, $k\, \mathbb{E}_g\left[R(s'_t, g)\right]$.
This term depends only on the baseline s′t. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn’t matter.
For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals g that involve sushi.
Second, in non-terminal states, relative reachability weights the penalty by $\beta$ instead of $k = \beta(1-\gamma)$. Since $\beta$ (and thus $k$) is an arbitrary hyperparameter, the real point is that in relative reachability, the weight on the penalty switches from $\beta$ in non-terminal states to the smaller $\beta(1-\gamma)$ in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it’s a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)
Summary: The actual effect of the new paper’s framing is that it 1. removes the “extra” incentive to finish the task quickly that relative reachability provided, and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.
(That said, it starts from a very different place than the original RR paper, so it’s interesting that they somewhat converge here.)
I often search through the Alignment Newsletter database to find the exact title of a relevant post (so that I can link to it in a new summary), often reading through the summary and opinion to make sure it is the post I’m thinking of.
Frequently, I read the summary normally, then read the first line or two of the opinion and immediately realize that it wasn’t written by me.
This is kinda interesting, because I often don’t know what tipped me off—I just get a sense of “it doesn’t sound like me”. Notably, I usually do agree with the opinion, so it isn’t about stating things I don’t believe. Nonetheless, it isn’t purely about personal writing styles, because I don’t get this sense when reading the summary.
(No particular point here, just an interesting observation)
(This shortform prompted by going through this experience with Embedded Agency via Abstraction)
This?:
https://docs.google.com/spreadsheets/d/1PwWbWZ6FPqAgZWOoOcXM8N_tUCuxpEyMbN1NYYC02aM/edit#gid=0
Or something in here?:
http://rohinshah.com/alignment-newsletter/
Yes (or more specifically, the private version from which that public one is automatically created).
How confident are you that this isn’t just memory? I personally think that upon rereading writing, it feels significantly more familiar if I wrote it than if I read and edited it. A piece of this is likely style, but I think much of it is the memory of having generated and more closely considered it.
It’s plausible, though note I’ve probably summarized over a thousand things at this point so this is quite a demand on memory.
But even so it still doesn’t explain why I don’t notice while reading the summary but do notice while reading the opinion. (Both the summary and opinion were written by someone else in the motivating example, but I only noticed from the opinion.)
Ah, this helps clarify. My hypotheses are then:
Even if you “agree” with an opinion, perhaps you’re highly attuned, but in a possibly not straightforward conscious way, to even mild (e.g. 0.1%) levels of disagreement.
Maybe the word choice you use for summaries is much more similar to others vs the word choice you use for opinions.
Perhaps there’s just a time lag, such that you’re starting to feel like a summary isn’t written by you but only realize by the time you get to the later opinion.
#3 feels testable if you’re so inclined.
(Not that inclined currently, but I do agree that all of these hypotheses are plausible)
The LCA paper (to be summarized in AN #98) presents a method for understanding how much each update to each parameter contributed to the overall change in loss. The basic idea is to decompose the overall change in training loss across training iterations:
$$L(\theta_T) - L(\theta_0) = \sum_t \left[L(\theta_t) - L(\theta_{t-1})\right]$$
And then to decompose the change in training loss at each iteration across the individual parameters:
$$L(\vec{\theta}_t) - L(\vec{\theta}_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$
I’ve added vector arrows to emphasize that θ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We’ll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:
$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}\left(\nabla L(\theta)\right).$$
(This is pretty standard, but I’ve included a derivation at the end.)
Since this is a dot product, it decomposes into a sum over the individual parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \sum_i \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) \text{Average}_{t-1}^{t}\left(\nabla L(\theta)\right)^{(i)}$$
So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as $A_t^{(i)} = \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) \text{Average}_{t-1}^{t}\left(\nabla L(\theta)\right)^{(i)}$
So based on this, I’m going to define my own version of LCA, called $LCA_{Naive}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $LCA_{Naive}$ uses the approximation $\text{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving $A_{t,\text{Naive}}^{(i)} = \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) G_{t-1}^{(i)}$. But the SGD update is given by $\theta_t^{(i)} = \theta_{t-1}^{(i)} - \alpha G_{t-1}^{(i)}$ (where $\alpha$ is the learning rate), which implies that $A_{t,\text{Naive}}^{(i)} = \left(-\alpha G_{t-1}^{(i)}\right) G_{t-1}^{(i)} = -\alpha \left(G_{t-1}^{(i)}\right)^2$, which is never positive, i.e. it predicts that every parameter always learns in every iteration. This isn’t surprising: we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
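Here is a toy check of this in Python, using a made-up quadratic regression loss with plain SGD; the point is just that each per-parameter contribution $-\alpha \left(G_{t-1}^{(i)}\right)^2$ comes out non-positive, while the gap between their sum and the actual change in loss shows the first-order approximation error:

```python
# Toy check that LCA_Naive contributions are always non-positive under plain SGD.
# Quadratic loss on made-up data; any differentiable loss would work.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta = rng.normal(size=5)

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta):
    return X.T @ (X @ theta - y) / len(y)

alpha = 0.05
for t in range(10):
    g = grad(theta)                           # G_{t-1}, the gradient used for the update
    new_theta = theta - alpha * g             # SGD step
    contributions = (new_theta - theta) * g   # per-parameter A = -alpha * g^2
    assert np.all(contributions <= 0)
    # First-order prediction vs. actual change in loss (they differ at second order):
    print(contributions.sum(), loss(new_theta) - loss(theta))
    theta = new_theta
```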
Yet, the experiments in the paper sometimes show positive LCAs. What’s up with that? There are a few differences between $LCA_{Naive}$ and the actual method used in the paper:
1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.
2. $LCA_{Naive}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.
3. $LCA_{Naive}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_{t-1}$ and $\theta_t$ to reduce the approximation error.
I think those are the only differences (though it’s always hard to tell if there’s some unmentioned detail that creates another difference), which means that whenever the paper says “these parameters had positive LCA”, that effect can be attributed to some combination of the above 3 factors.
----
Derivation of turning the path integral into a dot product with an average:
$$\begin{aligned}
L(\theta_t) - L(\theta_{t-1}) &= \lim_{n \to \infty} \sum_{k=0}^{n-1} \left(\nabla L(\theta_{t-1} + k\Delta\theta) \cdot \Delta\theta\right) \quad \text{where } \Delta\theta = \tfrac{1}{n}(\theta_t - \theta_{t-1}) \\
&= \lim_{n \to \infty} n\Delta\theta \cdot \left(\frac{1}{n}\sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k\Delta\theta)\right) \\
&= \lim_{n \to \infty} (\theta_t - \theta_{t-1}) \cdot \left(\frac{1}{n}\sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k\Delta\theta)\right) \\
&= (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}\left(\nabla L(\theta)\right),
\end{aligned}$$
where the average is defined as $\lim_{n \to \infty} \frac{1}{n}\sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k\Delta\theta)$.
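And a numeric check of this derivation: approximate the average gradient with a finite Riemann sum along the straight-line path and confirm the dot product recovers the change in loss (same made-up quadratic loss as in the sketch above):

```python
# Check that the change in loss equals the parameter step dotted with the average
# gradient along the straight-line path, approximated by a finite Riemann sum
# (left endpoints, mirroring the limit above). Made-up quadratic loss.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
loss = lambda th: 0.5 * np.mean((X @ th - y) ** 2)
grad = lambda th: X.T @ (X @ th - y) / len(y)

theta_prev, theta_next = rng.normal(size=5), rng.normal(size=5)
n = 10_000
delta = (theta_next - theta_prev) / n
avg_grad = np.mean([grad(theta_prev + k * delta) for k in range(n)], axis=0)
print(loss(theta_next) - loss(theta_prev), (theta_next - theta_prev) @ avg_grad)  # ~equal
```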
In my double descent newsletter, I said:
One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.
I don’t buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are N classes, the worst case is when the classes are all equally likely, in which case the average loss per data point is $-\ln(1/N) = \ln(N) \approx 2.3$ when $N = 10$ (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value but it does seem like regularization should already start having an effect. This is a really stupid and simple classifier to learn, and we’d expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy “overwhelms” the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can’t be true. When training on just L2 regularization, the gradient descent update is:
$$w \leftarrow w - \lambda w = (1-\lambda)w = cw \quad \text{for some constant } c.$$
For MLPs with relu activations and no biases, if you multiply all the weights by $c$, the logits get multiplied by $c^d$ (where $d$ is the depth of the network), no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can’t see a double descent on test error in this setting. (This doesn’t eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can’t happen in the “first train to zero error with cross-entropy and then regularize” setting.)
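A quick sketch of that scaling argument in numpy, with made-up weights and input; the check is just that scaling every weight by $c > 0$ scales the logits by $c^d$ and leaves the argmax (and hence the error) unchanged:

```python
# For a bias-free relu MLP, multiplying every weight matrix by c > 0 multiplies the
# logits by c^depth, so argmax predictions are unchanged. Made-up shapes and data.
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(20, 20)) for _ in range(3)] + [rng.normal(size=(20, 10))]
x = rng.normal(size=20)

def logits(weights, x):
    h = x
    for W in weights[:-1]:
        h = np.maximum(W.T @ h, 0)   # relu, no biases
    return weights[-1].T @ h

c, d = 0.7, len(Ws)
scaled = [c * W for W in Ws]
print(np.allclose(logits(scaled, x), c**d * logits(Ws, x)))      # True
print(np.argmax(logits(scaled, x)) == np.argmax(logits(Ws, x)))  # True
```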
It is possible that double descent doesn’t happen for MLPs with relu activations and no biases, but given how many other settings it seems to happen in I would be surprised.