Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things!
I am sorry for the critical framing, in that it would have been more awesome to get a thought-dump of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long.
Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too.
It’s a fair point that it’s no slower than idealized MAP. But the most important corner cut by deep learning is that deep learning represents just one hypothesis at a time, searching the space by following a gradient rather than by explicitly comparing options. The question is, how can we cut the same corner for the consensus algorithm, which needs to compare the outputs of many hypotheses?
In some settings, this is possible: for sufficiently simple hypothesis spaces, we can check consensus without explicitly computing a bunch of hypotheses. However, for deep learning, it seems rather difficult.
So, it seems like the best we can expect to do for deep learning is to train and run 100 hypotheses (or whatever number). This is a huge approximation in terms of MAP (since we have no guarantees that we are finding the 100 most probable, or anything), but we can naturally re-frame the consensus-alg guarantee in terms of frequency-of-malign-results for the NN training (rather than an assumption about at least 1 of the 100 most probable hypotheses being non-malign).
But this still means that, for a consensus of N hypotheses, the consensus algorithm will be N times slower (in terms of both training time and inference time). I expect N to be quite large, for reasons similar to what I said in the post: not only do we have to think N is large enough that at least one of the hypotheses is benign, but also we have to think that the benign hypothesis is at least as capable as any of the malign hypotheses (because otherwise it could get unlucky and be eliminated). For the purpose of imitation learning, this means we think one of the hypotheses has exactly learned to imitate a human. Since this problem is going to be quite under-determined even with quite large amounts of data, it seems like N needs to be large enough to chance upon it reliably.
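To pin down what I’m imagining (a toy sketch of my own, not something from your post; train_model, consensus_act, and the stand-in demonstrator are all placeholders): train N models in separate runs, act only when they unanimously agree, and otherwise query the demonstrator. Both the N training runs and the N forward passes per observation are visible here.

```python
import random
from typing import Callable, List, Tuple

Observation = int
Action = int

def train_model(demonstrations: List[Tuple[Observation, Action]], seed: int) -> Callable[[Observation], Action]:
    """Stand-in for training one network from an independent random
    initialization: memorize the demonstrations, guess elsewhere."""
    rng = random.Random(seed)
    table = dict(demonstrations)
    return lambda obs: table[obs] if obs in table else rng.randrange(10)

def consensus_act(models, obs: Observation, query_human: Callable[[Observation], Action]) -> Action:
    """Act only if every model proposes the same action; otherwise defer.
    Every model runs on every observation (the N-fold inference cost),
    on top of the N training runs."""
    actions = [m(obs) for m in models]
    if all(a == actions[0] for a in actions):
        return actions[0]            # unanimous: act autonomously
    return query_human(obs)          # disagreement: hand control to the demonstrator

# Toy usage
demos = [(0, 3), (1, 7), (2, 7)]
ensemble = [train_model(demos, seed=s) for s in range(100)]
human = lambda obs: 0                # stand-in demonstrator
print(consensus_act(ensemble, 1, human))   # seen observation: agreement, prints 7
print(consensus_act(ensemble, 5, human))   # unseen: near-certain disagreement, prints 0
```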
Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, it would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two.
My main point here is: this isn’t the most important corner I see deep learning cutting, as I described above.
In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runner ups.
“Keeping track of runner-ups” in the MAP case makes a lot of sense. But for the deep learning case, it sounds like you are suggesting that we do consensus on part of the path that a single training run takes in hypothesis space. This seems like a pretty bad idea, for several reasons:
They will all be pretty similar, so getting consensus doesn’t tell us much. We generally have no reason to assume that some point along the path will be benign—undercutting the point of the consensus algorithm.
The older parts of the path will basically be worse, so if you keep a lot of path, you get a lot of not-very-useful failures of consensus.
Lottery-ticket research suggests that if a malign structure is present in the end, then precursors to it will be present at the beginning.
So it seems to me that you at least need to do independent training runs (w/ different random initializations) for the different models which you are checking consensus between, so that they are “independent” in some sense (perhaps most importantly, drawing different lottery tickets).
However, running the same training algorithm many times may not realistically explore the space enough. We sort of expect the same result from the same training procedure. Sufficiently large models will contain malign lotto tickets with high probability (so we can’t necessarily argue from “one of these N initializations almost certainly lacks a malign lotto ticket” without very high N). The gradient landscape contains the same demons; maybe the chances of being pulled into them during training are just quite high. All of this suggests that N may need to be incredibly high, or, other measures may need to be taken to ensure that the consensus is taken between a greater variety of hypotheses than what we get from re-running training.
I don’t understand what out-guess means. But what we need is that the malign hypotheses don’t have substantially higher posterior weight than the benign ones.
Right, that’s what I meant. Suppose we’re trying to imitation-learn about Sally. Sally has a bunch of little nuances to her personality. For example, she has Opinions about flowers, butter, salt… and a lot of other little things, which I’m supposing can’t be anticipated from each other. I’m suggesting that no Bayesian hypothesis can get all of those things right from the get-go. So imagine that after a while, the top 100 hypotheses are all pretty good at modeling Sally in typical situations, but each one has a different “Sally secret”—a different lucky guess about one of these little things.
In particular, a benign hypothesis (let’s call it Ted) knows about the flower thing, and a malign hypothesis (Jerry) knows about a butter thing.
Unexpectedly, the butter thing becomes a huge focus of Sally’s life for a while. Ted falls far out of favor compared to Jerry, since Ted just didn’t see this coming. Maybe Ted updates pretty quickly, but it’s too late, Ted has lost a bunch of Bayes points.
Whereas if the flower thing had come up, instead, we could consider ourselves lucky; Ted would still be in the running.
With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.
That’s the general idea I had in mind. The point is that N has to be high enough that we don’t expect this to happen at any point.
As time passes, the probability of this happening at different timesteps is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.
Why will there be one best? That’s the realizability assumption. There is not necessarily a unique model with lowest bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top. (This is one of the classic frequentist critiques of bayesianism.) That’s the formal thing that my “flowers vs butter” story about Sally is supposed to point at.
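Here is a toy version of that oscillation, just to make the formal point concrete (my own illustration; the probabilities and the Ted/Jerry labels are only meant to echo the Sally story): when the truth is a fair coin that neither hypothesis matches, the log posterior ratio between the two is a mean-zero random walk, so the lead changes hands forever and the dips of either hypothesis are unbounded.

```python
import math
import random

random.seed(0)
T = 200_000
p_true, p_ted, p_jerry = 0.5, 0.6, 0.4   # the truth is not in the two-model class

log_ratio = 0.0        # log( posterior weight of Ted / posterior weight of Jerry )
flips, leader = 0, None
ted_worst = 0.0        # lowest the log ratio ever gets (Ted's worst moment)

for t in range(T):
    x = 1 if random.random() < p_true else 0
    like_ted = p_ted if x == 1 else 1.0 - p_ted
    like_jerry = p_jerry if x == 1 else 1.0 - p_jerry
    log_ratio += math.log(like_ted) - math.log(like_jerry)   # Bayes update
    ted_worst = min(ted_worst, log_ratio)
    new_leader = "Ted" if log_ratio > 0 else "Jerry"
    if leader is not None and new_leader != leader:
        flips += 1
    leader = new_leader

print(f"lead changed {flips} times; Ted's worst log-weight deficit: {ted_worst:.1f}")
# Mean-zero increments make the walk recurrent: the lead keeps flipping and the
# worst deficit keeps growing over time, so no fixed N keeps Ted in the top N forever.
```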
We can do better with logical induction or infrabayesianism. But I’m still leery of consensus-type approaches for those on other grounds.
Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice.
When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well.
Haha :) OK, I misinterpreted you.
But the idealized issues (can a bayesian hypothesis model the computer it is running on?) have practical analogues (can we expect hypothesis generation to produce one model which is uniquely best, or will different models simply “know different things”?). So when I judge an idealized algorithm I think about those practical analogues, and whether they seem to suggest problems for realistic approximations.
In particular, you want your safety argument to translate well, since otherwise, what’s the point?
And for the consensus algorithm, I’ve already explained why I do think it’s particularly bad in this way.
Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are close. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.
Right, so what I’m getting is that we can think of GPT-N as already containing a large ensemble. But how would you check consensus?
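(The closest thing to an answer I can picture, purely as an illustration and not something you’ve proposed, is to sample sub-models from the one network, e.g. via MC dropout, and treat their disagreement as the consensus check. A rough sketch, where the tiny network and the thresholds are all stand-ins:)

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Stand-in network; dropout layers give us cheap 'sub-models' to sample."""
    def __init__(self, n_obs=16, n_act=4, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_act),
        )
    def forward(self, obs):
        return self.net(obs)

def dropout_consensus(model, obs, n_samples=32):
    """Sample actions under n_samples random dropout masks (MC dropout).
    Return the action if every sampled sub-model agrees, else None
    (meaning: defer to the demonstrator)."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        actions = {int(model(obs).argmax(dim=-1)) for _ in range(n_samples)}
    return actions.pop() if len(actions) == 1 else None

policy = TinyPolicy()
obs = torch.randn(1, 16)
decision = dropout_consensus(policy, obs)
if decision is None:
    print("sub-models disagree: defer to the demonstrator")
else:
    print("consensus action:", decision)
```

But sub-models sampled this way share nearly all of their weights, so they look like exactly the kind of closely related hypotheses I argued above don’t give a meaningful consensus.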
It’s imitation learning so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.
So, what, the human has to act like they’re driving the car forever, even though the AI autopilot is really good? If not, then there are many scenarios where the human would do well if driving normally but would crash if they were in the middle of sipping their drink when the feedback call unexpectedly came.
In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like “case closed” to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.
If you told me “actually there is no feedback call button, the human is constantly monitored, the only difference is whether the data is used” then I’d concede that point; but then it seems like we’re stuck driving the car forever.
If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it) you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.
Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?
A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed “now!”. If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn’t forget to put a stake in the ground marking “lots of progress”. If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.
This strikes me as a disorganized mix of arguments, and I’ve kind of run out of time atm to write this response, sorry! I will think more on what you have said.
Maybe this was someone else, but it could have been me.
I like the plausible deniability my anonymised examples gave me ;3
I felt I had remained quiet about my disagreement with you for too long
Haha that’s fine. If you don’t voice your objections, I can’t respond to them!
I think let’s step back for a second, though. Suppose you were in the epistemic position “yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network’s epistemic uncertainty/submodel-mismatch, and having come up blank...” what’s the conclusion here? I don’t think it’s “my main guess is that there’s no way to apply this in practice”. Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network’s epistemic uncertainty, it’s potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016), section 2.1 (inductive ambiguity identification). If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don’t think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated.
If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider “unrealistic” models is such a boon), and there were thus lots of theoretical solutions floating around, any given one might not be such a strong signal: “this is the place to look for realistic solutions”. But if there’s only one floating around, that’s a very strong signal that we might be looking in a fundamental part of the solution space. In general, I think the most practical place to look for practical solutions is near the best theoretical one, and 10 hours of unsuccessful search isn’t even close to the amount of time needed to demote that area from “most promising”.
I think this covers my take on a few of your points, but some of your points are separate. In particular, some of them bear on the question of whether this really is an idealized solution in the first place.
With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.
I think the question we are discussing here is: “yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?” I don’t see how this example makes that point. If the threshold of “unrealistic” is set in such a way that “realistic” models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally’s affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally’s allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we’re just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).
Why will there be one best? That’s the realizability assumption. There is not necessarily a unique model with lowest bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top.
Yeah I was thinking that the realistic setting was a finite length setting, with the one best being the best at the end. (And if it is best at the end, you can put a lower bound on how low its posterior weight ever was, since it’s hard to recover from having vanishingly small weight, and then alpha just needs to be set to include that). But fair enough to be interested in an infinite lifetime with a finite model class that does not include the truth. So yeah, a model’s ranking can oscillate forever, although I think intelligent systems won’t really do this in practice? I think in an infinite lifetime, it is reasonable in practice to assume that if something is generally intelligent, it has some sort of ability to generate new hypotheses over time. And if you have two hypotheses about Sally, one which explains her relationship to butter, and one which explains her relationship to flowers, it’s pretty easy to generate another that combines these. So I’m not sure how instructive it is to consider the finite model class/infinite horizon Bayesian sequence prediction as a model for state-of-the-art AI.
So, what, the human has to act like they’re driving the car forever, even though the AI autopilot is really good?
The suggestion I make later is that a few timesteps in advance, the human is told to start acting like they’re driving the car, but they’re not controlling it yet, and then comes the timestep when autopilot actually defers to them, and that’s the behavior that gets recorded. Your response is
Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?
That proposal was making a different point—it was refuting exactly the concern that the human would have to act like they’re driving the car forever. This protocol just defends active imitation learning in general as potentially viable for the operation of heavy machinery. Assume the demonstrator is sitting in a driving simulator, and all the relevant info shows up on their screens; this allows a small team to be the demonstrators for many autopilots simultaneously. It’s not a question of whether an inner optimizer could exploit this protocol by finding the most inconvenient time for the human to be called in, and then querying the human—it’s much worse than that! If there’s a bad time for the human driver to be called in, then the autopilot will fail at that time regardless; the inner optimizer doesn’t even have to arrange an actual query. So there’s nothing to “exploit” really. And defending against the “much worse” version, I do think that the policy of a human driver with 5 seconds of acquaintance with the current state does not crash the car at a rate significantly higher than a real human driver. Also, it may be worth saying that I don’t think we really need general intelligence to operate heavy machinery at human level, so I’m not sure the heavy machinery case is very important to consider in the first place.
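To make the lead-time idea from before a bit more concrete (toy code; LEAD_TIME, FALSE_ALARM_RATE, and the pre-generated query_times are all stand-ins rather than part of any actual proposal): the learner looks a few steps ahead, alerts the demonstrator before any likely query so they can shadow the controls, and accepts some false alarms as the price.

```python
import random

random.seed(1)
T = 200
LEAD_TIME = 5            # steps of advance warning for the demonstrator
FALSE_ALARM_RATE = 0.03  # alerts that turn out not to be needed

# Stand-in for the environment: timesteps at which the models will actually
# disagree and a demonstration will be needed.
query_times = sorted(random.sample(range(LEAD_TIME, T), 4))

shadowing = set()        # timesteps during which the demonstrator is "in the seat"
for t in range(T):
    # Look ahead: if a likely query falls inside the lead window (or a false
    # alarm fires), tell the demonstrator to start shadow-driving now.
    likely_query_ahead = any(t < q <= t + LEAD_TIME for q in query_times)
    if likely_query_ahead or random.random() < FALSE_ALARM_RATE:
        shadowing.update(range(t, min(t + LEAD_TIME + 1, T)))

    if t in query_times:
        acquainted = all(s in shadowing for s in range(t - LEAD_TIME, t))
        print(f"t={t}: query; demonstrator had the full lead time: {acquainted}")

print(f"demonstrator attention needed on {len(shadowing)} of {T} timesteps (false alarms included)")
```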
In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like “case closed” to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.
You said you’d continue to think about the points I made about why I don’t have the same reaction to this, so we can come back to those later. But stepping back, I think I want to make a point about the information loss you impose on your beliefs when applying the security mindset or Murphy’s Law. A malign model gets to say “query now” once! And then it is removed from the posterior because of its wrong prediction. That is not many bits for an output channel. But… yes, it is an output channel. And if you insist that all output channels for malign stuff transmit world-ending signals, just like all aquariums will leak, I agree that that is sometimes a useful perspective to adopt and run with, but the exercise of making actual probability estimates about events is potentially different from exercising a security mindset. I mentioned information loss. With a security mindset, there is no point in distinguishing a malign agent with a 20 bit output channel from one with an unbounded output channel. You just close the case on both. But we really might live in a world where the former is existentially safe, and the latter is not—I mean that even seems likely to me. Even from my perspective, where I don’t think a 20 bit output channel is dangerous, I understand the merit in disregarding that belief as the motivation for a research agenda, but I do think that’s a whole different sort of “epistemic status”. So when you say “I take it as a given that a sufficiently intelligent malign actor could exploit this” that makes sense to me as an action, but not as a proposition. (I mean, I think it’s incorrect, so I certainly don’t think it’s self-evidently true).
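To put a rough number on that channel (my own back-of-the-envelope, with an illustrative horizon T): a hypothesis that defects exactly once, and is then discarded, communicates at most its choice of when, which for a horizon on the order of a million timesteps is about 20 bits.

```latex
% A single well-timed defection over a horizon of T timesteps can select one
% timestep out of T, so it transmits at most
\[
  \log_2 T \text{ bits}, \qquad \text{e.g.\ } \log_2\!\left(10^{6}\right) \approx 19.9 \text{ bits}.
\]
```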
I guess I can expand on the object level too. Like, take a moment to really stare at 01110010101011010000, and imagine that you take it to be a signal from some deeply wise agent that arranged for a query at timestep number 469,712. Suppose further that you already think this deeply wise agent is not malign, but rather is much maligned! (So it doesn’t even need to spend the first few bits enthralling you). What do you do now? Or maybe timestep 469,712 doesn’t mean that bitstring—maybe it means “this is a very meaningful world-state, and of similar world-states, the fact that this one was chosen tells me I should run program X on a supercomputer, so it can engage in some good, clean world-optimization”. Look around you right now—if someone told you “now is important, so think about that, and go run a program”, what program do you produce? Am I taking a huge risk even asking that, because there’s a precise time such that if I did, you’d run that program and it would end the world?
I think let’s step back for a second, though. Suppose you were in the epistemic position “yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network’s epistemic uncertainty/submodel-mismatch, and having come up blank...” what’s the conclusion here? I don’t think it’s “my main guess is that there’s no way to apply this in practice”.
A couple of separate points:
My main worry continues to be the way bad actors have control over an I/O channel, rather than the slowdown issue.
I feel like there’s something a bit wrong with the ‘theory/practice’ framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those ‘in theory’ may or may not translate to solving the practical issues ‘in practice’. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as being of this kind. I would contest whether it’s even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don’t really think of something like that as a “theoretical solution” to the realizability problem at all. That’s reserved for something like logical induction which has unrealistically high computational complexity, but does avoid a realizability assumption.
Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network’s epistemic uncertainty, it’s potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don’t think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;
Yeah, this is a fair point.
and 10 hours of unsuccessful search isn’t even close to the amount of time needed to demote that area from “most promising”.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn’t mean no one should ever think about it again, but it does seem worth communicating why the direction hasn’t borne fruit, at least to the extent that that line of research is happy being public.)
I think the question we are discussing here is: “yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?”
Sounds right to me.
I don’t see how this example makes that point. If the threshold of “unrealistic” is set in such a way that “realistic” models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally’s affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
Ah, ok! Basically this is a new way of thinking about it for me, and I’m not sure what I think yet. My picture was that we argue that the top-weighted “good” (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think “the good guy” (which is now just benign + reasonably correct) never drops below N on the list, for any N (because oscillations can be unbounded).
(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally’s allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we’re just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).
I don’t really get why yet—can you spell the (brute-force) argument out in more detail?
(going for now, will read+reply more later)
A few quick thoughts, and I’ll get back to the other stuff later.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.
That’s good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms—i.e. it is not even a theoretical solution—then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice.
a “theoretical solution” to the realizability problem at all.
I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you’re saying. The theory regards a “fair” demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of “theoretical” that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator’s behavior, so this doesn’t present a huge challenge for practically arriving at a good model. But that’s a whole can of worms.
My main worry continues to be the way bad actors have control over an I/O channel, rather than the slowdown issue.
Okay good, this worry makes much more sense to me.
Just want to note that although it’s been a week this is still in my thoughts, and I intend to get around to continuing this conversation… but possibly not for another two weeks.