I felt I had remained quiet about my disagreement with you for too long
Haha that’s fine. If you don’t voice your objections, I can’t respond to them!
I think let’s step back for a second, though. Suppose you were in the epistemic position “yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network’s epistemic uncertainty/submodel-mismatch, and having come up blank...” what’s the conclusion here? I don’t think it’s “my main guess is that there’s no way to apply this in practice”. Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network’s epistemic uncertainty, it’s potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016), section 2.1 (inductive ambiguity identification). If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don’t think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated.
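[Editorial note: a purely illustrative sketch of what “diverse-model-extraction” could look like in modern ML. Everything here is invented for illustration; bootstrapped linear fits stand in for independently trained networks, and ensemble disagreement is one common proxy for a model's epistemic uncertainty.]

```python
import random

# A minimal sketch of the "diverse-model-extraction" idea: train an ensemble
# on bootstrap resamples of the data and treat its disagreement at a query
# point as a proxy for epistemic uncertainty. The "models" here are 1-D
# linear fits, standing in for independently trained networks.

def fit_line(xs, ys):
    """Least-squares slope and intercept for one bootstrap resample."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return b, my - b * mx

def ensemble_disagreement(xs, ys, query_x, n_models=50, seed=0):
    """Spread of ensemble predictions at query_x (epistemic-uncertainty proxy)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]  # bootstrap resample
        b, a = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * query_x)
    return max(preds) - min(preds)

# Noisy data near x in [0, 1]; the ensemble agrees there but fans out far
# from the data, which is where a consensus-style agent would defer.
data_x = [i / 20 for i in range(21)]
data_y = [2 * x + random.Random(i).gauss(0, 0.1) for i, x in enumerate(data_x)]
near = ensemble_disagreement(data_x, data_y, query_x=0.5)
far = ensemble_disagreement(data_x, data_y, query_x=10.0)
print(near < far)  # disagreement grows off-distribution
```

The design choice mirrors the consensus algorithm: act where the ensemble agrees, and defer (query the demonstrator) where the predictions fan out.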
If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider “unrealistic” models is such a boon), and there were thus lots of theoretical solutions floating around, any given one might not be such a strong signal: “this is the place to look for realistic solutions”. But if there’s only one floating around, that’s a very strong signal that we might be looking in a fundamental part of the solution space. In general, I think the most promising place to look for practical solutions is near the best theoretical one, and 10 hours of unsuccessful search isn’t even close to the amount of time needed to demote that area from “most promising”.
I think this covers my take on a few of your points, but some of your points are separate. In particular, some of them bear on the question of whether this really is an idealized solution in the first place.
With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.
I think the question we are discussing here is: “yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?” I don’t see how this example makes that point. If the threshold of “unrealistic” is set in such a way that “realistic” models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally’s affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
(A frugal hypothesis-generating function, instead of a brute-force search over all reasonable models, might miss out on, say, the benign version of the model that understands Sally’s allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we’re just using the example to evaluate the effect of removing the realizability assumption while keeping the privilege of a brute-force search through reasonable models, then I stand by the choice to deem this paragraph parenthetical.)
Why will there be one best? That’s the realizability assumption. There is not necessarily a unique model with lowest Bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top.
Yeah I was thinking that the realistic setting was a finite length setting, with the one best being the best at the end. (And if it is best at the end, you can put a lower bound on how low its posterior weight ever was, since it’s hard to recover from having vanishingly small weight, and then alpha just needs to be set to include that). But fair enough to be interested in an infinite lifetime with a finite model class that does not include the truth. So yeah, a model’s ranking can oscillate forever, although I think intelligent systems won’t really do this in practice? I think in an infinite lifetime, it is reasonable in practice to assume that if something is generally intelligent, it has some sort of ability to generate new hypotheses over time. And if you have two hypotheses about Sally, one which explains her relationship to butter, and one which explains her relationship to flowers, it’s pretty easy to generate another that combines these. So I’m not sure how instructive it is to consider the finite model class/infinite horizon Bayesian sequence prediction as a model for state-of-the-art AI.
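[Editorial note: a toy simulation, my construction rather than anything from this exchange, makes the oscillation concrete. With a finite model class that excludes the truth, two models that each explain only one regime of the data can trade places in the posterior ranking forever.]

```python
import math

# Bayesian sequence prediction with a model class that excludes the truth.
# One model "understands" Sally's butter-related behavior, the other her
# flower-related behavior; the environment alternates between regimes, so
# each model predicts well only part of the time and the MAP model flips
# forever. (Likelihoods are stylized: 0.9 in-regime, 0.5 out-of-regime.)

LOG_GOOD = math.log(0.9)  # per-step log-likelihood in a model's own regime
LOG_BAD = math.log(0.5)   # per-step log-likelihood when merely guessing

def leader_changes(T: int, block: int = 50) -> int:
    """Count how often the top-ranked (MAP) model changes over T steps."""
    ll_butter, ll_flower = 0.0, 0.0
    changes, leader = 0, None
    for t in range(T):
        # Regime alternates every `block` steps, offset by half a block so
        # the log-likelihood gap swings symmetrically through zero.
        in_butter_regime = ((t + block // 2) // block) % 2 == 0
        ll_butter += LOG_GOOD if in_butter_regime else LOG_BAD
        ll_flower += LOG_BAD if in_butter_regime else LOG_GOOD
        new_leader = "butter" if ll_butter > ll_flower else "flower"
        if leader is not None and new_leader != leader:
            changes += 1
        leader = new_leader
    return changes

print(leader_changes(1000))  # the MAP model keeps flipping; no convergence
```

Doubling the horizon roughly doubles the flip count, which is the "oscillate forever" point in miniature.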
So, what, the human has to act like they’re driving the car forever, even though the AI autopilot is really good?
The suggestion I make later is that a few timesteps in advance, the human is told to start acting like they’re driving the car, but they’re not controlling it yet, and then comes the timestep when autopilot actually defers to them, and that’s the behavior that gets recorded. Your response is
Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?
That proposal was making a different point—it was refuting exactly the concern that the human would have to act like they’re driving the car forever. This protocol just defends active imitation learning in general as potentially viable for the operation of heavy machinery. Assume the demonstrator is sitting in a driving simulator, and all the relevant info shows up on their screens; this allows a small team to be the demonstrators for many autopilots simultaneously. It’s not a question of whether an inner optimizer could exploit this protocol by finding the most inconvenient time for the human to be called in, and then querying the human—it’s much worse than that! If there’s a bad time for the human driver to be called in, then the autopilot will fail at that time regardless; the inner optimizer doesn’t even have to arrange an actual query. So there’s nothing to “exploit” really. And defending against the “much worse” version, I do think that the policy of a human driver with 5 seconds of acquaintance with the current state does not crash the car at a rate significantly higher than that of a real human driver. Also, it may be worth saying that I don’t think we really need general intelligence to operate heavy machinery at human level, so I’m not sure the heavy machinery case is very important to consider in the first place.
In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like “case closed” to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.
You said you’d continue to think about the points I made about why I don’t have the same reaction to this, so we can come back to those later. But stepping back, I think I want to make a point about the information loss you impose on your beliefs when applying the security mindset or Murphy’s Law. A malign model gets to say “query now” once! And then it is removed from the posterior because of its wrong prediction. That is not many bits for an output channel. But… yes, it is an output channel. And if you insist that all output channels for malign stuff transmit world-ending signals, just like all aquariums will leak, I agree that that is sometimes a useful perspective to adopt and run with, but the exercise of making actual probability estimates about events is potentially different from exercising a security mindset. I mentioned information loss. With a security mindset, there is no point in distinguishing a malign agent with a 20 bit output channel from one with an unbounded output channel. You just close the case on both. But we really might live in a world where the former is existentially safe, and the latter is not—I mean that even seems likely to me. Even from my perspective, where I don’t think a 20 bit output channel is dangerous, I understand the merit in disregarding that belief as the motivation for a research agenda, but I do think that’s a whole different sort of “epistemic status”. So when you say “I take it as a given that a sufficiently intelligent malign actor could exploit this” that makes sense to me as an action, but not as a proposition. (I mean, I think it’s incorrect, so I certainly don’t think it’s self-evidently true).
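[Editorial note: rough arithmetic on the “not many bits” claim. The horizon below is illustrative, and the 20-bit figure above is the author’s, not derived here.]

```python
import math

# A malign model in the consensus set can trigger at most one query before
# its deliberately wrong prediction removes it from the posterior, so its
# only message is *which* timestep it picked: about log2(T) bits over a
# horizon of T timesteps.

def channel_bits(horizon_steps: int) -> float:
    """Bits conveyed by selecting one query time out of horizon_steps."""
    return math.log2(horizon_steps)

decade_of_ms_steps = 10 * 365 * 24 * 3600 * 1000  # ten years at 1 ms per step
print(round(channel_bits(decade_of_ms_steps), 1))  # about 38 bits
```

Even under this extreme horizon the channel stays in the tens of bits, which is why the question of whether a tens-of-bits channel is existentially dangerous is doing all the work in the disagreement.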
I guess I can expand on the object level too. Like, take a moment to really stare at 01110010101011010000, and imagine that you take it to be a signal from some deeply wise agent that arranged for a query at timestep number 469,712. Suppose further that you already think this deeply wise agent is not malign, but rather is much maligned! (So it doesn’t even need to spend the first few bits enthralling you). What do you do now? Or maybe timestep 469,712 doesn’t mean that bitstring—maybe it means “this is a very meaningful world-state, and of similar world-states, the fact that this one was chosen tells me I should run program X on a supercomputer, so it can engage in some good, clean world-optimization”. Look around you right now—if someone told you “now is important, so think about that, and go run a program”, what program do you produce? Am I taking a huge risk even asking that, because there’s a precise time such that if I did, you’d run that program and it would end the world?
I think let’s step back for a second, though. Suppose you were in the epistemic position “yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network’s epistemic uncertainty/submodel-mismatch, and having come up blank...” what’s the conclusion here? I don’t think it’s “my main guess is that there’s no way to apply this in practice”.
A couple of separate points:
My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
I feel like there’s something a bit wrong with the ‘theory/practice’ framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those ‘in theory’ may or may not translate to solving the practical issues ‘in practice’. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as being of this kind. I would contest whether it’s even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don’t really think of something like that as a “theoretical solution” to the realizability problem at all. That’s reserved for something like logical induction, which has unrealistically high computational complexity, but does avoid a realizability assumption.
Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network’s epistemic uncertainty, it’s potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don’t think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;
Yeah, this is a fair point.
and 10 hours of unsuccessful search isn’t even close to the amount of time needed to demote that area from “most promising”.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn’t mean no one should ever think about it again, but it does seem worth communicating why the direction hasn’t borne fruit, at least to the extent that that line of research is happy being public.)
I think the question we are discussing here is: “yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?”
Sounds right to me.
I don’t see how this example makes that point. If the threshold of “unrealistic” is set in such a way that “realistic” models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally’s affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
Ah, ok! Basically this is a new way of thinking about it for me, and I’m not sure what I think yet. My picture was that we argue that the top-weighted “good” (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think “the good guy” (which is now just benign + reasonably correct) never drops below N on the list, for any N (because oscillations can be unbounded).
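[Editorial note: the “never gets too unlucky” step has a standard formalization under realizability; the notation below is mine, not from this exchange.]

```latex
\[
  w_t(\mu) \;=\; \frac{w(\mu)\,\mu(x_{1:t})}{\xi(x_{1:t})},
  \qquad
  \xi(x_{1:t}) \;=\; \sum_{i} w(\nu_i)\,\nu_i(x_{1:t}) .
\]
% The ratio M_t = xi/mu is a nonnegative supermartingale under the true
% environment mu, with E_mu[M_t] <= 1, so Ville's inequality gives:
\[
  \Pr_\mu\!\big(\exists t :\; w_t(\mu) \le \delta\, w(\mu)\big)
  \;=\;
  \Pr_\mu\!\big(\exists t :\; M_t \ge 1/\delta\big)
  \;\le\; \delta ,
  \qquad M_t := \xi(x_{1:t})/\mu(x_{1:t}) .
\]
```

Without realizability there is no model in the class for which $M_t$ is a supermartingale, so no choice of $N$ (or $\alpha$) comes with this guarantee, which seems to be exactly the crux here.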
(A frugal hypothesis-generating function, instead of a brute-force search over all reasonable models, might miss out on, say, the benign version of the model that understands Sally’s allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we’re just using the example to evaluate the effect of removing the realizability assumption while keeping the privilege of a brute-force search through reasonable models, then I stand by the choice to deem this paragraph parenthetical.)
I don’t really get why yet—can you spell the (brute-force) argument out in more detail?
(going for now, will read+reply more later)
A few quick thoughts, and I’ll get back to the other stuff later.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.
That’s good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms—i.e. it is not even a theoretical solution—then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice.
a “theoretical solution” to the realizability problem at all.
I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you’re saying. The theory regards a “fair” demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of “theoretical” that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator’s behavior, so this doesn’t present a huge challenge for practically arriving at a good model. But that’s a whole can of worms.
My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
Okay good, this worry makes much more sense to me.
Just want to note that although it’s been a week this is still in my thoughts, and I intend to get around to continuing this conversation… but possibly not for another two weeks.