I think we should step back for a second, though. Suppose you were in the epistemic position “yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network’s epistemic uncertainty/submodel-mismatch, and having come up blank...”. What’s the conclusion here? I don’t think it’s “my main guess is that there’s no way to apply this in practice”.
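(As a concrete reference point for what “distilling a neural network’s epistemic uncertainty” might mean operationally: one common proxy in the ML literature is disagreement across an ensemble of independently trained networks. The sketch below is only an illustration of that proxy; the `ensemble`, `threshold`, and `demonstrator` arguments are hypothetical stand-ins rather than anything proposed in this exchange.)

```python
import numpy as np

def disagreement(preds):
    """Mean pairwise total-variation distance between predictive
    distributions: a rough proxy for epistemic uncertainty."""
    pairs = [0.5 * np.abs(p - q).sum()
             for i, p in enumerate(preds) for q in preds[i + 1:]]
    return float(np.mean(pairs))

def act_or_defer(ensemble, x, threshold, demonstrator):
    """Act on the ensemble average when the members roughly agree
    (assumes at least two members); otherwise defer to the human."""
    preds = [np.asarray(member(x)) for member in ensemble]  # each member maps x to a distribution over actions
    if disagreement(preds) < threshold:
        return int(np.argmax(np.mean(preds, axis=0)))        # consensus action
    return demonstrator(x)                                    # too uncertain: query the human
```

Whether anything along these lines actually preserves the property the idealized algorithm relies on (a benign model reliably remaining in the deferential set) is exactly the open question in this exchange.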
A couple of separate points:
My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
I feel like there’s something a bit wrong with the ‘theory/practice’ framing at the moment. My position is that certain theoretical concerns (e.g., embeddedness) have a tendency to translate to practical concerns (e.g., approximating AIXI misses some important aspects of intelligence). Solving those ‘in theory’ may or may not translate to solving the practical issues ‘in practice’. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as being of this kind. I would contest whether it’s even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs. just requiring the human to do demonstrations forever.) I don’t really think of something like that as a “theoretical solution” to the realizability problem at all. That’s reserved for something like logical induction, which has unrealistically high computational complexity but does avoid a realizability assumption.
> Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network’s epistemic uncertainty, it’s potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don’t think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;
Yeah, this is a fair point.
> and 10 hours of unsuccessful search isn’t even close to the amount of time needed to demote that area from “most promising”.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn’t mean no one should ever think about it again, but it does seem worth communicating why the direction hasn’t borne fruit, at least to the extent that that line of research is happy being public.)
> I think the question we are discussing here is: “yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?”
Sounds right to me.
> I don’t see how this example makes that point. If the threshold of “unrealistic” is set in such a way that “realistic” models will only know most things about Sally, then this should apply to malign and benign models alike. (I think your example respects that; I’m just making it explicit.) However, there should be a benign and a malign model that knows about Sally’s affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
Ah, ok! Basically this is a new way of thinking about it for me, and I’m not sure what I think yet. My picture was that we argue that the top-weighted “good” (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think “the good guy” (which is now just benign + reasonably correct) never drops below N on the list, for any N (because oscillations can be unbounded).
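(For concreteness, here is a minimal sketch of the kind of top-N consensus step being discussed, assuming a Bayesian-style mixture over an explicit model class; the `Model` interface with `predict`/`likelihood` methods and the `demonstrator` oracle below are illustrative stand-ins, not details fixed by the original algorithm.)

```python
import numpy as np

def consensus_step(models, weights, obs, N, demonstrator):
    """Act autonomously only when the N highest-weight models agree;
    otherwise query the demonstrator and reweight (illustrative sketch)."""
    top = np.argsort(weights)[-N:]                     # indices of the top-N models by posterior weight
    preds = [models[i].predict(obs) for i in top]      # hypothetical Model.predict interface
    if all(p == preds[0] for p in preds):              # full agreement among the top N
        return preds[0], weights                       # act: any benign model in the top N concurs
    action = demonstrator(obs)                         # disagreement: defer to the human
    new_w = np.array([w * m.likelihood(obs, action)    # Bayes-style reweighting on the new demonstration
                      for w, m in zip(weights, models)])
    return action, new_w / new_w.sum()
```

The question at issue is whether some fixed N keeps a benign, adequately correct model inside `top` forever: with realizability its posterior weight stays bounded relative to the true model, so it does; without realizability its rank could in principle oscillate without bound, so no fixed N obviously suffices.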
> (A frugal hypothesis-generating function, instead of a brute-force search over all reasonable models, might miss out on, say, the benign version of the model that understands Sally’s allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we’re just using the example to evaluate the effect of removing the realizability assumption, while not removing the privilege of a brute-force search through reasonable models, then I stand by the choice to deem this paragraph parenthetical.)
I don’t really get why yet—can you spell the (brute-force) argument out in more detail?

(going for now, will read+reply more later)
A few quick thoughts, and I’ll get back to the other stuff later.
> To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.
That’s good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If, after 1 hour, you come up with a theoretical reason why it fails on its own terms (i.e., it is not even a theoretical solution), then there is no bound on how strongly you might reasonably conclude that it is fruitless. So the kind of meta point I was making only applied to your objections about slowdown in practice.
> a “theoretical solution” to the realizability problem at all.
I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you’re saying. The theory concerns a “fair” demonstrator whose behavior does not depend on the operation of the computer. There are probably multiple perspectives on how “theoretical” that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator’s behavior, so this doesn’t present a huge challenge for practically arriving at a good model. But that’s a whole can of worms.
> My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
Okay good, this worry makes much more sense to me.
Just want to note that although it’s been a week this is still in my thoughts, and I intend to get around to continuing this conversation… but possibly not for another two weeks.