Wei Dai comments on Can corrigibility be learned safely?

Wei Dai 25 Apr 2018 1:39 UTC
4 points

Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.

(You use “model” here in two different ways, right? The first one is like a data structure that represents some aspect of the world, the second one is an ML model, like a neural net, that takes that data structure as input/output?)

Can you give an example of this that’s simpler than this one? Maybe you can show how this idea can be applied in the translation example? I’d like to have some understanding of what the “big tree of messages” looks like before distilling and after unpacking (i.e., what information you expect to be lost). In this comment you talked about “analyzing large databases”. Are those large databases supposed to be distilled this way?

What about the ML models themselves? Suppose we have a simplified translation task breakdown that doesn’t use an external database. Then after distilling the most amplified agent, do we just end up with an ML model (since it just takes source text as input and outputs target text) that’s as opaque as one trained directly on some corpus of sentence pairs? ETA: Paul talked about the transparency of ML models in this post.