I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.
Some other thoughts:
I felt like the report was unusually well-motivated when I put my “mainstream ML” glasses on, relative to a lot of alignment work.
ARC’s overall approach is probably my favorite among the alignment research groups I’m aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.
Not sure if this is relevant in practice, but… the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn’t seem all that common to do this, and it’s not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.
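To make the contrast concrete, here’s a minimal sketch of the easy half of that (mine, not from the report, and assuming PyTorch): gradient descent on the conditional-probability parameters of a tiny Bayes net whose structure is fixed up front. Searching over the structure itself would mean choosing discrete parent sets, which is the part that doesn’t map cleanly onto this kind of continuous optimization.

```python
import torch
from torch.distributions import Bernoulli

# Fixed structure: A -> B -> C, all binary. Parameters are logits for
# P(A), P(B | A), P(C | B); the graph itself is never optimized.
logit_a = torch.zeros(1, requires_grad=True)
logit_b = torch.zeros(2, requires_grad=True)   # indexed by the value of A
logit_c = torch.zeros(2, requires_grad=True)   # indexed by the value of B

def log_likelihood(data):
    """data: (N, 3) float tensor of 0/1 observations of (A, B, C)."""
    a, b, c = data[:, 0], data[:, 1], data[:, 2]
    ll = (Bernoulli(logits=logit_a[0]).log_prob(a)
          + Bernoulli(logits=logit_b[a.long()]).log_prob(b)
          + Bernoulli(logits=logit_c[b.long()]).log_prob(c))
    return ll.sum()

# Toy observations roughly consistent with the chain A -> B -> C.
data = torch.tensor([[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 1]], dtype=torch.float32)

opt = torch.optim.Adam([logit_a, logit_b, logit_c], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = -log_likelihood(data)   # maximize likelihood of the CPT parameters
    loss.backward()
    opt.step()
```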
There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty. As applied to the SmartVault scenario, this might look like: “train lots of diverse mappings between the AI’s ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human’s ontology, try to figure out what’s going on”. IMO this general sort of approach is quite promising; I’d be interested to discuss more if people have thoughts.
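As a rough sketch of what I have in mind (assuming PyTorch; `Reporter` and `make_reporter_dataset` are hypothetical stand-ins for a reporter head on the predictor’s latent state and a human-labeled dataset, not anything from the report):

```python
import torch

N_REPORTERS = 8

def train_reporter(seed):
    torch.manual_seed(seed)                      # diversity via different inits / data order
    reporter = Reporter()                        # hypothetical: latent state -> P(diamond still there)
    data = make_reporter_dataset(shuffle_seed=seed)  # hypothetical: (latents, label) pairs
    opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
    for latents, label in data:                  # label: float tensor, 1.0 if diamond present per human judgment
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy(reporter(latents), label)
        loss.backward()
        opt.step()
    return reporter

reporters = [train_reporter(seed) for seed in range(N_REPORTERS)]

def looks_suspicious(latents, threshold=0.5):
    # Panic if *any* of the diverse mappings thinks the diamond is gone.
    return any(r(latents).item() < threshold for r in reporters)
```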
Thanks for the kind words (and proposal)!
I broadly agree with “train a bunch of models and panic if any of them say something is wrong.” The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix.
We’re imagining the case where the predictor internally performs inference in a learned model, i.e. we’re not explicitly learning a Bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don’t think this is a particularly realistic possibility, but if ELK fails in this kind of simple case, it seems likely to fail in messier, more realistic cases.
(We’re actually planning to do a narrower contest focused on ELK proposals.)