The claim that it’s easier to learn a covering set for a “true” harm predicate and then act conservatively wrt that set, than to learn a single harm predicate, is not a new approach. E.g. just from CHAI:
The Inverse Reward Design paper, which tries to straightforwardly implement the posterior P(True Reward | Human specification of Reward) and act conservatively wrt that posterior (see the sketch after this list for the general pattern).
The Learning preferences from state of the world paper does this for P(True Reward | Initial state of the environment) and also acts conservatively wrt the result.
[A bunch of other papers which consider this for P(Reward | Another Source of Evidence), including randomly sampled rewards and human off-switch pressing. Also the CIRL paper, which proposes using uncertainty to directly solve the meta-problem of “the thing this uncertainty is for”.]
It’s discussed as a strategy in Stuart’s Human Compatible, though I don’t have a copy on hand to cite the exact page.
I also remember finding Rohin’s Reward Uncertainty to be a good summary of ~2018 CHAI thinking on this topic. There’s a lot more academic work in this vein from other research groups/universities too.
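To make the shared pattern concrete, here’s a minimal sketch of “maintain a distribution over reward functions and act conservatively wrt it” (the reward samples, plans, and quantile threshold below are all made up for illustration; this is not any of these papers’ actual implementation):

```python
import numpy as np

# Stand-in for samples from the posterior P(true reward | evidence),
# e.g. the proxy-reward evidence used in Inverse Reward Design.
rng = np.random.default_rng(0)
reward_samples = [rng.normal(size=3) for _ in range(100)]  # linear reward weights

# Candidate plans, described by (made-up) feature vectors.
actions = {
    "cautious_plan": np.array([1.0, 0.0, 0.0]),
    "aggressive_plan": np.array([0.0, 2.0, -2.0]),
}

def pessimistic_value(features, samples, quantile=0.05):
    """Score a plan by a low quantile of its value under the reward posterior,
    rather than by the posterior mean, i.e. risk-averse/conservative planning."""
    values = np.array([w @ features for w in samples])
    return np.quantile(values, quantile)

best = max(actions, key=lambda a: pessimistic_value(actions[a], reward_samples))
print(best)  # the plan whose worst plausible value is highest
```

The papers above differ mainly in which evidence the posterior conditions on and in how exactly the conservative aggregation is done, but this is the common skeleton.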
The reason I’m not excited about this work is that (as Ryan and Fabien say) correctly specifying this distribution without solving ELK also seems really hard.
It’s clear that if you allow H to be “all possible harm predicates”, then an AI that acts conservatively wrt this set is safe, but it’s also going to be completely useless. Specifying the task of learning a harm predicate distribution that is good enough to both cover the “true” harms and still allow your AI to do things is quite hard, and subject to various kinds of terrible misspecification problems that seem not much easier to deal with than the case where you just try to learn a single harm predicate.
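As a toy illustration of why the fully-covering H is useless (purely illustrative predicates and action strings, not a proposal):

```python
# Act only if no harm predicate in the covering set H flags the action.
def conservatively_allowed(action, H):
    return not any(is_harmful(action) for is_harmful in H)

# A reasonably-scoped H still lets benign actions through:
H_reasonable = [lambda a: "delete all files" in a, lambda a: "exfiltrate" in a]
print(conservatively_allowed("summarize the report", H_reasonable))  # True

# But "all possible harm predicates" in particular contains a predicate that
# flags any given action, so nothing is ever allowed:
H_everything = H_reasonable + [lambda a: True]  # stand-in for the full covering set
print(conservatively_allowed("summarize the report", H_everything))  # False: safe but useless
```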
Solving this task (that is, solving the task spec of learning this harm predicate posterior) via probabilistic inference also seems really hard from the “Bayesian ML is hard” perspective.
Ironically, the state of the art for this when I left CHAI in 2021 was “ask the most capable model (an LLM) to estimate the uncertainty for you in one go” and “ensemble the point estimates of very few but quite capable models” (that is, ensembling, but with member counts in the single-digit range, e.g. 4). These seemed to outperform even the “learn a generative/contrastive model to get features, and then learn a Bayesian logistic regression on top of it” approaches. (Anyone who’s currently at CHAI should feel free to correct me here.)
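For concreteness, “ensemble the point estimates of very few but quite capable models” amounts to something like the following sketch (the toy classifier is a stand-in for an independently finetuned capable model; nothing here is CHAI’s actual code):

```python
import numpy as np

class ToyClassifier:
    """Stand-in for one independently finetuned capable model (illustrative only)."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def predict_proba(self, x):
        # Pretend each member returns a slightly different point estimate of P(harmful | x).
        return float(np.clip(0.3 + 0.1 * self.rng.normal(), 0.0, 1.0))

# "Very few but quite capable" members, e.g. 4.
models = [ToyClassifier(seed=s) for s in range(4)]

def harm_estimate(x):
    probs = np.array([m.predict_proba(x) for m in models])
    # The ensemble mean is the prediction; disagreement across members is the
    # (cheap, approximate) uncertainty estimate.
    return probs.mean(), probs.std()

mean_p, spread = harm_estimate("some action description")
print(mean_p, spread)
```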
I think? that the way Bengio’s approach tries to avoid these difficulties is by trying to solve Bayesian ML. I’m not super confident that he’ll do better than “finetune an LLM to help you do it”, which is presumably what we’d be doing anyways?
(That being said, my main objections are akin to the ontology misidentification problem in the ELK paper or Ryan’s comments above.)