Curated. The authors write:
“We believe that there are many promising and unexplored approaches to this problem, and there isn’t yet much reason to believe we are stuck or are faced with an insurmountable obstacle.”
If it’s true both that this is a core alignment problem and that we’re not stuck on it, then that’s fantastic. I am not an alignment researcher and don’t feel qualified to comment on quite how promising this work seems, but I find the report both accessible and compelling. I recommend it to anyone curious about where part of the leading edge of alignment research is.
Also, I find a striking resemblance to MIRI’s proposed visible thoughts project. They appear to be getting at the same thing, though via quite different models (i.e. Bayes nets vs. language models). It’d be amazing if both projects flourished and the understanding from each could be combined.
In terms of the relationship to MIRI’s visible thoughts project, I’d say the main difference is that ARC is attempting to solve ELK in the worst case (where the way the AI understands the world could be arbitrarily alien from and more sophisticated than the way the human understands the world), whereas the visible thoughts project is attempting to encourage a way of developing AI that makes ELK easier to solve (by encouraging the way the AI thinks to resemble the way humans think). My understanding is MIRI is quite skeptical that a solution to worst-case ELK is possible, which is why they’re aiming to do something more like “make it more likely that conditions are such that ELK-like problems can be solved in practice.”
Thanks! That’s illuminating.
Thanks Ruby! I’m really glad you found the report accessible.
One clarification: Bayes nets aren’t important to ARC’s conception of the problem of ELK or its solution, so I don’t think it makes sense to contrast ARC’s approach against an approach focused on language models or describe it as seeking a solution via Bayes nets.
The form of a solution to ELK will still involve training a machine learning model (which will certainly understand language and could just be a language model) using some loss function. The idea that this model could learn to represent its understanding of the world in the form of inference on some Bayes net is one of a few simple test cases that ARC uses to check whether the loss functions they’re designing will always incentivize honestly answering straightforward questions.
For example, another simple test case (not included in the report) is that the model could learn to represent its understanding of the world in a bunch of “sentences” that it performs logical operations on to transform into other sentences.
These test cases are settings for counterexamples, but not crucial to proposed solutions. The idea is that if training on your loss function will always produce a model that answers straightforward questions honestly, it should work in particular for these simplified cases that are easy to think about.
Thanks for the clarification, Ajeya! Sorry to make you explain that; it was a mistake to imply that ARC’s conception is specifically anchored on Bayes nets. The report was quite clear that it isn’t.