Davidad’s Provably Safe AI Architecture—ARIA’s Programme Thesis
The programme thesis of Davidad’s agenda to develop provably safe AI has just been published. For context, Davidad is a Programme Director at ARIA who will allocate somewhere between £10M and £50M in grants over the next 3 years to pursue his research agenda.
It is the most comprehensive public document detailing his agenda to date.
Here’s the most self-contained diagram explaining the agenda at a high level, though you’ll have to dive into the details and read the document several times to start grasping its many dimensions.
I’m personally very excited by Davidad’s moonshot, which I currently see as the most credible alternative to scaled transformers. I consider scaled transformers too flawed to be a credible safe path, mostly because:
Ambitious LLM interpretability seems very unlikely to work out:
Why: attempts over the past few years have failed to make meaningful progress, and reverse-engineering efforts systematically hit a wall, with roughly 80% of what’s going on remaining unexplained.
Adversarial robustness to jailbreaks seems unlikely to work out:
Why: attempts to solve it have failed, a theoretical paper from early 2023 (which I can’t find right now) gives reasons for pessimism, and increasingly large context windows expand the attack surface.
Safe generalization with very high confidence seems quite unlikely to work out:
Why: there is no adequate theory of transformers, and interpretability remains weak.
A key motivation for pursuing moonshots à la Davidad is, as he explains in his thesis, to shift incentives away from the current race to the bottom by de-risking credible paths to AI systems whose safety we have strong reasons to be confident in. See the diagram below: