Tech tree for worst-case/HRAD alignment
Here’s a diagram of what it would take to solve alignment in the hardest worlds, where something like MIRI’s HRAD agenda is needed. I made this months ago with Thomas Larsen and never got around to posting it (mostly because under my worldview it’s pretty unlikely that we can, or need to, do this), and it probably won’t become a longform at this point. I have not thought about this enough to be highly confident in anything.
This flowchart operates under the hypothesis that LLMs have some underlying, mysterious algorithms and data structures that confer intelligence, and that we could in principle apply these to agents constructed by hand, though this would be extremely tedious. There are therefore basically three phases: understanding what a HRAD agent would do in theory, reverse-engineering language models, and combining these two directions. The final agent would be a mix of hardcoded components and ML, depending on what is feasible to hardcode and how well we can train ML systems whose robustness and conformance to a spec we are highly confident in.
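To make the "mix of hardcoded components and ML" concrete, here is a minimal sketch, in my own framing rather than anything from the diagram, of one way such a hybrid could be structured: a learned component is only used behind a spec check we trust, with a hardcoded fallback otherwise. Every name here (SpecCheckedML, the toy spec, etc.) is invented for illustration.

```python
# A minimal sketch (my framing, not from the post) of a hybrid agent component:
# parts we can fully specify are hand-written, and ML components are only used
# behind a checkable spec, with a conservative hardcoded fallback.

from typing import Callable

class SpecCheckedML:
    """Wraps an ML model so its output is only used if it satisfies a spec
    we are highly confident in; otherwise fall back to a hardcoded policy."""

    def __init__(self, model: Callable, spec: Callable[[object, object], bool],
                 fallback: Callable):
        self.model = model
        self.spec = spec          # predicate: does (input, output) conform to the spec?
        self.fallback = fallback  # hardcoded, fully understood behaviour

    def __call__(self, x):
        y = self.model(x)
        return y if self.spec(x, y) else self.fallback(x)

# Hypothetical usage: a learned heuristic proposes actions, and a hardcoded
# spec rejects anything outside a whitelisted action set.
propose = SpecCheckedML(
    model=lambda obs: "experimental_action",
    spec=lambda obs, act: act in {"noop", "safe_action"},
    fallback=lambda obs: "noop",
)
print(propose("some observation"))  # spec check fails, so this falls back to "noop"
```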
Theory of abstractions: also called multi-level models. A mathematical framework for a world-model that contains nodes at different levels of abstraction, such that one can represent concepts like "diamond" and "atom" while maintaining consistency between the levels and remaining robust to ontology shifts.
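As a toy illustration of what "consistency between levels" and "robust to ontology shifts" could mean, here is a sketch of a two-level world model. This is my own example, not part of any existing theory of abstractions, and all the names in it are made up.

```python
# Toy two-level world model: a low-level "atom" layer and a high-level
# "diamond" layer, with an explicit abstraction map and a consistency check
# between the levels. Purely illustrative.

from dataclasses import dataclass

@dataclass
class AtomState:
    """Low-level ontology: positions of carbon atoms in a lattice."""
    carbon_positions: frozenset

@dataclass
class HighLevelState:
    """High-level ontology: coarse predicates like 'a diamond is present'."""
    diamond_present: bool

def abstract_to_diamond(atoms: AtomState) -> HighLevelState:
    """Abstraction map: the high-level node is a function of the low level.
    Crudely: enough carbon atoms arranged in a lattice => diamond."""
    return HighLevelState(diamond_present=len(atoms.carbon_positions) >= 8)

def consistent(atoms: AtomState, high: HighLevelState) -> bool:
    """Consistency between levels: the high-level belief must match what the
    abstraction map says about the low-level state."""
    return abstract_to_diamond(atoms) == high

# Ontology shift: if the low-level representation changes (say, to a quantum
# description), only abstract_to_diamond needs rewriting; the high-level node
# "diamond_present", and everything downstream of it, is preserved.
lattice = AtomState(frozenset((x, y, z) for x in range(2) for y in range(2) for z in range(2)))
belief = HighLevelState(diamond_present=True)
assert consistent(lattice, belief)
```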
WM inference = inference on a world-model for an embedded agent. This may well run in something like doubly exponential time; that's fine so long as it's computable.
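For intuition on where a double exponential can come from, here is a toy example of my own (not from the post): if world-model hypotheses are arbitrary Boolean functions over n observation bits, there are 2^(2^n) of them, so exhaustively filtering hypotheses against observations is doubly exponential in n.

```python
# Toy illustration of why naive world-model inference blows up: enumerating
# every Boolean-function hypothesis over n observation bits and keeping the
# ones consistent with the data takes time on the order of 2^(2^n).

from itertools import product

def all_hypotheses(n: int):
    """Every Boolean function over n input bits, encoded as a truth table."""
    inputs = list(product([0, 1], repeat=n))           # 2^n possible observations
    for table in product([0, 1], repeat=len(inputs)):  # 2^(2^n) truth tables
        yield dict(zip(inputs, table))

def consistent_hypotheses(n: int, observations):
    """Keep only hypotheses that agree with every (input, output) observation."""
    return [h for h in all_hypotheses(n) if all(h[x] == y for x, y in observations)]

# n = 2 already gives 2^(2^2) = 16 hypotheses; n = 5 gives about 4.3 billion.
survivors = consistent_hypotheses(2, [((0, 0), 0), ((1, 1), 1)])
print(len(survivors))  # 4 hypotheses remain consistent with the two observations
```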
Difference between my model and this flowchart: I'm hoping that the top branches are actually downstream of LLM reverse-engineering. LLMs already do abstract reasoning, so if you can reverse-engineer LLMs, maybe that lets you understand how abstract reasoning works much faster than deriving it yourself.