Optimal governance interventions depend on progress in technical AI safety. For example, two rough technical safety milestones are "have metrics to determine how scary a model’s capabilities are" and "have a tool to determine whether a model is trying to deceive you." Our governance plans should adapt based on whether these milestones have been achieved, or on when it seems they will be achieved. For less binary milestones, plans should adapt based on partial progress.
What are more possible milestones in technical AI safety (that might be relevant to governance interventions)?
These aren’t necessarily milestones so much as capabilities that can come on a sliding scale, but:
Tools to accelerate alignment research (which are also tools to accelerate AGI research)
AI assistants for conceptual research
Novel modes of AI-enabled writing
AI assistants for interpretability or AI design
Value learning schemes at various stages of both conceptual and technological development
Low conceptual, high technological: people think it has a lot of holes, but it works well with SOTA AI designs and good tools have been developed to handle the human-interaction parts of the value learning.
High conceptual, low technological: Pretty much everyone is, if not excited by it, not actively worried about it, but it would require developing entirely new infrastructure to use.
That said, I’m not sure how much governance plans should adapt based on milestones. Maybe we should expect governance to be slow to respond, and therefore to require plans that are good when applied broadly and without much awareness of context.
https://intelligence.org/2017/10/13/fire-alarm/
How is that relevant? It’s about whether AI risk will be mainstream. I’m thinking about governance interventions by this community, which doesn’t require the rest of the world to appreciate AI risk.
I assumed, evidently incorrectly, that the point was to prompt government planners and policymakers with clear ideas now, and say that they will be relevant once X happens—and I don’t think that there is an X such that they will be convinced, short of actual catastrophe.
It now sounds like you are looking to do conditional planning for future governance interventions. I’m not sure that makes sense. It seems pretty clear that groundwork and planning on governance split between near-term / fast takeoff and later / slow takeoff, and we’ve been getting clear indications that we’re nearer to the former than the latter. But we aren’t going to develop the interventions materially differently based on specific metrics, since the worlds where almost any of the interventions are effective are not going to be sensitive to that level of detail.
Interesting, thanks.
(I agree in part, but (1) planning for far/slow worlds is still useful and (2) I meant more that metrics or model evaluations would be part of an intervention, e.g. incorporated into safety standards, rather than that metrics would inform what we try to do.)