Pedantic point. You say “Automating AI safety means developing some algorithm which takes in data and outputs safe, highly-capable AI systems.” I do not think semi-automated interpretability fits into this, as the output of interpretability (currently) is not a model but an explanation of existing models.
Unclear why Level (1) does not break down into the 'empirical' vs 'human checking' distinction. In particular, how would this belief be obtained: "The humans are confident the details provided by the AI systems don't compromise the safety of the algorithm."
Unclear (though there is a good chance I just need to think more carefully through the concepts) why Level (3) does not collapse to Level (1) too, using the same reasoning. Might be related to Martin's alternative framing.