Going through your arguments and links, I find that there is one single desideratum for AI safety: steer progress toward an iterative AI development process. Anything else is open-loop and is guaranteed to miss something crucially important. You can prove some no-go theorems, solve the Löbian obstacle, calculate some safety guarantees given certain assumptions, and then reality will still do something unexpected, as it always does.
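To make the open-loop/closed-loop contrast concrete, here is a toy Python sketch. It only illustrates the feedback-loop structure being argued for, not anyone's actual proposal; `scale_capability`, `evaluate`, `revise`, and `final_check` are hypothetical placeholders.

```python
# Toy sketch of the open-loop vs. closed-loop contrast (hypothetical placeholders,
# not anyone's actual proposal). Open-loop: commit to a plan, scale once, find out
# at the end whether it worked. Closed-loop: scale in small increments, gathering
# evidence after each step and revising the plan whenever reality disagrees.

def open_loop(plan, scale_capability, final_check):
    """One shot: all safety reasoning happens before any feedback arrives."""
    system = scale_capability(plan, steps=100)
    return final_check(system)  # if the plan missed something, it is too late here

def closed_loop(plan, scale_capability, evaluate, revise):
    """Iterate: each capability step is followed by evaluation and plan revision."""
    system = None
    for _ in range(100):
        system = scale_capability(plan, steps=1, start_from=system)
        evidence = evaluate(system)           # empirical feedback at low capability
        if not evidence.matches_prediction:   # reality did something unexpected
            plan = revise(plan, evidence)     # update the theory before scaling further
    return system
```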
Basically, my conclusion is the opposite of yours: given that in the open-loop AGI possible worlds our odds of affecting the outcome are pretty dire, while in the closed-loop worlds they are pretty good, the most promising course of action is to focus on shifting the probability of possible worlds from open-loop to closed-loop. Were I at MIRI, I’d “halt and catch fire” given the recent rather unexpected GPT progress and the wealth of non-lethal data it provided, and work on a strategy to enable closing the loop early and often. The current MIRI dogma I would burn is that “we must get it right the first time or else”, and replace it with “we must figure out a way to iterate, or else”. I guess that is what Paul Christiano has been arguing for, and the way AI research labs work… so probably nothing new here.
The idea that AI Alignment can mostly be solved by iteration is perhaps an unstated crux amongst the AI safety community, since it bears on so many strategy questions.
I prefer theory rooted in solid empirical data[1].
I’m sympathetic to iterative cycles, but the empirical data should serve to let us refine and enhance our theories, not to displace them.
Empirical data does not exist in the absence of theory; observations only convey information after interpretation through particular theories.
The power of formal guarantees to:
* Apply even as the system scales up
* Generalise far out of distribution
* Confer very high “all things considered” confidence
* Transfer to derivative systems
* Apply even under adversarial optimisation

These remain desiderata that arguments for the existential safety of powerful AI systems need to satisfy.
I’m very interested in mechanistic interpretability as a testing ground (see the toy sketch after this list) for:
* Selection Theorems
* Natural Abstractions
* Shard Theory
* Other theories about neural networks
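As one concrete illustration of what “testing ground” could mean here, below is a minimal, hypothetical sketch, not an established protocol from any of these agendas. A tiny network is trained on a synthetic task that factors through a one-dimensional summary variable, and a linear probe checks whether that summary is linearly recoverable from the hidden layer, roughly the kind of prediction a Natural Abstractions-flavoured theory might make. The specific architecture, data, and probe are all assumptions chosen for brevity.

```python
# Minimal sketch (toy example, not an established experimental protocol): train a
# small network on a task whose label depends on the input only through a 1-D
# summary variable, then use a linear probe on the hidden layer to test whether
# that summary is linearly represented.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: the label depends on x only through the summary s(x) = sum of inputs.
x = torch.randn(4096, 8)
s = x.sum(dim=1, keepdim=True)   # the candidate "abstraction"
y = torch.tanh(s)                # task output is a function of s alone

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

# Extract hidden activations and fit a linear probe for s via least squares.
with torch.no_grad():
    hidden = model[1](model[0](x))                 # post-ReLU activations
    design = torch.cat([hidden, torch.ones(hidden.shape[0], 1)], dim=1)
    coef = torch.linalg.lstsq(design, s).solution
    pred = design @ coef
    r2 = 1 - ((s - pred) ** 2).sum() / ((s - s.mean()) ** 2).sum()

print(f"task loss: {loss.item():.4f}, probe R^2 for the summary variable: {r2.item():.3f}")
# A high R^2 is weak evidence for the predicted representation; a low one is exactly
# the kind of empirical result that should feed back into refining the theory.
```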
Oh, I agree 100% that “the empirical data should serve to let us refine and enhance our theories, not to displace them”. That is how science works in general. My beef is with focusing mostly on theory because “we only have one shot”. My point is “if you think you only have one shot, figure out how to get more shots”.
I don’t think we only have one shot in the mainline (I expect slow takeoff). I think theory is especially valuable if we only have one (or a few) shots.
I should edit the OP to make that clear.