I’m interested in doing in-depth dialogues to find cruxes. Message me if you’d like to try this.
I do alignment research, mostly in the vicinity of agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek’s team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.
Nice, you’ve expressed the generalization argument for expecting goal-directedness really well. Most of the post seems to match my beliefs.
I’d like you to clarify what this means, and to get at some of the latent variables behind it.
One interpretation is that any specific high-stakes attempt to subvert control measures is 50-70% likely to fail, but that if we kept running approximately the same setup afterwards, some attempt would soon succeed with high probability.
The way I think about black-box control, the success condition is “getting an alignment solution that we trust”. So another interpretation is that you’re saying there’s a 30-50% chance of this (maybe conditioned on no low-stakes sabotage)? If so, this implies a guess about the scale of effort required; could you describe that scale? If you’re imagining a Manhattan-project-scale effort with lots of exploration and freedom of experimentation, that’s a very different level of optimism than if you’re imagining something on the scale of a single PhD project.
Why not higher? I don’t see where the inside-view uncertainty is coming from. Is it uncertainty over how future training will be done? Or uncertainty over the shape of mind-space-in-general, where accidental implicit biases related to e.g. {“conservativeness”, “trying-hard”, “follow heuristics instead of supergoal reasoning”} might make instrumental reward-seeking unlikely by default?