In fact, the question itself seems superficially similar to the halting problem, where “running off the rails” is the analogue for “halting”
If you want to draw an analogy to halting, then what that analogy actually says is: There are lots of programs that provably halt, and lots that provably don’t halt, and lots that aren’t provable either way. The impossibility of the halting problem is irrelevant, because we don’t need a fully general classifier that works for every possible program. We only need to find a single program that provably has behavior X (for some well-chosen value of X).
If you’re postulating that there are some possible friendly behaviors, and some possible programs with those behaviors, but that they’re all in the unprovable category, then you’re postulating that friendliness is dissimilar to the halting problem in that respect.
Moreover, the halting problem doesn’t show that the set of programs you can’t decide halting for is in any way interesting.
It’s a constructive proof, yes, but it constructs a peculiarly twisted program that embeds its own proof-checker. That might be relevant for AGI, but for almost every program in existence we have no idea which group it’s in, and would likely guess it’s provable.
It’s still probably premature to guess whether friendliness is provable when we don’t have any idea what it is. My worry is not that it wouldn’t be possible or provable, but that it might not be a meaningful term at all.
But I also suspect friendliness, if it does mean anything, is in general going to be so complex that “only [needing] to find a single program that provably has behaviour X” may be beyond us. There are lots of mathematical conjectures we can’t prove, even without invoking the halting problem.
One terrible trap might be the temptation to make simplifications in the model to make the problem provable, but end up proving the wrong thing. Maybe you can prove that a set of friendliness criteria are stable under self-modification, but I don’t see any way to prove those starting criteria don’t have terrible unintended consequences. Those are contingent on too many real-world circumstances and unknown unknowns. How do you even model that?
If you want to draw an analogy to halting, then what that analogy actually says is: There are lots of programs that provably halt, and lots that provably don’t halt, and lots that aren’t provable either way. The impossibility of the halting problem is irrelevant, because we don’t need a fully general classifier that works for every possible program. We only need to find a single program that provably has behavior X (for some well-chosen value of X).
If you’re postulating that there are some possible friendly behaviors, and some possible programs with those behaviors, but that they’re all in the unprovable category, then you’re postulating that friendliness is dissimilar to the halting problem in that respect.
Moreover, the halting problem doesn’t show that the set of programs you can’t decide halting for is in any way interesting.
It’s a constructive proof, yes, but it constructs a peculiarly twisted program that embeds its own proof-checker. That might be relevant for AGI, but for almost every program in existence we have no idea which group it’s in, and would likely guess it’s provable.
It’s still probably premature to guess whether friendliness is provable when we don’t have any idea what it is. My worry is not that it wouldn’t be possible or provable, but that it might not be a meaningful term at all.
But I also suspect friendliness, if it does mean anything, is in general going to be so complex that “only [needing] to find a single program that provably has behaviour X” may be beyond us. There are lots of mathematical conjectures we can’t prove, even without invoking the halting problem.
One terrible trap might be the temptation to make simplifications in the model to make the problem provable, but end up proving the wrong thing. Maybe you can prove that a set of friendliness criteria are stable under self-modification, but I don’t see any way to prove those starting criteria don’t have terrible unintended consequences. Those are contingent on too many real-world circumstances and unknown unknowns. How do you even model that?