We need to be able both to tell what decision we want and to identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. through adversarial training), or do it after the fact by training online, provided there are no catastrophes; either way, it seems we need to get those inputs somehow. If catastrophes are possible and we can’t do adversarial training, then even if we can tell which decision we want in any given case, we can still die the first time we encounter an input on which the system behaves catastrophically. (Or, more realistically, during the first wave in which our systems are all exposed to such catastrophe-inducing inputs simultaneously.)