I think that the training goal of "the AI never makes a catastrophic decision" is unrealistic, unachievable, and unnecessary. I don't think this is a natural shape for values to take. Consider a highly altruistic man with anger problems that are strongly triggered by, e.g., a specific vacation home: if he is present with his wife at this home, he beats her. As long as he starts off away from the home and knows about his anger problems, he will be motivated to resolve them, or at least to avoid the triggering contexts and take other precautions to ensure her safety.
I think there has never existed an agent which makes endorsed choices across all decision-making contexts. Probably even Gandhi would murder someone in some adversarially selected context (even barring extremely optimized adversarial inputs). I don't think it's feasible to train a mind with robustly adequate decision-making, and I also don't think we need to do so in order to get a very aligned AI.
(None of this is locally arguing that adversarial training is bad as part of a training rationale for how we get an aligned agent; I'm only saying that I don't see promise in "robustly adequate decision-making" as the desired shape of AI cognition.)