Yes on point 1, and partly on point 2.
If humans don’t have complete models of how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can produce weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans and still cause problems for other reasons. For example, a human might not know the difference between normal water and heavy water, while the AI decides that since there are two forms, both should be present in equal amounts. That would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently and was entirely unaware of what they did wrong. And in this case it will be months before the effects of the unexpectedly different water show up, so human-in-the-loop RL or other methods might not catch it.
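To make the water example concrete, here is a toy sketch in Python. Everything in it is hypothetical and invented for illustration (the `Glass` type, `stated_goal`, `ai_choice`, and the idea that the AI prefers an equal mix); it is not a model of any real system, just a picture of how a goal that never mentions isotopes can be fully satisfied by a bad outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glass:
    volume_cups: float
    heavy_fraction: float  # share of heavy water in the glass, 0.0 to 1.0


def stated_goal(glass: Glass) -> bool:
    # What the human literally asked for: one cup of "H2O".
    # The request never mentions isotopes, so heavy_fraction is ignored.
    return glass.volume_cups == 1.0


def ai_choice(candidates: list[Glass]) -> Glass:
    # Assumed AI preference (hypothetical): among glasses that satisfy the
    # stated goal, pick the one closest to an equal mix of the two forms.
    valid = [g for g in candidates if stated_goal(g)]
    return min(valid, key=lambda g: abs(g.heavy_fraction - 0.5))


# Ordinary water is approximated here as heavy_fraction 0.0 for simplicity.
candidates = [Glass(1.0, 0.0), Glass(1.0, 0.5), Glass(0.5, 0.0)]
chosen = ai_choice(candidates)
print(chosen, stated_goal(chosen))
```

The chosen glass passes the only check the human knew to write, so any feedback based on that check sees nothing wrong.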