Nice! Purely for my own ease of comprehension I’d have liked a little more translation/analogizing between AI jargon and HCI jargon—e.g. the phrase “active learning” doesn’t appear in the post.
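(For anyone else bridging the two vocabularies: "active learning" is roughly the ML term for what HCI folks would call human-in-the-loop querying: the system acts on its own where it is confident and asks a human on the cases it is least sure about, then learns from those answers. A minimal sketch of that idea, with every name, function, and threshold made up for illustration rather than taken from the post:)

```python
# Minimal sketch of an uncertainty-based active-learning / human-in-the-loop loop.
# All names here (HumanInTheLoopPolicy, toy_predict, toy_human, the 0.9 threshold)
# are hypothetical illustrations, not anything from the post.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class HumanInTheLoopPolicy:
    predict: Callable[[str], Tuple[str, float]]   # returns (proposed_action, confidence)
    ask_human: Callable[[str], str]               # the human oracle's judgment on a case
    confidence_threshold: float = 0.9
    feedback_log: List[Tuple[str, str]] = field(default_factory=list)

    def decide(self, situation: str) -> str:
        action, confidence = self.predict(situation)
        if confidence >= self.confidence_threshold:
            # Familiar, in-distribution case: act on the learned model of human judgment.
            return action
        # Novel or uncertain case: consult the human, and log the answer so the model
        # can later be updated on exactly the cases it was unsure about.
        human_action = self.ask_human(situation)
        self.feedback_log.append((situation, human_action))
        return human_action

if __name__ == "__main__":
    # Toy stand-ins: a "model" that is only confident about one familiar case.
    def toy_predict(situation: str) -> Tuple[str, float]:
        return ("share the resource", 0.95) if situation == "familiar case" else ("unsure", 0.3)

    def toy_human(situation: str) -> str:
        return "ask for consent first"

    policy = HumanInTheLoopPolicy(predict=toy_predict, ask_human=toy_human)
    print(policy.decide("familiar case"))   # acts autonomously
    print(policy.decide("novel dilemma"))   # defers to the human and records the query
```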
Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good at predicting what aligned behavior should be in out-of-distribution scenarios, but it’s unlikely that AI will be able to figure out what humans want in completely new situations without humans being consulted and kept in the loop.
I disagree in several ways.
Treating humans as definitionally accurate is a convenient assumption on easy problems, but it breaks down on hard ones. Human responses to questions are not always best thought of as direct access to some underlying truth: we give different answers in different contexts, and those answers have to be interpreted, sometimes in quite sophisticated ways, to turn them into good choices between actions in the real world. There are even cases where humans will give one answer at the object level but, when asked a meta-level question about that object-level answer, will disagree with themselves (perhaps even endorsing some non-human process, AI included). If humans were always accurate, that would be a paradox.
AI is going to get smart. Eventually, quite smart. Smart enough to predict human behavior in new contexts kind of smart. On the one hand this is good news because it means that if we can reduce moral questions to empirical questions about the behavior of humans in novel contexts (and I mean do it in a philosophically satisfying way, not just try something that sounds good and hope it works), we’re almost done. On the other hand this is bad news because it means that AI ignorance about human behavior cannot be used to ensure properties like corrigibility, and predictions of future AI-human interaction based on assumptions of AI ignorance have to be abandoned.