It seems a bit arrogant to just say “what I’ve been working on,” but on the other hand, the things I’ve been working on have obviously often been my best ideas!
Right now I’m still thinking about how to allow for value specification in hierarchical models. There are two flanks to this problem: the problem of alien concepts and the problem of human underdetermination.
The problem of alien concepts is relatively well-understood: we want the AI to generalize in a human-like way, which runs into trouble if there are “alien concepts” that predict the training data well but are unsafe to try to maximize. Solving this problem looks like skillful modeling of an environment that includes humans, progress in interpretability, and better learning from human feedback.
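As a toy illustration of the worry (invented features and numbers, not anything from a real training run), here’s a sketch in which an “alien” feature fits the training labels exactly as well as the intended one, so nothing in the training data distinguishes them, and the two only come apart off-distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: the intended ("human") feature and a spurious
# ("alien") feature happen to be perfectly correlated on-distribution.
n = 1000
human_feature = rng.integers(0, 2, size=n)
alien_feature = human_feature.copy()
labels = human_feature                         # labels track the human concept

acc_human = np.mean(human_feature == labels)   # 1.0
acc_alien = np.mean(alien_feature == labels)   # also 1.0: indistinguishable in training
print(acc_human, acc_alien)

# Off-distribution the correlation breaks, and a system that latched onto the
# alien concept no longer tracks what we wanted it to track.
human_test = rng.integers(0, 2, size=n)
alien_test = rng.integers(0, 2, size=n)
print(np.mean(alien_test == human_test))       # ~0.5
```

The training data alone can’t break the tie; that’s why the fixes above lean on things outside the data, like interpretability and richer feedback.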
The problem of human underdetermination is a bit less appreciated: human behavior underdetermines a utility function, in the sense that you could fit many utility functions to human behavior, all equally well. There’s simultaneously the problem that human behavior is inconsistent with intuitive desiderata. Solving this problem looks like finding ways to model humans that strike a decent balance between our incompatible desiderata, or ways to encode and insert our desiderata to avoid “no free lunch” problems in general models of environments that contain humans. Whereas a lot of good progress has been made on the problem of alien concepts using fairly normal ML methods, I think the problem of human underdetermination requires a combination of philosophy, mathematical foundations, and empirical ML research.
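To make the underdetermination point concrete, here’s a minimal sketch (invented numbers, using the standard Boltzmann-rationality model of choice) in which scaling the utilities up while making the agent correspondingly noisier produces exactly the same choice behavior, so behavior alone can’t separate “what the agent values” from “how rationally it pursues it”:

```python
import numpy as np

def boltzmann_policy(utilities, beta):
    """Choice probabilities of a Boltzmann-rational agent over discrete options."""
    logits = beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()                 # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Two rival "explanations" of the same agent over three options:
# mild preferences + fairly rational, or strong preferences + noisy.
u_mild = np.array([1.0, 0.5, 0.0])
u_strong = 10.0 * u_mild

p1 = boltzmann_policy(u_mild, beta=4.0)
p2 = boltzmann_policy(u_strong, beta=0.4)

print(p1)                      # identical choice probabilities...
print(np.allclose(p1, p2))     # ...so behavior alone can't tell the two apart
```

This is exactly the sort of degeneracy that extra assumptions, or explicitly inserted desiderata, have to break.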