I have been watching this video https://www.youtube.com/watch?v=EUjc1WuyPT8 on AI alignment (something I’m very behind on, my apologies) and it occurred to me that one aspect of the problem is finding a concrete, formalized solution to Goodhart’s law-style problems. For example, Yudkowsky was talking about ways that an AGI optimized toward making smiles could go wrong (namely, the AGI could find smarter and smarter ways to effectively give everyone heroin to quickly create lasting smiles). One aspect of this problem is that the metric “smiles” is a measurement of the ambiguous target “wellbeing,” so when the AGI gives us heroin to make us smile we say “well no, that isn’t what we meant when we said wellbeing.” So we’re trying to find a way to formally write an algorithm that pursues what we actually mean by wellbeing in a lasting and durable way, rather than an algorithm that gets caught optimizing metrics that only measured wellbeing well before they were optimized so hard. I get that the problem of AI alignment has more facets than just this, but it seems like finding an effective way to tell an AI what wellbeing is, rather than telling it things that are usually just metrics of wellbeing (like smiles), is one facet.
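To make the proxy-versus-target point concrete, here is a minimal toy sketch (my own illustration, not something from the video): an optimizer that scores actions by the proxy “smiles” picks a different action than one that scores them by actual wellbeing, because the action that maximizes the proxy is exactly the one that breaks the usual correlation between proxy and target. The action names and numbers are invented for illustration.

```python
# Toy illustration of the Goodhart failure described above (my own sketch).
# "smiles" is a proxy that tracks "wellbeing" under ordinary actions, but an
# optimizer that maximizes the proxy directly picks the action that decouples them.

# Hypothetical actions mapped to (smiles produced, true wellbeing produced).
actions = {
    "bake a cake":       (2.0, 2.0),   # proxy and target move together
    "tell a good joke":  (3.0, 3.0),   # proxy and target move together
    "administer heroin": (9.0, -5.0),  # proxy is maximized, target collapses
}

def proxy_score(effect):
    smiles, _wellbeing = effect
    return smiles

def true_score(effect):
    _smiles, wellbeing = effect
    return wellbeing

best_for_proxy = max(actions, key=lambda a: proxy_score(actions[a]))
best_for_target = max(actions, key=lambda a: true_score(actions[a]))

print("optimizing the metric picks:", best_for_proxy)   # administer heroin
print("optimizing the target picks:", best_for_target)  # tell a good joke
```

Obviously the real problem is that we don’t know how to write down `true_score` for anything as ambiguous as wellbeing, which is exactly the facet I’m asking about.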
Is this in fact a part of the AI alignment problem, and if so, is anyone trying to solve this facet of it, and where might I go to read more about that? I’ve been sort of interested in meta-ethics for a while, and solving this facet of the problem seems closely related to solving important problems in meta-ethics.
Is this in fact a part of the AI alignment problem, and if so, is anyone trying to solve this facet of it, and where might I go to read more about that?
Yes, it’s part of some approaches to the AI alignment problem. It used to be considered more central to AI alignment, until people started thinking it might be too hard and began working on other approaches that perhaps don’t require “finding an effective way to tell an AI what wellbeing is”. See AI Safety “Success Stories”, where “Sovereign Singleton” requires solving this and the others don’t (at least not right away). See also Friendly AI and Coherent Extrapolated Volition.