“Every definition is hand-waved and nothing is specified in detail” is an unfair caricature.
Yeah, but also this is the sort of response that goes better with citations.
Like, people used to make a somewhat hand-wavy argument that AIs trained on goal X might become consequentialists that pursue goal Y, and gave the analogy of the time when humans ‘woke up’ inside of evolution, and now are optimizing for goals different from evolution’s goals, despite having ‘perfect training’ in some sense (and the ability to notice the existence of evolution, and its goals). Then eventually someone wrote Risks from Learned Optimization in Advanced Machine Learning Systems, which I think involves substantially less hand-waving and substantially more specification in detail.
Of course there are still parts that remain to be specified in detail, either because no one has written it up yet (Risks from Learned Optimization came from, in part, someone relatively new to the field saying “I don’t think this hand-wavy argument checks out”, looking into it a bunch, being convinced, and then writing it up in detail), or because we don’t know what we’re looking for yet. (We have a somewhat formal definition of ‘corrigibility’, but is it the thing that we actually want in our AI designs? It’s not yet clear.)