Thanks for your reply. Popper-falsifiable does not mean experiment-based in my book. Math is falsifiable—you can present a counterexample, an error in reasoning, a paradoxical result, etc. Similarly, in history you can often falsify claims by presenting evidence against them. But you cannot falsify a field where every definition is hand-waved and nothing is specified in detail. I agree that AI Alignment has pre-paradigmatic features as far as Kuhn goes. But Kuhn also says that pre-paradigmatic science is rarely rigorous or true, even though it might produce some results that lead to something interesting in the future.
“Every definition is hand-waved and nothing is specified in detail” is an unfair caricature.
Yeah, but also this is the sort of response that goes better with citations.
Like, people used to make a somewhat hand-wavy argument that AIs trained on goal X might become consequentialists which pursued goal Y, and gave the analogy of the time when humans ‘woke up’ inside of evolution, and now are optimizing for goals different from evolution’s goals, despite having ‘perfect training’ in some sense (and the ability to notice the existence of evolution, and its goals). Then eventually someone wrote Risks from Learned Optimization in Advanced Machine Learning Systems in a way that I think involves substantially less hand-waving and substantially more specification in detail.
Of course there are still parts that remain to be specified in detail—either because no one has written it up yet (Risks from Learned Optimization came from, in part, someone relatively new to the field saying “I don’t think this hand-wavy argument checks out”, looking into it a bunch, being convinced, and then writing it up in detail), or because we don’t know what we’re looking for yet. (We have a somewhat formal definition of ‘corrigibility’, but is it the thing that we actually want in our AI designs? It’s not yet clear.)
In terms of trying to formulate rigorous and consistent definitions, a major goal of the Causal Incentives Working Group is to analyse features of different problems using consistent definitions and a shared framework. In particular, our paper “Path-specific Objectives for Safer Agent Incentives” (AAAI-2022) will go online in about a month, and should serve to organize a handful of papers in AIS.
Thanks, this looks very good.