Re: dual use, I do have some thoughts on exactly what sort of capabilities would potentially come out of this.
The really interesting possibility is that we end up able to precisely specify high-level human concepts—a real-life language of the birds. The specifications would correctly capture what-we-actually-mean, so they wouldn’t be prone to goodhart. That would mean, for instance, being able to formally specify “strawberry on a plate” in a non-goodhartable way, so an AI optimizing for a strawberry on a plate would actually produce a strawberry on a plate. Of course, that does not mean that an AI optimizing for that specification would be safe—it would actually produce a strawberry on a plate, but it would still be perfectly happy to take over the world and knock over various vases in the process.
Of course, just generally improving the performance of black-box ML is another possibility, but I don’t think this sort of research is likely to induce a step-change in that department; it would just be another incremental improvement. However, if alignment is a bottleneck to extracting economic value from black-box ML systems, then this is the sort of research which would potentially relax that bottleneck without actually solving the full alignment problem. In other words, it would potentially make it easier to produce economically-useful ML systems in the short term, using techniques which lead to AGI disasters in the long term.
The specifications would correctly capture what-we-actually-mean, so they wouldn’t be prone to goodhart.
I think there’s an ambiguity in “concept” here that’s important to clarify re: this hope. Humans use concepts in two ways:
1. as abstractions in themselves, like the idea of an ideal spring, which contains its behavior within the mental object, and
2. as pointers / promissory notes towards the real objects, like “tree”.
Seems likely that any agent that has to attend to trees will form the ~unique concept of “tree”, in the sense of a cluster of things and the minimal set of dimensions needed to specify the relevant behavior (height, hardness of wood, thickness, whatever). Some of this is like use (1): you can simulate some of the behavior of trees (e.g. how they’ll behave when you try to cut them down and use them to build a cabin). Some of this is like use (2): if you want to know how to grow trees better, you can navigate to instances of real trees, study them to gain further relevant abstractions, and then use those new abstractions (nutrient intake, etc.) to grow trees better.
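To make the (1)/(2) distinction a bit more concrete, here’s a toy Python sketch. It’s purely illustrative and not from the original comment; all the names (TreeAbstraction, TreePointer, logs_from_felling, etc.) are made up. Use (1) carries the relevant dimensions inside the mental object and can simulate behavior from them alone; use (2) is just a handle on real instances that you’d have to go back and study.

```python
# Toy sketch (invented names) of the two uses of a concept.
# Use (1): the concept as a self-contained abstraction that can simulate behavior.
# Use (2): the concept as a pointer / promissory note toward real instances.

from dataclasses import dataclass


@dataclass
class TreeAbstraction:
    """Use (1): the relevant dimensions, carried inside the mental object."""
    height_m: float
    wood_hardness: float
    trunk_thickness_m: float

    def logs_from_felling(self, log_length_m: float) -> int:
        # Simulate behavior entirely from the abstraction, e.g. roughly how
        # many logs a felled tree of this height would yield for a cabin.
        return int(self.height_m // log_length_m)


class TreePointer:
    """Use (2): a pointer toward real trees; answers come from going back
    to the instances, not from the abstraction itself."""

    def __init__(self, instances):
        self.instances = instances  # handles to actual trees in the world

    def study(self, question: str):
        # Placeholder: real use would mean observing/experimenting on the
        # instances to extract new abstractions (nutrient intake, etc.).
        return [f"measure {question} on {tree}" for tree in self.instances]


oak = TreeAbstraction(height_m=20.0, wood_hardness=0.7, trunk_thickness_m=0.5)
print(oak.logs_from_felling(log_length_m=2.0))                      # use (1): simulate
print(TreePointer(["oak_14", "oak_15"]).study("nutrient intake"))   # use (2): go look
```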
So what do we mean by “strawberry”, such that it’s not goodhartable? We might mean “a thing that is relevantly naturally abstracted in the same way as a strawberry is relevantly naturally abstracted”. This seems less goodhartable if we use meaning (2), but that’s sort of cheating by pointing to “what we’d think of these strawberries upon much more reflection in many more contexts of relevance”. If we use meaning (1), that seems eminently goodhartable.