I like this post but I’m a bit confused about why it would ever come up in AI alignment. Since you can’t get an “ought” from an “is”, you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate “trust public-emotional supervisor” from “trust private-calm supervisor”.
Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) the AI has a hardcoded template of a “Boltzmann-rational agent”, (2) the AI fits that template to the supervisor as best it can, (3) the AI tries to fulfill the inferred goals of the supervisor? Then this post would be saying that we should be open to the possibility that the “best fit” of this template is very wrong, even if we allow CIRL-like interaction. But I would say that the real problem in this scenario is that the hardcoded template stinks, and we need a better hardcoded template, or else we shouldn’t be using this approach in the first place, at least not by itself. I guess that’s “obvious” to me, but it’s nice to have this concrete example of how it can go wrong, so thanks for that :-)
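To make the scheme concrete, here is a minimal toy sketch of steps (1)–(2), i.e. Boltzmann-rational IRL via maximum likelihood over a small hypothesis set. Everything here is hypothetical and made up for illustration: the two candidate reward hypotheses, the action rewards, the rationality coefficient `BETA`, and the observed demonstrations; it is not anyone's actual proposal, just the simplest version of the template-fitting idea.

```python
import math

# Hardcoded "Boltzmann-rational agent" template:
#   P(action | theta) proportional to exp(BETA * R_theta(action))
BETA = 2.0  # assumed rationality coefficient (part of the template)

# Two hypothetical reward hypotheses over three abstract actions.
# The names echo the post's two supervisor interpretations.
CANDIDATE_REWARDS = {
    "trust_public_emotional": {"a": 1.0, "b": 0.2, "c": 0.0},
    "trust_private_calm":     {"a": 0.9, "b": 0.4, "c": 0.0},
}

def boltzmann_probs(rewards, beta=BETA):
    """Action distribution the template predicts for a given reward function."""
    z = sum(math.exp(beta * r) for r in rewards.values())
    return {a: math.exp(beta * r) / z for a, r in rewards.items()}

def log_likelihood(theta, observed_actions):
    """How well hypothesis theta explains the supervisor's demonstrations."""
    probs = boltzmann_probs(CANDIDATE_REWARDS[theta])
    return sum(math.log(probs[a]) for a in observed_actions)

def infer_theta(observed_actions):
    """Step (2): pick the best-fitting reward hypothesis."""
    return max(CANDIDATE_REWARDS,
               key=lambda t: log_likelihood(t, observed_actions))

observed = ["a", "a", "b"]  # hypothetical supervisor demonstrations
best_fit = infer_theta(observed)
```

Note that with these made-up numbers the two hypotheses assign nearly equal likelihood to the data, so the “best fit” can flip on a handful of noisy demonstrations, which is one way the inferred goals end up very wrong even though the fitting procedure worked exactly as designed.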