So at first I thought this didn’t include a step where the AI learns to care about things—it only learns to model things. But I think you’re actually assuming that we can just directly use the model to pick actions that have predicted good outcomes—which are going to be selected as “good” according to the pre-specified P-properties. This is a flaw because it leaves too much hard work for the specifiers to do—we want the environment to do far more of the work of selecting what’s “good.”
I assume we get an easily interpretable model where the difference between “real strawberries” and “pictures of strawberries” and “things sometimes correlated with strawberries” is easy to define, so we can use the model to directly pick the physical things AI should care about. I’m trying to address the problem of environmental goals, not the problem of teaching AI morals. Or maybe I’m misunderstanding your point?
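For concreteness, here is a minimal toy sketch of this kind of setup (every name and the two-field “state” below are hypothetical, chosen purely for illustration; this is not anyone’s actual proposal, just the division of labor under discussion): a learned world model predicts outcomes, a pre-specified predicate over the model’s states stands in for the P-properties, and the agent simply picks the action whose predicted outcome scores highest under that predicate.

```python
# Toy sketch only: hypothetical names and states, not a real implementation.
# The learned model does the modeling; the pre-specified predicate supplies
# what counts as "good"; action selection is argmax over predicted outcomes.

from dataclasses import dataclass


@dataclass
class WorldState:
    # Hypothetical interpretable state: the model distinguishes real
    # strawberries from mere pictures of strawberries.
    real_strawberries: int
    strawberry_pictures: int


def world_model(state: WorldState, action: str) -> WorldState:
    """Stand-in for the learned model: predicts the next state given an action."""
    if action == "grow":
        return WorldState(state.real_strawberries + 1, state.strawberry_pictures)
    if action == "paint":
        return WorldState(state.real_strawberries, state.strawberry_pictures + 1)
    return state  # "wait" or anything else changes nothing


def p_property_score(state: WorldState) -> float:
    """Pre-specified notion of "good": only real strawberries count."""
    return float(state.real_strawberries)


def pick_action(state: WorldState, actions: list[str]) -> str:
    """Select the action whose predicted outcome the predicate scores highest."""
    return max(actions, key=lambda a: p_property_score(world_model(state, a)))


print(pick_action(WorldState(0, 0), ["grow", "paint", "wait"]))  # -> "grow"
```

The point of the sketch is only to show where the hard work sits: everything about what “good” means lives in `p_property_score`, which the specifiers have to get right up front.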
The object-level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields (“What they want is to obey the laws of physics. What they believe is their local state.”), or to your individual cells, etc.
If you’re talking about AI learning morals, my idea is not about that. Not about modeling desires and beliefs.
The meta-level problem is that getting the AI to assign properties in a human-approved way is a complicated problem, and you can only do so well at it without communicating with humans. (John Wentworth more or less disagrees; check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)
I disagree too, but in a slightly different way. IIRC, John says approximately the following:
1. All reasoning systems converge on the same space of abstractions. This space of abstractions is the best way to model the universe.
2. In this space of abstractions it’s easy to find the abstraction corresponding to e.g. real diamonds.
I think (1) doesn’t need to be true. I say:
1. By default, humans only care about things they can easily interact with in humanly comprehensible ways. “Things which are easy to interact with in humanly comprehensible ways” should have a simple definition.
2. Among all “things which are easy to interact with in humanly comprehensible ways”, it’s easy to find the abstraction corresponding to e.g. real diamonds.