In the case of humans, it seems the post is somewhat missing what’s going on. Humans are running something like this:
...there isn’t any special systematization and concretization process. All the time, there are models running at different levels of the hierarchy, and every layer tries to balance prediction errors from more concrete layers against prediction errors from more abstract layers.
How does this relate to “values”? … From the low-level sensory experience of cold and a fixed prior about body temperature, the AIF (active inference) system learns a more abstract and general “goal-belief” about the need to stay warm, and more abstract sub-goals about clothing, etc. At the end there is a hierarchy of increasingly abstract “goal-beliefs” about what I do, expressed relative to the world model.
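To make the “balancing” point concrete, here is a minimal toy sketch (purely my illustration; the layer count, precisions, and numbers are made up, and the generative maps are just the identity). Each intermediate layer settles to a precision-weighted compromise between the prediction error coming up from the more concrete layer below and the prediction error coming down from the more abstract layer above:

```python
import numpy as np

def relax_hierarchy(observation, prior, n_layers=4,
                    precision_up=1.0, precision_down=1.0,
                    lr=0.05, steps=2000):
    """Relax the latent states x[1..n_layers-1]; x[0] stays clamped to the
    sensory observation, and a fixed prior stands in for the layer above
    the most abstract one."""
    x = np.linspace(observation, prior, n_layers)  # crude initial guesses
    for _ in range(steps):
        for i in range(1, n_layers):
            # error signal from the more concrete layer below
            err_below = x[i - 1] - x[i]
            # error signal from the more abstract layer above (or the fixed prior)
            above = x[i + 1] if i + 1 < n_layers else prior
            err_above = x[i] - above
            # each layer balances the two precision-weighted error signals
            x[i] += lr * (precision_up * err_below - precision_down * err_above)
    return x

# e.g. cold sensory evidence (30) versus a fixed prior about body temperature (37)
print(relax_hierarchy(observation=30.0, prior=37.0))
```

With equal precisions the hierarchy relaxes to a smooth interpolation between the sensory evidence and the fixed prior; shifting the precisions shifts which end of the hierarchy dominates.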
What’s worth studying here is how human brains manage to keep the hierarchy mostly stable.
I agree that this is closely related to the predictive processing view of the brain. In the post I briefly distinguish between “low-level systematization” and “high-level systematization”; I’d call the thing you’re describing the former. The latter seems like it might be more complicated, relying on whatever machinery brains have on top of predictive coding (e.g. abstract reasoning).
In particular, some humans are way more systematizing than others (even at comparable levels of intelligence). And so just saying “humans are constantly doing this” feels like it’s missing something important. Whatever the thing is that some humans are doing way more of than others, that’s what I’m calling high-level systematizing.
Re self-unalignment: that framing feels a bit too abstract for me; I don’t really know what it would mean, concretely, to be “self-aligned”. I do know what it would mean for a human to systematize their values—but as I argue above, it’s neither desirable to fully systematize them nor to fully conserve them. Identifying whether there’s a “correct” amount of systematization to do feels like it will require a theory of cognition and morality that we don’t yet have.
My impression is that you get a lot of “the latter” if you run “the former” on the domain of language and symbolic reasoning, and often the underlying model is still S1-type. E.g.
rights inherent & inalienable, among which are the preservation of life, & liberty, & the pursuit of happiness
does not sound to me like someone did a ton of abstract reasoning to systematize other abstract values, but more like someone succeeded in writing words which resonate with “the former”.
Also, I’m not sure why you think the latter is more important for the connection to AI. Current ML seems more similar to “the former”: informal, intuitive, fuzzy reasoning.
Re self-unalignment: that framing feels a bit too abstract for me; I don’t really know what it would mean, concretely, to be “self-aligned”. I do know what it would mean for a human to systematize their values—but as I argue above, it’s neither desirable to fully systematize them nor to fully conserve them.
That’s interesting. In contrast, I have a pretty clear intuitive sense of a dimension along which people vary: some people have a lot of internal conflict, and as a result their actions are less coherent, while other people have much less of that.
In contrast, I think in the case of humans who you would likely describe as ‘having systematized their values’ … I often doubt what’s going on. A lot of people who describe themselves as hardcore utilitarians seem to be … actually not that, but rather to resemble a system where a somewhat confused verbal part fights with other parts, which are sometimes suppressed.
Identifying whether there’s a “correct” amount of systematization to do feels like it will require a theory of cognition and morality that we don’t yet have.
That’s where I think looking at what human brains are doing seems interesting. Even if you believe the low-level / “the former” story is not what’s going on with human theories of morality, the technical problem seems very similar, and the same math possibly applies.
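For concreteness, the sort of math this points at is the standard hierarchical predictive coding objective, written here in a simplified Gaussian form (a generic textbook expression, not something taken from the post): a sum of precision-weighted squared prediction errors across the hierarchy,

$$\mathcal{F} \;=\; \sum_i \frac{\pi_i}{2}\,\big\lVert x_i - g_{i+1}(x_{i+1}) \big\rVert^2$$

where $x_i$ is the state at layer $i$, $g_{i+1}$ is the generative map from the layer above, and $\pi_i$ is the precision of that layer’s error. Gradient descent on $\mathcal{F}$ with respect to each $x_i$ gives exactly the two-sided balance described above: one term pulls $x_i$ toward the prediction coming down from layer $i+1$, the other pulls it toward whatever best explains layer $i-1$ below.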
I agree with Jan here, and with how Jan’s comment connects to Thane’s comment elsewhere in this post’s comments.
I think that if ‘you’, as in ‘your conscious, thinking mind’, chooses to write down that your values are X, where X is some simplified abstract rule system much easier to calculate than the underlying ground-level details, then ‘you’ are wrong. The abstract representation is a map, not the territory, of your values. The values are still there, unchanged, hiding. When the map finds itself in conflict with the territory, ‘you’ might choose to obey the map. But then you’ll probably feel bad, because you’ll have acted against your true hidden values. Pretending that the map is the new truth of your values is just pretending.
“Systematization” seems like a special case of the Self-unalignment problem.
There’s an even more fundamental problem in terms of ‘hard to pin down concepts’: namely, what counts as a ‘human’ in the first place?