So, this is evidence (?) that the topic about “colors” isn’t insane.
I mean, the sketch I’ve written up (mostly here, some bargaining-related discussion here) is not very meaningful; it’s like that “Then a miracle occurs” comic. You can make up such things for anything. It’s almost theoretical fiction (not real theory), in a sense analogous to historical fiction (not real history). It might be possible to build something out of this, but probably not, just as you don’t normally look for practical advice in books of fiction, even though it can sometimes be found there. That’s not what books of fiction are for, though, and there could be a scene for self-aware theoretical fiction writers.
I think the interesting point of the sketch is how naturally it puts models of machine learning in the context of acausal decisions from agent foundations, two points of view that are usually disjoint in the central examples of either. So maybe there is a way to keep contorting them in each other’s direction along these lines, and that prompted me to mention it.
I’ve written in response to this post because your definition of vague concepts (at the beginning of the post) seems to fit adjudicators pretty well. In the colors post, there are also references to paradigms, which are less centrally adjudicators, but goodhart scope is their signature feature (a paradigm can famously fail to understand/notice problems that are natural and important for a different point of view).
This post about vague concepts in general is mostly meaningless for me too: I care about something more specific, “colors”. However, I think a text may be “meaningless” and yet very useful:
You thought about topics that are specific and meaningful for you. You came up with an overly general “meaningless” sketch (A).
I thought about topics that are specific and meaningful for me. I came up with an overly general “meaningless” post (B).
We recognized a similarity between our generalizations. This similarity is “meaningless” too.
Did we achieve anything? I think we could have. If one of us gets a specific insight, there’s a chance to translate this insight (from A to B, or from B to A).
So I think the use of “agent” in the first point I quoted is about adjudicators, in the second point both adjudicator and outer agent fit (but mean different things), and the third point is about the outer agent (how its goodhart scope relates to those of the adjudicators). (link)
I just tried to understand (without terminology) how my ideas about “vague concepts” could help to align an AI. Your post prompted me to think in this direction directly. And right now I see this possibility:
The most important part of my post is the idea that the specific meanings of a vague concept have an internal structure (at least in specific circumstances). It is as if (it’s just an analogy) the vague concept is self-aware about its changes of meaning and reacts to those changes. You could try to use this “self-awareness” to align an AI, to teach it to respect important boundaries.
For example (it’s an awkward example), let’s say you want to teach an AI that interacting with a human is often not a game, or that it may be bad to treat it as a game. If the AI understands that reducing the concept of “communication” to the concept of a “game” may carry some implications, you would be able to explain which reductions and implications are bad without giving the AI complicated explicit rules.
(Another example) If an AI has (or is able to reach) an internal worldview in which “loving someone” and “making a paperclip” are fundamentally different things and not just a matter of arbitrary complicated definitions, then it may be easier to explain human values to it.
However, this is all science fiction if we have no idea how to model concepts and ideas and their changes of meaning. But my post about colors, I believe, can give you ideas about how to do this. I know:
Maybe it doesn’t have enough information for an (interesting) formalization.
Even if you make an interesting formalization, it won’t automatically solve alignment even in the best case scenario.
But it may give ideas, a new approach. I want to fight for this chance, both because of AI risk and because of very deep personal reasons.