This might be a bridge between machine learning and agent foundations that is itself related to alignment. A vague concept could be expressed by a machine learning model that presents behaviors in a large, diverse collection of specific episodes it can make sense of (its scope), exercising its influence as it decides according to its taste.
The machine learning point of view is that we are training the model on all the episodes as the dataset (or, for reinforcement learning, as the collection of environments), with the other things that define the episodes (besides the model itself) supplying the data the model learns from. The decision theory point of view is that the model is an adjudicator between the episodes, a shared agent of acausal coordination that intervenes in all of them jointly: the model presents a single policy that is the updateless decision to arrange all the episodes in one possible way, as opposed to other possible ways.
The alignment point of view is that a concept expresses a tiny aspect of preference in its decisions, with its role as an adjudicator giving consistency to that aspect of preference and coherence to the decisions of an agent that relies on the concept. It acts to extend the Goodhart scope of the agent as a whole to the episodes that the concept can make sense of. The scope of a concept should be in equilibrium with its content (behavior), as settled by reflection on learning from the episodes where the concept acts/occurs. Different concepts interact in shared episodes, where their scopes intersect, jointly supplying all the data that makes up an episode (when it doesn’t originate as observations of reality, which ground the whole thing). Concepts in this sense are both a way of extending the Goodhart scope and a way of remaining aware of its current locus.
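To make the “one model, many episodes” picture a bit more concrete, here is a minimal toy sketch in Python (my own construction; the episode contexts, scoring rule, and candidate policies are all invented, and nothing here is taken from the comment above). It only illustrates the shape of the idea: a single shared policy is chosen by how it scores over the whole collection of episodes at once, rather than being re-chosen inside each episode.

```python
# A minimal toy of "one shared model/policy, many episodes" (everything here,
# from the episode contexts to the scoring rule, is invented for illustration).

import random

random.seed(0)

# "The other things defining the episodes": here, each episode is just a context value.
episodes = [{"context": random.uniform(-1, 1)} for _ in range(20)]

def behave(threshold, episode):
    """The shared policy's behavior inside a single episode: a binary decision."""
    return episode["context"] > threshold

def episode_score(decision, episode):
    """A toy 'taste': reward saying True exactly when the context exceeds 0.3."""
    return 1.0 if decision == (episode["context"] > 0.3) else 0.0

def joint_score(threshold):
    """Score a candidate policy over the whole collection of episodes at once."""
    return sum(episode_score(behave(threshold, ep), ep) for ep in episodes)

# Candidate shared policies (just a threshold parameter in this toy), chosen by
# comparing whole arrangements of all the episodes rather than per-episode picks.
candidates = [-0.5, 0.0, 0.3, 0.7]
best = max(candidates, key=joint_score)
print("chosen shared policy:", best, "joint score:", joint_score(best))
```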
Sorry if I dumb it down too much. I tried to come up with specific examples without terminology. Here’s how I understand what you’re saying:
A vague concept can be compared to an agent (AI).
You can use vague concepts to train agents (AIs).
An agent can use a vague concept to define its field of competence.
Simple/absurd examples:
Let’s say we have a bunch of movies, and N vague concepts such as “bad movie”, “funny movie”, etc. Each concept is an AI of sorts. Those concepts “discuss” the movies and train each other (a toy sketch of this follows after these examples).
We have a vague concept, such as “health”, and some examples of people who may or may not be healthy. Different AIs discuss whether a person is healthy and train each other.
Let’s say the vague concept is “games”. An AI uses this concept to determine what is a game and what is not, or the “implications” of treating something as a game (see “Internal structure, ‘gradient’”).
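As a rough illustration of the movies example, here is a crude count-based toy in Python (entirely made up; the movies, tags, seeds, and thresholds are placeholders). Two “concepts” score the shared examples by tag overlap with their current positive cases and publish their confident verdicts as new tags, so each concept’s verdicts become part of what the other concept counts next round, which is a loose sense of “discussing the movies and training each other”.

```python
# A crude count-based toy (all movies, tags, seeds, and thresholds are invented).
# Each "concept" keeps a set of movies it counts as positive examples, scores the
# rest by tag overlap with those examples, and publishes confident verdicts as
# new tags on the shared movies, which the other concept's counts pick up next round.

from collections import Counter

movies = {
    "A": {"jokes", "witty_dialogue", "slapstick"},
    "B": {"plot_holes", "wooden_acting", "cheap_sets"},
    "C": {"jokes", "slapstick", "cheap_sets"},
    "D": {"plot_holes", "cheap_sets", "slapstick"},
}

# Seed judgments: which movies each concept starts out sure about.
seeds = {"funny": {"A"}, "bad": {"B"}}

def tag_weights(positives):
    """How often each tag occurs among a concept's current positive examples."""
    counts = Counter()
    for title in positives:
        counts.update(movies[title])
    return counts

for _ in range(2):  # two rounds of "discussion"
    for concept, positives in seeds.items():
        weights = tag_weights(positives)
        for title, tags in movies.items():
            if title in positives:
                continue
            if sum(weights[t] for t in tags) >= 2:   # confident enough
                tags.add(f"judged_{concept}")        # publish the verdict as a tag
                positives.add(title)

print(seeds)   # which movies each concept now claims
print(movies)  # the shared examples now carry the published verdicts
```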
This might be a bridge between machine learning and agent foundations that is itself related to alignment.
In this case, could you help me with the topic of “colors”? I wouldn’t have written this post if I hadn’t written about “colors”. So, this is evidence (?) that the topic of “colors” isn’t insane.
There, a “place” is a vague concept. “Spectrum” is a specific context for the place. Meaning is a distribution of “details”. Learning is guessing the correct distribution of details (the “color”) for a place in a given context.
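If it helps, here is a minimal sketch in Python of that reading of “learning” (my own toy construction, not the model from the colors post; the places, contexts, and details are invented placeholders): estimate, from observed examples, the distribution of “details” for each place in each context, and report that distribution as the learned “color”.

```python
# A toy reading of "meaning is a distribution of details" (places, contexts, and
# details below are invented placeholders, not data from the colors post).

from collections import Counter, defaultdict

# Observations: (place, context, detail) triples the learner gets to see.
observations = [
    ("forest", "daytime", "green"),
    ("forest", "daytime", "green"),
    ("forest", "daytime", "brown"),
    ("forest", "autumn", "orange"),
    ("forest", "autumn", "brown"),
    ("lake", "daytime", "blue"),
]

counts = defaultdict(Counter)
for place, context, detail in observations:
    counts[(place, context)][detail] += 1

def color_of(place, context):
    """The learned 'color': the distribution of details for this place in this context."""
    c = counts[(place, context)]
    total = sum(c.values())
    return {detail: n / total for detail, n in c.items()} if total else {}

print(color_of("forest", "daytime"))  # mostly 'green', some 'brown'
print(color_of("forest", "autumn"))   # same place, different context, different "color"
```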
So, this is evidence (?) that the topic of “colors” isn’t insane.
I mean, the sketch I’ve written up (mostly here, some bargaining-related discussion here) is not very meaningful; it’s like that “Then a miracle occurs” comic. You can make up such things for anything; it’s almost theoretical fiction (not real theory), in a sense analogous to that of historical fiction (not real history). It might be possible to build something out of this, but probably not, just as you don’t normally look for practical advice in books of fiction, even though it can sometimes happen to be found there. That’s not what books of fiction are for, though, and there could be a scene for self-aware theoretical fiction writers.
I think the interesting point of the sketch is how it naturally puts machine learning models in the context of acausal decisions from agent foundations, two points of view that are usually disjoint in central examples of either. So maybe there is a way to persist in contorting them in each other’s direction along these lines, and that prompted me to mention it.
I’ve written in response to this post because your definition of vague concepts (at the beginning of the post) seems to fit adjudicators pretty well. In the colors post, there are also references to paradigms, which are less centrally adjudicators, but Goodhart scope is their signature feature (a paradigm can famously fail to understand/notice problems that are natural and important for a different point of view).
This post about vague concepts in general is mostly meaningless to me too: I care about something more specific, “colors”. However, I think a text may be “meaningless” and yet very useful:
You thought about topics that are specific and meaningful for you. You came up with an overly general “meaningless” sketch (A).
I thought about topics that are specific and meaningful for me. I came up with an overly general “meaningless” post (B).
We recognized a similarity between our generalizations. This similarity is “meaningless” too.
Did we achieve anything? I think we could have. If one of us gets a specific insight, there’s a chance to translate this insight (from A to B, or from B to A).
So I think the use of “agent” in the first point I quoted is about adjudicators; in the second point both adjudicator and outer agent fit (but mean different things); and the third point is about the outer agent (how its Goodhart scope relates to those of the adjudicators). (link)
I just tried to understand (without terminology) how my ideas about “vague concepts” could help to align an AI. Your post prompted me to think directly in this direction. And right now I see this possibility:
The most important part of my post is the idea that the specific meanings of a vague concept have an internal structure (at least in specific circumstances). It’s as if (it’s just an analogy) the vague concept is aware of its own changes of meaning and reacts to those changes. You could try to use this “self-awareness” to align an AI, to teach it to respect important boundaries.
For example (it’s an awkward example), let’s say you want to teach an AI that interacting with a human is often not a game, or that it may be bad to treat it as a game. If the AI understands that reducing the concept of “communication” to the concept of a “game” carries certain implications, you would be able to explain which reductions and implications are bad without giving the AI complicated explicit rules.
(Another example.) If an AI has (or is able to reach) an internal worldview in which “loving someone” and “making a paperclip” are fundamentally different things, and not just a matter of arbitrary complicated definitions, then it may be easier to explain human values to it.
However, this is all science fiction if we have no idea how to model concepts and ideas and their changes of meaning. But my post about colors, I believe, can give you ideas about how to do this. I know:
Maybe it doesn’t have enough information for an (interesting) formalization.
Even if you make an interesting formalization, it won’t automatically solve alignment even in the best case scenario.
But it may give ideas, a new approach. I want to fight for this chance, both because of AI risk and because of very deep personal reasons.
“Adjudicator” is a particular role for agents/policies, and the policies (algorithms that run within episodes) are not necessarily themselves agents (adjudicator-as-agent chooses an adjudicator-as-policy as its decision, in the agent foundations point of view). There is also an “outer agent” I didn’t explicitly discuss that constructs episodes on situations, deciding that certain adjudicators are relevant to a situation and should be given authority to participate in shaping or observing the content of the episode on it. This outer agent is at a different level of sophistication than the adjudicators-as-policies (though not necessarily different from adjudicators-as-agents), and is in a sense built out of the adjudicators, as discussed here.
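A structural toy sketch in Python of how I read this split (all names and types are my own placeholders, not anything specified in the comment): adjudicators-as-policies are just functions run inside episodes, and the outer agent decides, per situation, which adjudicators are in scope and lets those policies jointly shape the episode’s content.

```python
# A structural toy: adjudicators-as-policies are plain functions run inside
# episodes; the outer agent builds an episode on a situation by selecting the
# adjudicators whose scope covers it and letting their policies shape the content.
# All names and types are invented placeholders.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

Situation = Dict[str, str]   # the bare facts an episode is built on
Episode = Dict[str, str]     # the content the adjudicators fill in

@dataclass
class Adjudicator:
    name: str
    in_scope: Callable[[Situation], bool]          # which situations it can make sense of
    policy: Callable[[Situation, Episode], None]   # how it shapes an episode (not itself an agent)

@dataclass
class OuterAgent:
    adjudicators: List[Adjudicator] = field(default_factory=list)

    def build_episode(self, situation: Situation) -> Episode:
        episode: Episode = dict(situation)   # ground the episode in the situation
        for adj in self.adjudicators:
            if adj.in_scope(situation):      # grant authority only within the concept's scope
                adj.policy(situation, episode)
        return episode

# Hypothetical usage.
game_concept = Adjudicator(
    name="game",
    in_scope=lambda s: "activity" in s,
    policy=lambda s, e: e.update({"framing": "could be treated as a game"}),
)
outer = OuterAgent([game_concept])
print(outer.build_episode({"activity": "chess"}))
print(outer.build_episode({"weather": "rain"}))   # out of scope: the concept stays silent
```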