This post points out that many alignment problems can be phrased as embedded agency problems. It seems to me that they can also all be phrased as word-boundary problems. More precisely, for each alignment/embedded-agency problem listed here, there’s a question (or a set of questions) of the form “what is X?” such that answering that question would go a long way toward solving the alignment/embedded-agency problem, and vice-versa.
Is this a useful reduction?
The “what is X?” question I see for each problem:
The Keyboard is Not The Human
What does it mean for a person to “say” something (in the abstract sense of the word)?
Modified Humans
What is a “human”? Furthermore, what does it mean to “modify” or “manipulate” a human?
Off-Equilibrium
What are the meanings of counterfactual statements? For example, what does it mean to say “We will launch our nukes if you do”?
Perhaps also, what is a “choice”?
Drinking
What is a “valid profession of one’s values”?
Value Drift
What are a person’s “values”? Focus being on people changing over time.
Akrasia
What is a “person”, and what are a person’s “values”? Focus being on people being made of disparate parts.
Preferences Over Quantum Fields
What are the meanings of abstract, high-level statements? Do they change if your low-level model of the world fundamentally shifts?
Unrealized Implications
What are a person’s “values”? Focus being on someone knowing A and knowing A->B but not yet knowing B.
Socially Strategic Self-Modification
What are a person’s “true values”? Focus being on self-modification.
Yes and no.
I do think you’re pointing to the right problems—basically the same problems Shminux was pointing at in his comment, and the same problems which I think are the most promising entry point to progress on embedded agency in general.
That said, I think “word boundaries” is a very misleading label for this class of problems. It suggests that the problem is something like “draw a boundary around points in thing-space which correspond to the word ‘tree’”, except for concepts like “values” or “person” rather than “tree”. Drawing a boundary in thing-space isn’t really the objective here; the problem is that we don’t know what the right parameterization of thing-space is or whether that’s even the right framework for grounding these concepts at all.
Here’s how I’d pose it. Over the course of history, humans have figured out how to translate various human intuitions into formal (i.e. mathematical) models. For instance:
Game theory gave a framework for translating intuitions about “strategic behavior” into math
Information theory gave a framework for translating intuitions about information into math
More recently, work on causality gave a framework for translating intuitions about counterfactuals into math
In the early days, people like Galileo showed how to translate physical intuitions into math
A good heuristic: if a class of intuitive reasoning is useful and effective in practice, then there’s probably some framework which would let us translate those intuitions into math. In the case of embedded-agency-related problems, we don’t yet have the framework—just the intuitions.
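As a concrete illustration of the causality item above: Pearl-style counterfactual reasoning boils down to an abduction/action/prediction recipe over a structural causal model. Below is a minimal sketch in Python; the two-variable model and the deterrence “policy” are toy assumptions standing in for the nuke example, not a proposed formalization of that problem.

```python
# Minimal sketch: a Pearl-style counterfactual via abduction -> action -> prediction.
# The two-variable structural causal model and the deterrence "policy" are toy
# assumptions standing in for "we will launch our nukes if you do".

def run_model(u_they, policy):
    """Structural equations: exogenous noise u_they determines whether they
    launch; whether we launch is a deterministic function of their launch."""
    they_launch = u_they
    we_launch = policy(they_launch)
    return they_launch, we_launch

deterrence = lambda they_launch: they_launch  # "we launch iff you do"

# Observed world: nobody launched.
observed = (False, False)

# 1. Abduction: infer the exogenous term consistent with the observation.
#    (Trivial here because the model is deterministic.)
u_they = observed[0]
assert run_model(u_they, deterrence) == observed

# 2. Action: intervene on the antecedent -- force "they launch" to True,
#    cutting the mechanism that normally sets it while keeping the inferred
#    exogenous term fixed.
# 3. Prediction: propagate through the remaining structural equations.
counterfactual_we = deterrence(True)

print("Had they launched, we would have launched:", counterfactual_we)  # True
```

The point of the sketch is just that the framework assigns a definite meaning to an off-equilibrium statement; nothing analogous currently exists for “values” or “person”.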
With that in mind, I’d pose the problem as: build a framework for translating intuitions about “values”, “people”, etc. into math. That’s what we mean by the question “what is X?”.
Ooh, that is very insightful. The word-boundary problem around “values” feels fuzzy and ill-defined, but that doesn’t mean that the thing we care about is actually fuzzy and ill-defined.