Regardless of the exact starting point, seekers of “True Names” quickly find themselves recursing into a search for “True Names” of lower-level components of agency, like:
Optimization
Goals
World models
Abstraction
This is the big missing piece for me. Could you elaborate on how you go from trying to find the True Names of human values to things like what is an agent, abstraction, and embeddedness?
Goals makes sense, but it’s not obvious why the rest would be important or relevant. I feel like this reasoning would lead you to thinking about meta-ethics or something, not embeddedness and optimization.
I suspect I’m missing a connecting piece here that would make it all click.
Imagine it’s 1665 and we’re trying to figure out the True Name of physical force—i.e. how hard it feels like something is pushing or pulling.
One of the first steps is to go through our everyday experience, paying attention to what causes stronger/weaker sensations of pushing and pulling, or what effects stronger/weaker sensations have downstream. We might notice, for instance, that heavier objects take more force to push, or that a stronger push accelerates things faster. So, we might expect to find some robust relationship between the True Names of force, mass, and acceleration. At the time, we already basically had the True Name of mass, but we still needed to figure out the True Name of acceleration.
Why do we need the True Names of mass and acceleration, rather than just trying to figure out force directly? Well, finding robust relationships between the True Names of multiple concepts is, historically, one of the main ways we’ve been able to tell that we have the right Names. We can use e.g. the extension of a spring to measure force, but then what makes us think this operationalization of “force” is going to robustly generalize in the ways we expect? One main reason we expect today’s notion of “force” to robustly generalize is the extremely robust experimental relationship force = mass * acceleration.
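For concreteness, here’s a toy simulation (numbers entirely made up) of the kind of cross-check that makes us trust an operationalization: measure “force” two different ways and see whether the two track each other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy cross-check, with entirely made-up numbers: if the spring-extension
# operationalization of force and the m*a relationship both track the same
# underlying quantity, the two "measurements" should agree up to noise.
k = 50.0                                       # assumed spring constant, N/m
true_force = rng.uniform(1.0, 20.0, size=200)  # N, the quantity we're trying to Name
masses = rng.uniform(0.5, 5.0, size=200)       # kg

extension = true_force / k + rng.normal(0.0, 0.002, size=200)          # noisy spring reading, m
acceleration = true_force / masses + rng.normal(0.0, 0.05, size=200)   # noisy kinematics, m/s^2

force_via_spring = k * extension
force_via_newton = masses * acceleration

print(np.corrcoef(force_via_spring, force_via_newton)[0, 1])  # close to 1
```

If the spring reading and mass-times-acceleration kept disagreeing outside of measurement noise, that would be evidence we hadn’t actually found the Name.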
(Side note: for mathematical concepts, like e.g. probability and information, the main reason we expect the formulations to robustly generalize is usually the mathematical proof of some relationship, ideally augmented with experimental evidence, rather than just experimental evidence alone.)
Also, while we’re still searching for the relevant Names, relationships between concepts help steer the search—for instance, it’s a lot easier to figure out the True Name of heat once we have the Name of temperature.
Anyway, to answer what I think is your real question here...
A Hypothetical Dialogue
Philosopher: Imagine that Alice and Bob both want an apple, and they ca-
Alignment Researcher: Whoa now, hold up.
Philosopher: … I haven’t even asked the question yet.
Alignment Researcher: What is this “wanting” thing?
Philosopher: You know, it’s this feeling you get where -
Alignment Researcher: I don’t suppose you have any idea how to formulate that mathematically?
Philosopher: How about as a utility function? I hear that formulation has some arguments going for it...
Alignment Researcher: <looks at some math and experimental results> somewhat dubious, but it looks like it’s at least in the right ballpark. Ok, so Alice and Bob both “want” an apple, meaning that (all else equal) they will accept whatever trades (or tradeoffs) give them the apple, and turn down any trades (or tradeoffs) which lose them the apple? Or, in other words, they’re each optimizing to get that apple.
Philosopher: Well, not exactly, we’re not saying Alice and Bob always do the things which get them what they want. “Wanting” isn’t a purely behavioral concept. But you know what, sure, let’s go with that for now. So Alice and Bob both want an apple, and they can’t both -
Alignment Researcher: Ok, hang on, I’m trying to code up this hypothetical in Python, and I’m still unsure about the type-signatures. What are the inputs of the utility functions?
Philosopher: um… <does some googling>… Well, the standard formulation says that Alice and Bob are expected utility maximizers, so the inputs to the utility function will be random variables in their world models.
Alignment Researcher: World models? What the heck are world models??
Philosopher: Well, since we’re modelling them as ideal agents anyway, it’s a Bayesian distribution over a bunch of random variables corresponding to things in the world, which we upd-
Alignment Researcher: “corresponding to things in the world”? I know how to do Bayesian updates on distributions, but it’s not like the variables in those distributions necessarily “correspond to the world” in any particular way. Presumably we need this “correspondence” in order for Alice and Bob’s internal variables representing the “apple” to correspond with some actual apple? Heck, what even is an “actual apple”? That’s important, if we want Alice and Bob to “want” some actual thing in the world, and not just particular activations in their sense-data...
… Anyway, The Point Is
When we try to mathematically formulate even very basic ideas about ethics, we very quickly run into questions about how to formalize agency, world-models, etc.
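Concretely, here’s the sort of thing the Alignment Researcher might start typing before getting stuck (toy code with made-up names, not a real proposal):

```python
from typing import Callable, Dict

# A "world model" as a bag of named random variables (here, just point beliefs),
# and the type-signature the Alignment Researcher was asking about.
WorldModel = Dict[str, float]
Utility = Callable[[WorldModel], float]

def alice_utility(world: WorldModel) -> float:
    # "apple_owned_by_alice" is an internal variable in Alice's model.
    # Nothing in this code says what makes that variable correspond to an
    # actual apple rather than to particular patterns in her sense-data.
    return world["apple_owned_by_alice"]

alice_model: WorldModel = {"apple_owned_by_alice": 0.7}  # a belief, not an apple
print(alice_utility(alice_model))  # 0.7
```

The type-checker is satisfied; the correspondence question is untouched.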
Thank you! This clarifies a lot. The dialogue was the perfect blend of entertaining and informative.
It might be worth either including it in the original post or posting it as a separate one, because it really helps fill in the rationale.
I understand the point of your dialogue, but I also feel like I could model someone saying “This Alignment Researcher is really being pedantic and getting caught in the weeds” (especially someone who wasn’t sure why these questions should collapse into world models and correspondence).
(After all, the Philosopher’s question probably didn’t depend on actual apples; the apple was just a stand-in for something with positive utility. So the inputs of the utility functions could easily be “apples”, where an apple is an object with a single property, “owner”: Alice prefers apple.owner == "alice" and Bob prefers apple.owner == "bob". That sidesteps the entire question of world models and correspondence.)
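In code, the sketch I have in mind is roughly this (toy names, and obviously not the only way to write it):

```python
from dataclasses import dataclass

# The "sidestep world models" version: an apple is just an object with an
# owner field, and each utility function only ever looks at that field.
@dataclass
class Apple:
    owner: str

def alice_utility(a: Apple) -> int:
    return int(a.owner == "alice")

def bob_utility(a: Apple) -> int:
    return int(a.owner == "bob")

apple = Apple(owner="alice")
print(alice_utility(apple), bob_utility(apple))  # 1 0
```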
I suspect you did this because the half-formed question about apples was easier to come up with than a fully formed question that would necessarily require engagement with world models, and I’m not even sure that’s the wrong choice. But this was the impression I got reading it.
I also wonder about this. If I’m understanding the post and comment right, the claim is that if you don’t formulate it mathematically, it doesn’t generalize robustly enough, and that to formulate something mathematically you need to be ridiculously precise/pedantic?
Although this is probably wrong and I’m mostly invoking Cunningham’s Law.
I doubt my ability to be entertaining, but perhaps I can be informative. Mathematical formulation is needed because, per Goodhart’s law, imperfect proxies break down. Mathematics is a tool rigorous enough to get us from “that sounds like a pretty good definition” (like “zero correlation” in the radio-signals example) to “I’ve proven this is the definition” (like “zero mutual information”).
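To make that contrast concrete, here’s a minimal numpy sketch (a toy pair of variables, not the post’s actual radio-signals setup) where the correlation is essentially zero but the mutual information clearly isn’t:

```python
import numpy as np

rng = np.random.default_rng(0)

# Y is a deterministic function of X, so the two variables share lots of
# information, yet their (linear) correlation is ~0 because the relationship
# is symmetric around zero.
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2

print("correlation:", np.corrcoef(x, y)[0, 1])  # ~0

# Crude plug-in estimate of mutual information from a 2D histogram.
counts, _, _ = np.histogram2d(x, y, bins=30)
pxy = counts / counts.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nonzero = pxy > 0
mi = np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero]))
print("mutual information (bits, estimated):", mi)  # clearly > 0
```

Only the mutual-information version notices that the second variable is completely determined by the first.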
The proof can get you from “I really hope this works” to “as long as this system satisfies the proof’s assumptions, this will work”, because the proof states its assumptions clearly, while “this has worked previously” could, and likely does, rely on a great number of unspecified commonalities among previous instances.
It gets precise and pedantic because it turns out that the things we often want to define for this endeavor are based on other things. “Mutual information” isn’t a useful formulation without a formulation for “information”. Similarly, in trying to define morality, it’s difficult to define what an agent should do in the world (or even what it means for an agent to do things in the world), without ideas of agency and doing, and the world. Every undefined term you use brings you further from a formulation you could actually use to create a proof.
In all, mathematical formulation isn’t the goal; it’s the prerequisite. “Zero correlation” was mathematically formalized, but that was not sufficient.
Why is the Alignment Researcher different from a normal AI researcher?
Markov decision processes, for example, are often conceptualized in terms of “agents” which take “actions” and receive “rewards”, and I think none of those terms are “True Names”.
Despite this, when researchers look into ways to give MDPs some other sort of capability or guarantee, they don’t really seem to prioritize finding True Names. Unlike the Alignment Researcher in your dialogue, a normal AI researcher would seem perfectly fine accepting the philosopher’s vaguely defined terms.
What is it about alignment which makes finding True Names such an important strategy, when finding True Names doesn’t seem to be that important for, e.g., learning from biased data sets (or any of the other million things AI researchers try to get MDPs to do)?