A foundation model approach to value inference
Epistemic status: shower thoughts.
I’m going to write this out as a pseudo-proof. Please pardon the lack of narrative structure. Conceptually, I’m splitting the problem of value inference into three sub-problems:
Step 1: Finding a “covering set” of all causal implications of a person’s values. The goal here is to describe a concrete “values” dataset. Modeling that dataset should be sufficient to model values.
Step 2: Creating a model of that covering set. The goal here is to show that it is feasible to model values, along with a bunch of other stuff that we eventually want to separate out.
Step 3: Factoring the model to separate the effects of values from the effects of other variables. The goal is to show how to isolate values in a model and thereby get a more robust model of values.
I’m not going to claim that everything in this post is sound or complete. But I do suspect that this process, if iterated on for a fairly short period of time, could lead to a reasonably accurate model of values.
The covering set
Conjecture: Any system that models all of the effects of a person’s values must also model that person’s values.
This should follow from some analog of the Internal Model Principle from control theory.
Proof sketch:
Anything that regulates a variable must create a model of that variable. Specifically, anything that satisfies values must create a model of values.
A regulator can be split into something that makes decisions and something that acts on those decisions. The “actor” is irrelevant for the purposes of modeling a thing.
Given an actor flexible enough to carry out any decision, modeling all of the effects of a person’s values is sufficient to satisfy those values.
Therefore the ability to model all of the effects of a person’s values implies the ability to model that person’s values.
In practice, the effects of a person’s values may include:
A person’s emotional responses.
The feedback a person gives to others.
The things a person considers important or worthwhile.
The scenarios a person considers ideal.
Hypothesis: The above four points form the basis of all things causally influenced by a person’s values. In other words, any system that models the above four things perfectly must also perfectly model a person’s values.
I’m only putting this hypothesis here for the sake of moving forward. I’m not going to try to defend it, and I wouldn’t know how to begin testing it. Feel free to come up with alternatives or with a new framing entirely.
So, tentatively, “the dataset” would only need to consist of data points on those four things: emotional responses, feedback that people give, what people consider important or worthwhile, and the scenarios people consider ideal.
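To make “the dataset” a bit more concrete, here is a minimal sketch of what one record might look like. Everything in it is an assumption for illustration: the names `EffectType`, `ValueEffectRecord`, and `coverage`, the `person_id` and `context` fields, and the two example records are all made up and are not taken from any real dataset.

```python
from dataclasses import dataclass
from enum import Enum


class EffectType(Enum):
    """The four hypothesized categories of value effects."""
    EMOTIONAL_RESPONSE = "emotional_response"
    FEEDBACK = "feedback"
    IMPORTANCE = "importance"
    IDEAL_SCENARIO = "ideal_scenario"


@dataclass(frozen=True)
class ValueEffectRecord:
    person_id: str           # hypothetical identifier for the person
    effect_type: EffectType  # which of the four categories this falls under
    context: str             # the situation that prompted the effect
    text: str                # the effect itself, described in language


def coverage(records):
    """Count records per effect type so gaps in the covering set are visible."""
    counts = {t: 0 for t in EffectType}
    for r in records:
        counts[r.effect_type] += 1
    return counts


# Illustrative placeholder records, not real data.
records = [
    ValueEffectRecord("p1", EffectType.EMOTIONAL_RESPONSE,
                      "a friend cancels plans at the last minute",
                      "felt disappointed but understood"),
    ValueEffectRecord("p1", EffectType.FEEDBACK,
                      "reviewing a colleague's draft",
                      "asked for clearer sourcing"),
]

# Categories with no data yet.
missing = [t for t, n in coverage(records).items() if n == 0]
```

Keeping the category as an explicit field means a disagreement about the hypothesis (say, a proposed fifth category) shows up as a one-line schema change rather than an abstract argument.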
Creating the model
Hypothesis: The effects of a person’s values can be reasonably well modeled through a large language model.
All of the example effects I gave can be described through language. In fact, they’re probably most easily described through language of the sort current LLMs already recognize.
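Validating this hypothesis reduces to demonstrating scaling curves on the above dataset, i.e., showing that loss on held-out data keeps falling predictably as model size, data, and compute increase.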
A language model would need to model a lot more than just a person’s values to perform well on such a dataset, but that’s okay. As long as it needs to model values well to perform well on the dataset, it’s fine for it to model extraneous things. They’ll be handled in the next step.
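The scaling-curve validation mentioned above could be sketched roughly as follows. This assumes a pure power law, loss = a * N^(-b), with no irreducible-loss term, which is a simplification. The function name `fit_power_law` is hypothetical, and the data points are synthetic placeholders generated from a known curve, not measurements.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size).

    Returns (a, b) such that loss ~= a * size ** (-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic placeholder points generated from loss = 5 * N ** -0.1.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)

# Extrapolate the fitted curve to a larger (hypothetical) model size.
predicted = a * 1e10 ** (-b)
```

In a real experiment, `sizes` and `losses` would come from models trained on the values dataset, and the interesting question is whether the fitted curve keeps holding as the models grow.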
Factoring the model
Premise: Values are grounded in measurable physical observables.
The physical observables here are neural signals.
Please be careful about claiming that this grounding is infeasible unless you’ve actually looked into the problem.
Hypothesis: Using the model from Step 2, it’s possible to make otherwise-uninterpretable value-oriented neural signals interpretable. Outline of the process:
Condition the model from Step 2 on (approximations of) these physical observables. To do this, you’ll need a dataset that pairs these physical observables with the sort of data that was in the original dataset.
Make sure the additional conditioning information reduces the perplexity of the model. This is a sanity check to make sure the collected physical observables actually correlate with a person’s values.
Make sure that changing the physical observables provided to the model results in the expected changes to the model’s outputs. This is a sanity check to make sure the physical observables correlate in the expected way with a person’s values.
With that, if you create a model of the conditioning data by any means, you end up with a value model that both matches intuition and is grounded in physical observables.
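The two sanity checks in the outline above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, `score` stands in for whatever scalar summary of the conditioned model’s output one chooses, `toy_score` is a made-up stand-in for a real model, and all numbers are placeholders rather than measurements.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def conditioning_helps(base_logprobs, cond_logprobs):
    """Check 1: conditioning on the observables should lower perplexity."""
    return perplexity(cond_logprobs) < perplexity(base_logprobs)


def intervention_matches(score, observables, key, delta, expected_sign):
    """Check 2: nudging one observable should move the output as expected.

    `score` is a hypothetical callable mapping an observables dict to a
    scalar summary of the model's output (e.g. log-prob of a target text).
    """
    nudged = dict(observables)
    nudged[key] = nudged[key] + delta
    change = score(nudged) - score(observables)
    return change * expected_sign > 0


# Placeholder per-token log-probs on the same text, for illustration only.
base = [-2.0, -1.5, -2.5]   # model without conditioning
cond = [-1.0, -1.2, -1.4]   # model conditioned on observables


def toy_score(obs):
    # Stand-in for a conditioned model; the weights are arbitrary.
    return 0.8 * obs["signal_a"] - 0.3 * obs["signal_b"]


ok_perplexity = conditioning_helps(base, cond)
ok_intervention = intervention_matches(
    toy_score, {"signal_a": 0.1, "signal_b": 0.4}, "signal_a", 0.5, +1)
```

The first check only shows that the observables carry information about the text. The second is what ties a specific observable to a specific, expected direction of change.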
Why bother
Decomposing the problem makes it easier to think about. Each of the three steps above feels a lot more intuitive and tractable than the problem of “value inference” in its entirety.
It’s modular. It splits up the philosophical work, the computational work, and the scientific work such that each of these things has the absolute minimum dependency on the others (maybe). That means people don’t need to spend a huge amount of time catching up on everything before they’re able to contribute anything.
It takes advantage of machine learning progress. Advances in the ability to model more things will provide more flexibility for Step 1. Advances in creating more efficient models will benefit Step 2. Advances in making models easier to control will benefit Step 3.
It makes disagreements precise. Disagreements about how to perform Step 2 (modeling the effects of values) and Step 3 (grounding values in physical observables) can be resolved through experiments. Disagreements about Step 1 (what data should be considered relevant) can be discussed over concrete data points rather than abstract arguments.
This seems like a feasible starting point, and, assuming people can ever agree on what “values” are, it should converge on the true model.
It gives people something to optimize that might actually help with safety.