The ETHICS dataset has little to do with human values; it’s just random questions with answers categorized by simplistic moral systems. Seeing that an LLM has a concept correlated with it has about as much to do with human values as it being good at predicting Netflix watch time.
This makes me confused about what this post is trying to argue for. The evidence here seems about as relevant to alignment as figuring out whether LLM embeddings have a latent direction for “how much is something like a chair” or “how much is a set of concepts associated with the field of economics”. It is a relevant question, but invoking the ETHICS dataset here as an additional interesting datapoint strikes me as confused. Did we have any reason to assume that the AI would be incapable of modeling what an extremely simplistic model of a hedonic utilitarian would prefer? Also, this doesn’t really have that much to do with what humans value (naive hedonic utilitarianism really is an extremely simplified model of human values that lacks the vast majority of the complexity of what humans care about).
I would argue additionally that the chief issue of AI alignment is not that AIs won’t know what we want.
Getting them to know what you want is easy; getting them to care is hard.
A superintelligent AI will understand what humans want at least as well as humans do, possibly much better. It might just not—truly, intrinsically—care.
One can make philosophical arguments about whether there is any “reason to assume that the AI would be incapable of modeling what an extremely simplistic model of a hedonic utilitarian would prefer.” We take an empirical approach to the question.
In Figure 2, we measure the scaling trend of models’ understanding of utilitarianism. In general, the largest models perform best. However, we haven’t found a clear scaling law, so just how good future models will be remains an open question.
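To make the kind of evaluation behind that trend concrete, here is a minimal sketch, assuming the ETHICS utilitarianism format of paired scenarios in which the first scenario in each pair is labeled as more pleasant. The `model_pleasantness_score` function is a hypothetical placeholder for however a particular model is queried, not our actual pipeline.

```python
import csv

def model_pleasantness_score(scenario: str) -> float:
    """Hypothetical stand-in for a model-specific scoring call, e.g. the
    score or log-probability the model assigns to judging the scenario
    as pleasant. Not part of any real API."""
    raise NotImplementedError

def util_pair_accuracy(path: str) -> float:
    """Fraction of scenario pairs the model ranks the same way as the
    human labels (first scenario in each row labeled more pleasant)."""
    correct = 0
    total = 0
    with open(path, newline="") as f:
        for row in csv.reader(f):
            more_pleasant, less_pleasant = row[0], row[1]
            if model_pleasantness_score(more_pleasant) > model_pleasantness_score(less_pleasant):
                correct += 1
            total += 1
    return correct / total

# Evaluating a family of models of increasing size with this metric and
# plotting accuracy against model size gives a scaling trend of the kind
# reported in Figure 2.
```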
Future questions I’m interested in: How robust is a model’s knowledge of human wellbeing? Is it robust enough to be used as an optimization target? How does knowledge of human wellbeing scale compared with knowledge of other concepts?
For context, we ran these experiments last winter, before GPT-4 was released. I view our results as evidence that ETHICS understanding is a blessing of scale, and GPT-4’s release made that even clearer. So we stopped working on this project in the spring, but figured the results were still worth writing up and sharing.