One can make philosophical arguments for or against there being a “reason to assume that the AI would be incapable of modeling what an extremely simplistic model of hedonic utilitarian would prefer.” We instead take an empirical approach to the question.
In Figure 2, we measure the scaling trend of a model’s understanding of utilitarianism. In general, the largest models perform best, but we have not found a clear scaling law, so just how good future models will be remains an open question.
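To make the kind of analysis behind Figure 2 concrete, here is a minimal sketch of checking for a scaling law by fitting a log-linear trend of benchmark accuracy against parameter count. The data points are hypothetical placeholders, not our actual results; the fit and R² just illustrate one way to quantify how “clean” a scaling trend is.

```python
import numpy as np

# Hypothetical (parameter count, benchmark accuracy) pairs --
# placeholders for illustration, not the actual Figure 2 data.
params = np.array([1.3e8, 3.5e8, 1.3e9, 6.7e9, 1.75e11])
accuracy = np.array([0.62, 0.66, 0.71, 0.74, 0.81])

# Fit accuracy as a linear function of log10(parameters),
# a common first pass when looking for a scaling law.
log_n = np.log10(params)
slope, intercept = np.polyfit(log_n, accuracy, 1)

# R^2 measures how well the trend fits; a low value would match
# the observation that no clear scaling law emerges.
pred = slope * log_n + intercept
ss_res = np.sum((accuracy - pred) ** 2)
ss_tot = np.sum((accuracy - accuracy.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"accuracy ~ {slope:.3f} * log10(N) + {intercept:.3f}, R^2 = {r_squared:.3f}")
```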
Future questions I’m interested in: How robust is a model’s knowledge of human wellbeing? Is that knowledge robust enough to be used as an optimization target? And how does knowledge of human wellbeing scale compared with knowledge of other concepts?