My interpretation of this post is that before we can solve the AI alignment problem, we need an abstraction describing what agents are and how they plan, make decisions, and anticipate consequences. It may be easier to describe how this works in general than to describe an optimal agent system, or one that best reflects human psychology. As an analogy: to understand how proteins fold, it is easier to first develop a generalized model, and only then to apply it in practice, predict the folding of specific molecules, or use such models to produce useful products.
At first, I was skeptical of the analogy between physics or cryptography and alignment. After all, our empirical observation of the physical world just is physics, so we can expect mathematical formalisms to succeed in that domain. By contrast, empirical observation of human values is psychology, and psychology is not alignment. That gap is the basis of the problem.
However, agency seems like tractable ground for mathematical formalism: it sits at the intersection of computational, psychological, and physical constraints. I notice that when I try to articulate how agency works, rather than what makes it work well, it becomes much easier to make progress.