I can barely see how this is possible if we’re talking about alignment to humans, even with a hypothetical formal theory of embedded agency. Do you imagine human values are cleanly represented and extractable, and that we can (potentially very indirectly) reference those values formally? Do you mean something else by “formalization of alignment” that doesn’t involve formal descriptions of human minds?
For examples of what a formalization of alignment could look like, see this and this.