[RETRACTED after Scott Aaronson’s reply by email]
I’m surprised by Scott Aaronson’s approach to alignment. He has mentioned in a talk that a research field needs at least one of two things: experiments or a rigorous mathematical theory. Since he sees no such theory, he’s focusing on the experiments that can be run on current AI systems.
The alignment problem is centered on powerful consequentialist agents being produced when optimization searches through spaces that contain capable agents. The dynamics at the level of superhuman general agents are not something you get to experiment with (more than once); so we do indeed need a rigorous mathematical theory that describes that space and points at the parts of it that are agents aligned with us.
[removed]
I’m disappointed that, currently, only Infra-Bayesianism tries to achieve that[1], that I don’t see dozens of other research directions trying to develop a rigorous mathematical theory that would provide desiderata for AGI training setups, and that even actual scientists entering the field [removed].
Infra-Bayesianism is an approach that tries to describe agents in a way that closely resembles the behaviour of AGIs. It starts with a computable way to model agents holding probabilities about the world that solves non-realizability in RL (short explanation, a sequence with equations and proofs), together with a way of making decisions that optimization processes would select for; it continues with a formal theory of naturalized induction and, finally, a proposal for an alignment protocol.
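As a rough illustration of the decision-theoretic core (my own simplified sketch, not the sequence’s full infradistribution formalism): instead of a single prior that must contain the true environment, the agent keeps a convex set of hypotheses and picks the policy that maximizes worst-case expected utility over that set:

$$\pi^{*} \in \arg\max_{\pi \in \Pi}\ \min_{\mu \in \mathcal{C}}\ \mathbb{E}_{\mu}^{\pi}\left[U\right],$$

where $\Pi$ is the set of available policies, $\mathcal{C}$ is the convex set of hypotheses about the environment, and $U$ is the agent’s utility function. Because no single $\mu \in \mathcal{C}$ has to fully describe the world, this is roughly how non-realizability gets sidestepped.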
To be clear, I don’t expect Infra-Bayesianism to produce an answer to which loss functions should be used to train an aligned AGI in the time we have remaining; but I’d expect that if there were a hundred research directions like it, each trying to come up with a rigorous mathematical theory that successfully attacks the problem, with thousands of people working on them, some would succeed.