I’m really glad this exists. I think universal frameworks like this, which connect the different ways of thinking across the whole field of AI safety, are really helpful for letting people relate their research to each other and for making the field more unified and cohesive as a whole.
Also, you didn’t mention it explicitly in your post, but I think it’s worth pointing out how this breakdown maps onto the outer alignment vs. inner alignment distinction drawn in “Risks from Learned Optimization.” Specifically, I see the outer alignment problem as mapping directly onto the ideal-design gap and the inner alignment problem as mapping onto the design-emergent gap. That being said, the framework is a bit different, since inner alignment is specifically about mesa-optimizers in a way that the design-emergent gap isn’t, so perhaps it’s better to say inner alignment is a subproblem of resolving the design-emergent gap.
Thanks Evan, glad you found this useful! The connection with the inner/outer alignment distinction seems interesting. I agree that the inner alignment problem falls in the design-emergent gap. I’m not sure about the outer alignment problem matching the ideal-design gap, though: I would classify tampering problems as outer alignment problems, yet they are caused by flaws in the implementation of the base objective, which would place them in the design-emergent gap rather than the ideal-design gap.