IMO there’s a fair chance that it’s much easier to do alignment given WBE, since it gives you a formal specification of the entire human policy instead of just some samples. For example, we might be able to go from policy to utility function using the AIT definition of goal-directedness. So, there is some case for doing WBE before TAI if that’s feasible.
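To gesture at what "policy to utility function" could look like, here is a rough complexity-penalized formulation in the spirit of the AIT approach (an illustrative sketch, not the exact definition): given the full emulated policy $\pi$, select a utility function that is simple to describe and that $\pi$ comes close to optimizing,

$$\hat{U} \in \arg\min_{U} \Big[\, K(U) + \lambda \cdot \mathrm{Regret}(\pi, U) \,\Big], \qquad \mathrm{Regret}(\pi, U) := \max_{\pi'} \mathbb{E}_{\pi'}[U] - \mathbb{E}_{\pi}[U],$$

where $K(U)$ is the description complexity of $U$ and $\lambda$ trades off simplicity against how well $U$ rationalizes the behavior. The point is that this kind of optimization needs the whole policy as input, which is exactly what WBE would give you and behavioral samples wouldn't.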