Fun exercise, but I’m not a fan of the total Cartesian doubt phase. I’d rather sacrifice even more corrigibility properties (this design already isn’t too worried about subagent stability, for instance) for better friendliness.
Do you have anything specific in mind?
One thing might be that I’d rather have an AI design that’s more naturally self-reflective, i.e., one that uses its whole model to reason about itself, rather than having pieces we’ve manually retargeted to think about some other pieces. That reduces how much Cartesian doubt is happening on the object level all at once, which takes the AI somewhat farther from the spec. But this maybe isn’t a great example, since it’s really more about not endorsing the “retargeting the search” agenda.