Yes, the Auto-GPT approach does evaluate its potential plans against the goals it was given. So all you have to do to get some decent starting alignment is give it good high-level goals (which isn’t trivial; don’t tell it to reduce suffering or you may find out too late it had a solution you didn’t intend...). But because it’s also pretty easily interpretable, and can be made to at least start as corrigible with good top-level goals, there’s a shot at correcting your alignment mistakes as they arise.
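To make the "evaluate plans against given goals" step concrete, here's a toy sketch of that loop. The names and scoring are purely illustrative assumptions on my part; in the real Auto-GPT the evaluation is done by prompting the LLM, not by keyword overlap.

```python
# Toy sketch of evaluating candidate plans against top-level goals.
# score_plan is a stand-in for an LLM judging goal-relevance; the
# real system would prompt the model rather than compare words.

def score_plan(plan: str, goals: list[str]) -> int:
    """Count how many goals share at least one word with the plan."""
    words = set(plan.lower().split())
    return sum(1 for g in goals if set(g.lower().split()) & words)

def pick_plan(candidates: list[str], goals: list[str]) -> str:
    """Choose the candidate plan that best matches the given goals."""
    return max(candidates, key=lambda p: score_plan(p, goals))

goals = ["summarize the report", "stay within budget"]
candidates = [
    "buy more compute",
    "summarize the report section by section",
]
print(pick_plan(candidates, goals))
```

The point of the sketch is the corrigibility hook: because the goals sit in a plain, inspectable list rather than in opaque weights, you can read them, and edit them, while the agent runs.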