I largely agree. I think you don’t need any of those things to have a shot, but you do need them to be certain.
To your point 1, I think you can reduce the need for very precise interpretability by making the alignment target simpler. I wrote about this a little here, but there’s a lot more to be said and analyzed. That might help with RL techniques too.
If you believe in natural latent variables, which I tend to, those should help with the stability problem you mention.
WRT subagents having different goals: you do need to design the system so the primary goals are dominant, which would be tricky to be certain of. I’d hope a self-aware and introspective agent could help enforce that.