From a software engineering perspective, misalignment is like a defect or a bug in software. Generally speaking, a piece of software that doesn’t accept any user input is going to have fewer bugs than software that does. For a piece of software that doesn’t accept any input, or accepts only tightly constrained user input, it’s possible to formally prove that the software logic is correct with respect to its specification. Think of the specialized software that controls nuclear power plants. To my knowledge, it’s not possible to prove that software that accepts arbitrary, unconstrained instructions from a user is defect-free.
I claim that the Observer is the easiest ASI to align because it doesn’t accept any instructions after it’s been deployed and has a single very simple goal that avoids dealing with messy things like human happiness, human meaning, human intent, etc. I don’t see how it could get simpler than that.
Fair enough, you have a lot more experience, and I could be totally wrong on this point.
At this point, if I’m going to do anything, it should probably be getting hands-on and actually trying to build an aligned system with RLHF or some other method.
Thank you for engaging on this and my previous posts, Seth!