Hmmm, it’s good to know my thesis wasn’t very clear.
The idea is to train an AI to hold our values as its end goals. It doesn't solve inner alignment issues, granted. But say the AI wants to maximize paperclips; it would then be constrained not to damage our survival, etc., while making paperclips.
I was trying to figure out what set of values we are even trying to give an AGI in the first place, and this was my best guess: whatever else you do, optimize humanity's instrumentally convergent goals.