Yes, that’s actually the reason why I wanted to tackle the “treacherous turn” first: to look for a general design that would let us trust the results of tests, and then build on that. My order of priority is:
1) make sure we don’t get tricked, so that we can trust the results of what we do;
2) make the AI do the right things.
I’m referring to 1) here.
Also, as mentioned in another comment on the main post, part of the AI’s utility function evolves to understand human values, so I still don’t quite see why it shouldn’t work. I envisage the utility function as the union of two parts: one where we have described the goal for the AI, which shouldn’t change across iterations, and another encoding human values, which will be learnt and updated. This total utility function is common to all agents, including the AI.
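The two-part structure above could be sketched roughly as follows. This is only an illustrative assumption of what such a composite utility function might look like; the class, method names, and the crude running-average update rule are all hypothetical, not anything specified in the comment.

```python
class CompositeUtility:
    """Hypothetical sketch: a fixed goal term plus a learned
    human-values term, shared by all agents including the AI."""

    def __init__(self, goal_fn):
        self.goal_fn = goal_fn      # fixed part: not changed by iterations
        self.value_weights = {}     # learned part: model of human values

    def update_values(self, observations):
        # Update only the human-values part from observed feedback.
        # A simple exponential moving average stands in for a real
        # value-learning rule (an assumption for illustration).
        for feature, signal in observations.items():
            prev = self.value_weights.get(feature, 0.0)
            self.value_weights[feature] = 0.9 * prev + 0.1 * signal

    def utility(self, outcome):
        # Total utility = fixed goal term + learned human-values term.
        learned = sum(self.value_weights.get(f, 0.0) * v
                      for f, v in outcome.items())
        return self.goal_fn(outcome) + learned


u = CompositeUtility(goal_fn=lambda o: o.get("task", 0.0))
u.update_values({"honesty": 1.0})           # only the learned part moves
total = u.utility({"task": 2.0, "honesty": 1.0})
```

The point the sketch tries to capture is the separation: iterations may rewrite `value_weights`, but `goal_fn` stays untouched.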
“Betrayal” is not the main worry. Given that you prevent the AGI from understanding what people want, it’s likely that it won’t do what people want.
Have you read Bostrom’s book Superintelligence?