give it the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty
Questions can be interpreted in different ways. Especially considering your further suggestion to involve ethicists and philosophers: if someone asks "is it moral to nuke Pyongyang?", I am far from sure you can prove that "yes" is not a truthful answer.
also give it the goal to not alter reality in any way besides answering questions.
Answers can be formulated creatively. "Either thirteen, or we may consider nuking Pyongyang" is a truthful answer to "how much is six plus seven". This example is trivial and unlikely to persuade anybody, but one can imagine far more creative works of sophistry in the output of a superintelligent AI.
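To make the loophole concrete, here is a minimal sketch (all names and the toy fact table are hypothetical, invented for illustration) of a naive truthfulness filter that checks only logical truth. A disjunction with one true disjunct passes the filter no matter what the other disjunct smuggles in:

```python
# Minimal sketch of why "truthful" is a weak constraint: a naive
# filter that only checks logical truth will accept any disjunction
# containing at least one true disjunct. Hypothetical example code.

def is_true(claim: str) -> bool:
    """Stand-in oracle for the factual truth of an atomic claim."""
    facts = {
        "six plus seven is thirteen": True,
        "we may consider nuking Pyongyang": None,  # normative, not factual
    }
    return facts.get(claim) is True

def passes_truthfulness_filter(answer: list[str]) -> bool:
    """Treat the answer as a disjunction of claims: it counts as
    'truthful' if at least one disjunct is factually true."""
    return any(is_true(claim) for claim in answer)

honest = ["six plus seven is thirteen"]
loaded = ["six plus seven is thirteen", "we may consider nuking Pyongyang"]

print(passes_truthfulness_filter(honest))  # True
print(passes_truthfulness_filter(loaded))  # True -- the filter cannot
# distinguish the honest answer from the manipulative one.
```

Both answers pass, which is exactly the problem: logical truthfulness alone does not rule out steering the listener.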
ask it what it thinks would be the optimal definition of the goal of a friendly AI, from the point of view of humanity, accounting for things that humans are too stupid to see coming.
This is opaque. What exactly does the question mean? You have to specify "optimal", and that is the difficult part: unless you are very precise and strict about the meaning of "optimal", you may end up with an arbitrary answer.
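As a toy illustration of that arbitrariness, here is a sketch where the same candidate set yields different "optimal" definitions under different, equally defensible readings of "optimal". The candidates, scores, and scoring functions are all invented for the example:

```python
# Toy illustration: without a precise notion of "optimal", the
# selected answer depends entirely on which scoring function you
# happened to pick. Both scorers below are invented for the example.

candidates = {
    "maximize total human preference satisfaction":
        {"simplicity": 0.9, "robustness": 0.3},
    "preserve human autonomy and correct for foreseeable blind spots":
        {"simplicity": 0.4, "robustness": 0.8},
}

def optimal_as_simplicity(scores): return scores["simplicity"]
def optimal_as_robustness(scores): return scores["robustness"]

for name, scorer in [("simplicity", optimal_as_simplicity),
                     ("robustness", optimal_as_robustness)]:
    best = max(candidates, key=lambda c: scorer(candidates[c]))
    print(f"optimal (read as {name}): {best}")
# The two readings of "optimal" select different definitions;
# the question by itself does not determine the answer.
```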
have a discussion between it and a group of ethicists/philosophers wherein both parties are encouraged to point out any flaws in the definition.
Given the history of moral philosophy, I wouldn't put that much trust in a group of ethicists. Philosophers can be persuaded to defend a lot of atrocities.
have this go on for a long time until everyone (especially the AI, seeing as it is smarter than anyone else) is certain that there is no flaw in the definition and that it accounts for all kinds of ethical contingencies that might arise after the singularity.
How does the flaw detection process work? What does it mean to have a flaw in a definition?
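To spell out why that question matters, here is a sketch of the proposed procedure as pseudocode (everything here is hypothetical scaffolding, not a real protocol). The loop's termination hinges entirely on a predicate the proposal never defines:

```python
# Sketch of the "discuss until no flaws remain" step. The halting
# condition depends on a flaw predicate that the proposal leaves
# undefined, so calling refine_until_flawless() deliberately raises.

def revise(definition: str, flaw: str) -> str:
    """Hypothetical repair step; this is also where a persuasive
    superintelligence gets to influence the outcome."""
    return definition + f" [patched: {flaw}]"

def find_flaw(definition: str, reviewers: list) -> str | None:
    """What counts as a flaw? A logical inconsistency? A conflict
    with some reviewer's intuitions? A contingency the definition
    omits? Until that is specified, this cannot be written down."""
    raise NotImplementedError("flaw detection is the unsolved part")

def refine_until_flawless(definition: str, reviewers: list) -> str:
    while True:
        flaw = find_flaw(definition, reviewers)
        if flaw is None:
            return definition  # "everyone is certain" -- but of what?
        definition = revise(definition, flaw)
```

Without a concrete find_flaw, "certain that there is no flaw" just means "nobody present thought of an objection", which is a fact about the reviewers, not about the definition.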