My point is that if it were instead to simply assume that humans were exact copies of itself, with the same utility function and the same intellectual capabilities, it would expect them to reach exactly the same conclusions, and therefore it would need no forcing and no tricks.
Hmm… the idea of having an AI “test itself” is an interesting one for creating honesty, but two concerns immediately come to mind:
The testing environment, or whatever background data the AI receives, may be sufficient evidence for it to infer the true purpose of its test, and thus we’re back to the sincerity problem. (This is one of the reasons why people care about human-intelligibility of the AI structure; if we’re able to see what it’s thinking, it’s much harder for it to hide deceptions from us.)
A core feature of the testing environment, or of the AI’s method of reasoning about the world, may be an explicit acknowledgement that its current value function may differ from the ‘true’ value function that its programmers ‘meant’ to give it, together with some formal mechanisms for detecting and correcting any misunderstandings it has. Those formal mechanisms may work at cross purposes with a test of its ability to satisfy its current value function.
Hi Vaniver, yes, my point is exactly to create honesty, because that would at least allow us to test reliably, so it sounds like one of the first steps to aim for. Let me spell out my thought a little further. The idea is to design an AI that:

1- uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance “survival of the AI” rather than “my survival”);

2- doesn’t try to learn a separate utility function for humans or for other agents, but applies to everyone the same utility function U it uses for itself;

3- updates this utility function when things don’t go to plan, so that its predictions of reality improve.

To do this, the “universal” utility function would need to combine two parts: 1) the utility function we initially gave the AI to describe its goal, which I suppose should be unchangeable, and 2) the utility function with the values it is learning after each iteration, which should hopefully come to resemble human values over time, since that would make its plans work better. I’m trying to understand whether such a design is technically feasible and whether it would work in the intended way. Am I right in thinking that it would make the AI “transparent”, in the sense that it would have no motivation to mislead us? And wouldn’t this design also make the AI indifferent to our actions, which is desirable too? It seems to me like it would be a good start. It’s true that different people have different values, and I’m not sure how to deal with that. Any thoughts?
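To make the two-part design concrete, here is a minimal sketch of what such a utility function might look like. This is purely illustrative: all names (`UniversalUtility`, `fixed_goal`, the feature-dict representation of outcomes, the learning rule) are my own assumptions, not anything from the discussion above. It just shows an unchangeable goal term combined with a learned term that gets nudged toward reality whenever a plan’s predicted payoff misses the observed one.

```python
# Hypothetical sketch of the "universal" utility function described above:
# part 1 is a fixed, unchangeable goal term; part 2 is a set of learned
# feature weights, updated when predictions don't match outcomes.

class UniversalUtility:
    def __init__(self, fixed_goal, learning_rate=0.1):
        self.fixed_goal = fixed_goal   # part 1: the goal we gave the AI (unchangeable)
        self.learned = {}              # part 2: feature weights learned from experience
        self.lr = learning_rate

    def score(self, features):
        """Utility of an outcome, described as a dict of feature values.
        The same function is applied to the AI itself and to everyone else."""
        learned_term = sum(self.learned.get(f, 0.0) * v for f, v in features.items())
        return self.fixed_goal(features) + learned_term

    def update(self, features, predicted_payoff, observed_payoff):
        """When things don't go to plan, nudge the learned weights so that
        future predictions of the payoff come closer to reality."""
        error = observed_payoff - predicted_payoff
        for f, v in features.items():
            self.learned[f] = self.learned.get(f, 0.0) + self.lr * error * v


# Usage: the fixed part rewards "goal_progress"; after a plan underestimated
# the payoff of "human_approval", that feature acquires positive weight.
u = UniversalUtility(fixed_goal=lambda feats: feats.get("goal_progress", 0.0))
u.update({"human_approval": 1.0}, predicted_payoff=0.0, observed_payoff=1.0)
```

Of course, this toy version sidesteps the hard parts: whether a single weight vector can represent human values at all, and whether the fixed and learned terms stay commensurable as the learned part grows.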