How do the agents “know” they are even in a situation where they need to deceive?
Each superintelligence would have been trained on a very large set of human-generated examples, with SGD or other training mechanisms penalizing wrong answers so that they become less likely.
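For concreteness, here is a minimal sketch of what "wrong answers lead to training feedback" means mechanically: one SGD step on a cross-entropy loss. The model, data, and names are placeholders for illustration, not anyone's actual training setup.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the agent: any model mapping questions
# (here, feature vectors) to answer logits.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def training_step(questions, correct_answers):
    """One SGD update: probability mass the model puts on wrong answers
    raises the loss, and the gradient step makes those answers less likely."""
    optimizer.zero_grad()
    logits = model(questions)                 # model's answer distribution
    loss = loss_fn(logits, correct_answers)   # penalizes mass on wrong answers
    loss.backward()                           # the training feedback signal
    optimizer.step()                          # weights move to reduce that loss
    return loss.item()

# Toy batch: 32 "questions" with known correct answers.
loss = training_step(torch.randn(32, 128), torch.randint(0, 10, (32,)))
```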
So when the agent considers giving a wrong answer to a question, it has no context: it doesn't know whether it is in a simulation where that answer will cause a weight update, or not. And even introspecting on whether it should deceive "this time" costs capacity that could otherwise go toward maximizing correct answers, so that neural circuitry will likely get removed by the training mechanisms, since it isn't reducing loss.
Drexler is proposing querying the agents "context free". Think of it as spinning up a canned copy from your model zoo: the agent knows nothing but the question in front of it. And these superintelligences are the 'winners'; they have maximized their scores across a very large number of test tasks.
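A minimal sketch of what such a context-free query could look like, assuming a frozen snapshot loaded fresh for each question and discarded afterward. The checkpoint path and helper name are hypothetical, not Drexler's actual proposal.

```python
import torch

def answer_context_free(question_features: torch.Tensor) -> int:
    """Spin up a fresh copy of the frozen agent, answer one question,
    then discard it. No memory, no conversation history, no signal about
    whether this query is a test or a deployment task."""
    # Hypothetical checkpoint: a canned snapshot from the model zoo.
    agent = torch.load("model_zoo/agent_snapshot.pt")
    agent.eval()
    with torch.no_grad():
        logits = agent(question_features.unsqueeze(0))
    return int(logits.argmax(dim=-1))  # the agent sees only this one question
```

The point of the sketch is that every query looks the same from the inside: a lone question with no surrounding state from which to infer "am I being trained or deployed?"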