If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We can now request that the untrusted AI produce along with any code it writes a spec and certificate verifying that the code matches the spec.
How bad of a position does this put the untrusted model in, when it’s trying to write code that does bad stuff? Some sub-questions towards answering that question:
How tight are the constraints that the formal verification process, at its AI-driven heights, can put on the code? Seems like once something is formally verified at all this rules out large classes of vulnerabilities an AI might exploit. Can we push this further to remove almost all attack vectors?
How easy is it to write a spec that looks good to humans, but actually contains a significant omission?
How feasible is it for the spec writer to collude with the code writer?
A good ask for frontier AI companies, for avoiding massive concentration of power, might be:
“don’t have critical functions controllable by the CEO alone or any one person alone, and check that this is still the case / check for backdoors periodically”
since this seems both important and likely to be popular.
A Berkeley professor speculates that LLMs are doing something more like “copying the human mind” than “learning from the world.” This seems like it would imply some things we already see (e.g. fuzzily, they’re “not very creative”), and it seems like it would imply nontrivial things for what we should expect out of LLMs in the future, though I’m finding it hard to concretize this.
That is, if LLMs are trained with a simple algorithm and acquire functionality that resembles that of the mind, then their underlying algorithm should also resemble the algorithm by which the mind acquires its functionality. However, there is one very different alternative explanation: instead of acquiring its capabilities by observing the world in the same way as humans, LLMs might acquire their capabilities by observing the human mind and copying its function. Instead of implementing a learning process that can learn how the world works, they implement an incredibly indirect process for scanning human brains to construct a crude copy of human cognitive processes.
This is also discussed here. LLMs seem to do a form of imitation learning (evidence: the plateau in base LLM capabilities roughly at human level), while humans, and animals generally, predict reality more directly. The latter is not a form of imitation and therefore not bounded by the abilities of some imitated system.
If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We can now request that the untrusted AI produce along with any code it writes a spec and certificate verifying that the code matches the spec.
How bad of a position does this put the untrusted model in, when it’s trying to write code that does bad stuff? Some sub-questions towards answering that question:
How tight are the constraints that the formal verification process, at its AI-driven heights, can put on the code? Seems like once something is formally verified at all this rules out large classes of vulnerabilities an AI might exploit. Can we push this further to remove almost all attack vectors?
How easy is it to write a spec that looks good to humans, but actually contains a significant omission?
How feasible is it for the spec writer to collude with the code writer?
A good ask for frontier AI companies, for avoiding massive concentration of power, might be:
“don’t have critical functions controllable by the CEO alone or any one person alone, and check that this is still the case / check for backdoors periodically”
since this seems both important and likely to be popular.
A Berkeley professor speculates that LLMs are doing something more like “copying the human mind” than “learning from the world.” This seems like it would imply some things we already see (e.g. fuzzily, they’re “not very creative”), and it seems like it would imply nontrivial things for what we should expect out of LLMs in the future, though I’m finding it hard to concretize this.
This is also discussed here. LLMs seem to do a form of imitation learning (evidence: the plateau in base LLM capabilities roughly at human level), while humans, and animals generally, predict reality more directly. The latter is not a form of imitation and therefore not bounded by the abilities of some imitated system.