Some interesting points there. The lottery ticket hypothesis does make it more plausible that side computations could persist longer if they come to exist outside the main computation.
Regarding the homomorphic encryption thing: yes, it does seem that it might be impossible to make small adjustments to a homomorphically encrypted computation without wrecking it. Technically I don’t think that would be a local minimum, since I’d expect the net to start memorizing the failure cases, but I suppose the homomorphic computation combined with those memorizations might be a local optimum, particularly if the input and output are encrypted outside the network itself.
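To make the "no small adjustments" intuition concrete, here's a toy sketch. This is my own construction, not real homomorphic encryption: a weight is hidden behind a random keyed permutation, so a gradient-style nudge to the encoded value decodes to an unrelated weight, and there's no smooth path of small improvements through the encoded representation.

```python
# Toy sketch (illustrative only, not real homomorphic encryption):
# store a weight under a random keyed permutation and show that every
# small perturbation of the encoded value decodes to an unrelated weight.
import random

N = 1 << 16                    # toy weight space: 16-bit integers
rng = random.Random(0)         # the "key"
perm = list(range(N))
rng.shuffle(perm)              # encode table
inv = [0] * N
for plain, cipher in enumerate(perm):
    inv[cipher] = plain        # decode table

true_weight = 12345
encoded = perm[true_weight]

# Gradient-descent-style nudges to the encoded value:
for delta in (-2, -1, 1, 2):
    decoded = inv[(encoded + delta) % N]
    print(f"delta={delta:+d} -> decoded weight off by {abs(decoded - true_weight)}")
```

Every neighbour of the encoded weight decodes to noise, so the only way training can locally improve on the encrypted computation is to patch its failures from outside it, i.e. the memorization you'd expect.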
So I concede the point on the possible persistence of an underlying goal if it were to come to exist, though not on it coming to exist in the first place.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want, so any one particular real-world goal is a priori unlikely.
For most computations, there are many more ways for that computation to occur than there are ways for it to occur while also including anything resembling actual goals about the real world. Now, if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. an agentic mask), it only takes a small increase in complexity to have that computation apply outside its normal context. So, that’s the mask takeover possibility again. Even so, no matter how small the increase in complexity, that extra step isn’t likely to be reinforced in training unless the model can self-modify or control the training environment.
> if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. an agentic mask)
As well as agentic masks, there are uses for within-network goal-directed steps, something like an optimizing compiler. For instance, a list of hashed values followed by their unhashed values isn’t particularly agenty, but the network needs to solve an optimization problem to reverse the hashes, and that’s something it could use a goal-directed reasoning section for.
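A toy version of that hash example (the brute-force search and all the names here are just illustrative): a document that lists hash(x) followed by x means that predicting the token after the hash is effectively a preimage search.

```python
# Toy sketch: predicting the next token after a (truncated) hash in
# "digest, value" training data amounts to searching for a preimage.
import hashlib
from itertools import product
from string import ascii_lowercase

def toy_hash(s: str) -> str:
    # Truncated SHA-256 so brute force stays tractable in a demo.
    return hashlib.sha256(s.encode()).hexdigest()[:8]

def invert(target: str, max_len: int = 3):
    """Brute-force search over short lowercase strings for a preimage."""
    for length in range(1, max_len + 1):
        for chars in product(ascii_lowercase, repeat=length):
            candidate = "".join(chars)
            if toy_hash(candidate) == target:
                return candidate
    return None

digest = toy_hash("cat")
# To continue "digest: ?" correctly, the network has to do the work of
# something like this search internally:
print(invert(digest))  # -> "cat"
```

Nothing about that search cares about world states; it's goal-directed only in the narrow sense of optimizing over candidate strings.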
Neither of those would (immediately) lead to real-world goals, because they aren’t targeted at real-world state (an optimizing compiler is trying to output a fast program; it isn’t trying to bring about a world state in which the fast program exists). That being said, an optimizing compiler could open a path to potentially dangerous self-improvement, where it preserves/amplifies any agency there might actually be in its own code.