Maybe you’ve already got this on your list of concrete examples, but while reading the original post I thought of a concrete way that a parameter-frozen model above some critical intelligence threshold could gain dangerous amounts of intelligence and power, given only compute, access to data, and data storage.
I’m not saying that this is particularly probable, just that it seems like a path worth adding to a concrete list of ‘ways things could go tricky’.
Imagine a model-system whose primary component is a large parameter-frozen language model, wrapped in some simple RL scaffold that gives it goals. It can write text to and read text from its data storage: recording hypotheses, checking them against its data or its training environment, and logging the results. In principle, the system of self-edited data plus frozen model could bootstrap its way to being a smarter overall entity, then design prompts for its language model component in a directed way to accelerate the process further, and go completely off the rails without even a single additional SGD update. Thus it could ‘leave behind’ the similarly parameter-frozen auditor model.
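To make the loop I have in mind concrete, here’s a minimal sketch in Python. Everything here is a hypothetical stand-in (`frozen_model`, `run_experiment`, the `Scratchpad` class are all invented for illustration, not any real API); the point is just that the weights never change, and all accumulated capability lives in the self-edited storage that conditions each new prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    """External data storage the system can freely read and write.
    (Hypothetical; stands in for whatever storage the system is given.)"""
    notes: list = field(default_factory=list)

    def read(self) -> str:
        return "\n".join(self.notes)

    def write(self, text: str) -> None:
        self.notes.append(text)


def frozen_model(prompt: str) -> str:
    """Stand-in for the parameter-frozen language model.
    Its parameters never update; only its input (the prompt) changes."""
    return f"hypothesis conditioned on: {prompt[-80:]}"


def run_experiment(hypothesis: str) -> str:
    """Stand-in for checking a hypothesis against stored data
    or the training environment."""
    return f"observed result of testing: {hypothesis}"


def bootstrap_step(pad: Scratchpad) -> None:
    # No SGD anywhere in this loop: the system "improves" only by
    # editing the text that will condition its next forward pass.
    prompt = pad.read()
    hypothesis = frozen_model(prompt)
    result = run_experiment(hypothesis)
    pad.write(f"{hypothesis} -> {result}")


pad = Scratchpad(notes=["initial goal supplied by the RL wrapper"])
for _ in range(3):
    bootstrap_step(pad)
print(pad.read())
```

Each iteration leaves the model identical to the last, so an auditor checking the frozen weights would see nothing change, even as the scratchpad (and hence the effective behavior of the overall system) drifts arbitrarily far.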
[Edit: This writing could include the sort of notes a human might write for themselves, perhaps encoded in a self-invented cipher. Or it could look more like adversarial samples: tokens in weird out-of-distribution combinations, a sort of self-programming. Or some combination of the two. For an intuition pump for the human-note style, imagine an immortal human scientist with a poor memory trapped in a very hard escape room, determined to escape. They’d keep notes on their ideas for solving the room’s huge number of intentional puzzles, on the exploits they discovered, and on the processes they used to discover those things. If they thought the maker of the room would spy on their notes and adjust the difficulty to keep them trapped forever, they’d encrypt their notes and devote part of their limited memory to constantly re-memorizing the encryption key.]