A long time ago, I worked on an idea I called “Context Gems.” A friend asked me to make a short, publicly accessible writeup for it. This is a really rough writeup of the idea, based on some old notes I had lying around.
Context Gems was an attempt at coming up with a theoretical way of safely eliciting superhuman capabilities from an AI. It was a fairly direct offshoot of e.g. OSNR. I later heard it was similar to some ideas Paul Christiano had a long time ago, like imitative generalization/learning the prior.
The goal is to get effective cognitive labor out of a superintelligent but possibly misaligned AI model without causing catastrophe. One insight is that, for a given goal, we want to achieve it while receiving a minimal amount of input/information from the superintelligent model. (Another way to phrase this is that we want to maximize the density of “helpful” information in the model’s outputs, to reduce its opportunities for hijacking the human operators).
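One way to make the density framing precise (this formalization is my own gloss, not from the original notes): let $g$ be a gem, $|g|$ its length in bits, $U(g)$ the performance lift it produces, and $u^{*}$ some target lift. Then we want something like

```latex
% My own gloss, not from the original notes: maximize useful-information
% density, or, in constrained form, find the shortest gem clearing a
% target performance lift u*.
\[
  g^{*} \;=\; \arg\max_{g}\, \frac{U(g)}{|g|}
  \qquad \text{or, in constrained form,} \qquad
  g^{*} \;=\; \arg\min_{g}\, |g| \;\;\text{s.t.}\;\; U(g) \ge u^{*}
\]
```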
Our setup is to train/use a strong (possibly superhuman) model to generate short strings (“context gems”). We intend these strings to contain some insight about the problem domain. To verify the usefulness of these short strings, we feed them into the context of a weaker model, which we can assess on some benchmarks. We may “reward” the strong model based on the performance of the weak model on some problem domain. (A rough code sketch of this loop is given below.)
I had Claude make me a flow-chart.
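Here is a minimal Python sketch of the loop. All of the interfaces (`StrongModel`, `WeakModel`, `Benchmark`) and the token cap are hypothetical stand-ins of mine, not anything specified in the original notes:

```python
# Minimal sketch of the Context Gems loop. The interfaces below are
# hypothetical stand-ins; only the overall shape comes from the writeup.
from typing import Protocol

MAX_GEM_TOKENS = 64  # assumed cap on gem length, to bound information flow


class StrongModel(Protocol):
    """The strong, possibly misaligned model that proposes gems."""
    def generate_gem(self, max_tokens: int) -> str: ...
    def reinforce(self, gem: str, reward: float) -> None: ...


class WeakModel(Protocol):
    """The trusted weaker model; the benchmark queries it with a gem in context."""
    ...


class Benchmark(Protocol):
    def score(self, model: WeakModel, context: str) -> float: ...


def gem_reward(gem: str, weak: WeakModel, bench: Benchmark) -> float:
    # Reward = the benchmark lift from placing the gem in the weak
    # model's context, relative to an empty context.
    baseline = bench.score(weak, context="")
    boosted = bench.score(weak, context=gem)
    return boosted - baseline


def training_step(strong: StrongModel, weak: WeakModel,
                  bench: Benchmark) -> tuple[str, float]:
    gem = strong.generate_gem(max_tokens=MAX_GEM_TOKENS)
    reward = gem_reward(gem, weak, bench)
    strong.reinforce(gem, reward)  # e.g. an RL-style policy update
    return gem, reward
```

One design choice worth noting: the reward is the *lift* over the no-gem baseline, so the strong model gains nothing from whatever the weak model already knows.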
If the gems do dramatically improve the performance of the weak model, then perhaps they contain some true insight about the problem domain. Then, if the strings are sufficiently short, we might be able to expose them to humans without causing catastrophe, hopefully resulting in similar performance improvements for humans.
Some things we might think about:

- What should be true of these “context gems”? Perhaps we want them to be short. Are there other restrictions we want to put on them?
- What are the right problem domains to select?
- If you were an adversarial/misaligned superintelligence, how might you attempt to take over the world given such a setup?
- What sort of information might actually go into these context gems?
I’d also be interested in any empirical evidence that such a setup would work.
Why is this not just a description of an adversarial attack loop on the weak AI model, and why would it not just produce the usual short adversarial strings of gibberish (for LLMs) or handfuls of pixel perturbations (for vision models or VLMs), which are generally completely useless to humans and contain no useful information?
My reply to both your comment and @Chris_Leong’s is that you should simply use robust benchmarks on which high performance is interesting.
In the adversarial attack context, the attacker’s objectives are not generally beyond the model’s “capabilities.”
I don’t think that’s possible, because an attacker (LLM) can program a victim LLM to emit arbitrary text, so with enough attacks, you can solve any benchmark within the attacker’s capability (thereby defeating the safety point entirely, because now it’s just a very expensive way to use an unsafe model), or otherwise brute-force the benchmark by inferring the hidden answers and then creating the adversarial example which elicits them (like p-hacking: just keep trying things until you get below the magic threshold). See backdoors, triggers, dataset distillation… “A benchmark” is no more of a barrier than “flipping a specific image’s class”.
We need not provide the strong model with access to the benchmark questions.
Depending on the benchmark, it can be difficult or impossible to encode all the correct responses in a short string.
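A back-of-the-envelope calculation makes this concrete (the numbers here are my own illustrative assumptions, not from the thread):

```python
# Back-of-the-envelope illustration (mine, not from the thread): a short
# gem does not have enough bits to memorize a large benchmark's answer key.
import math

GEM_TOKENS = 64                     # assumed cap on gem length
BITS_PER_TOKEN = math.log2(50_000)  # ~15.6 bits/token for a 50k vocabulary
gem_capacity = GEM_TOKENS * BITS_PER_TOKEN  # ~1,000 bits

NUM_QUESTIONS = 1_000               # hypothetical benchmark size
CHOICES_PER_QUESTION = 4
answer_key = NUM_QUESTIONS * math.log2(CHOICES_PER_QUESTION)  # 2,000 bits

print(f"gem capacity: ~{gem_capacity:,.0f} bits")
print(f"answer key:   ~{answer_key:,.0f} bits")
# Since capacity < answer key, a gem that lifts performance across the
# whole benchmark must compress: it has to carry generalizable structure
# rather than a lookup table of answers.
```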
If the strong AI has knowledge of the benchmarks (or can make correct guesses about how these were structured), then it might be able to find heuristics that work well on them, but not more generally. Some of these heuristics might still strike humans as more likely true than not.
This still seems like a useful technique if the more powerful model isn’t much more powerful.