I’m not sure what this would look like—can you give some concrete examples?
Completely fair request. I think I was a bit vague when I said “on top or around AI systems.”
The point here is that I want to find techniques that seem to positively influence model behavior, which I can “staple on” to existing models without a gargantuan engineering effort.
I am especially excited about these ideas if they seem scalable or architecture-agnostic.
Here are a few examples of the kind of research I’m excited about:
Conciseness priors on outputs (as a way to shift cognitive labor to humans)
I think there is a reasonable story for how conciseness priors could improve alignment. This looks like the AI being more eager to use higher-level explanations that make humans guess the details, avoiding ideas that would be very hard for a human to grasp when simpler-to-human ideas would do, being honest and transparent with humans so that we can do a better job of filling in the gaps, etc.
If it is indeed the case that strong conciseness priors incentivize transparency-to-humans, then it seems worth investigating (1) how far this effect goes, and (2) whether some tweaks or clever ways of applying pressure toward conciseness work much better than others for alignment.
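As a rough illustration of what "pressure toward conciseness" could mean in practice, here is a minimal sketch that subtracts a linear length penalty from a task reward. The function names, the whitespace tokenization, and the `penalty_per_token` value are all assumptions made for illustration; nothing here is specified in the discussion above.

```python
def concise_score(task_reward: float, output: str,
                  penalty_per_token: float = 0.01) -> float:
    """Task reward minus a linear length penalty.

    Hypothetical scoring rule: tokens are approximated by whitespace
    splitting, and the penalty weight is an arbitrary illustrative value.
    """
    n_tokens = len(output.split())
    return task_reward - penalty_per_token * n_tokens


def pick_most_concise(candidates, penalty_per_token: float = 0.01) -> str:
    """Return the output with the best length-penalized score.

    `candidates` is a list of (output_text, task_reward) pairs.
    """
    best = max(
        candidates,
        key=lambda c: concise_score(c[1], c[0], penalty_per_token),
    )
    return best[0]
```

With a penalty like this, a slightly less rewarded but much shorter answer can win over a longer one. How strong the penalty should be, and where in training or decoding it is applied, are exactly the kind of open "tweaks" mentioned above.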
OSNR.
I feel pretty optimistic that applying noise to the outputs of an AI system and then having humans reconstruct it yields nice alignment properties.
However, when I role-played as a misaligned AI, I found ways to get around this process. Mainly, this involved “cheating” the noising process using selective redundancy. I’ve recently been investigating how to recover from this hack via methods like paraphrasing and calculating token salience.
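To make the selective-redundancy hack concrete, here is a small sketch under a strong simplifying assumption: the noise independently masks each token with a fixed probability. The actual noising process is not described above, so `drop_tokens`, `survival_probability`, and the `"_"` mask symbol are hypothetical names chosen only for illustration.

```python
import random


def drop_tokens(tokens, drop_prob, rng):
    """Mask each token independently with probability `drop_prob`.

    Assumed noise model (not taken from the source): masked tokens are
    replaced by "_" so a human reconstructor can see where gaps are.
    """
    return [t if rng.random() >= drop_prob else "_" for t in tokens]


def survival_probability(copies, drop_prob):
    """Probability that at least one of `copies` repeats survives masking."""
    return 1.0 - drop_prob ** copies


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    message = "the plan is fine the plan is fine".split()
    print(drop_tokens(message, drop_prob=0.5, rng=rng))
    for k in (1, 2, 3):
        print(k, survival_probability(k, drop_prob=0.5))
```

Under this model, a piece of content stated once survives with probability 0.5, while stating it three times raises that to 0.875. A model that repeats only the content it most wants preserved can therefore steer what the human reconstructs, which is the failure mode that paraphrasing and token-salience checks are meant to address.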
Other regularizations.
Speed priors.
Trying to figure out what sorts of inductive biases have nice alignment properties.
Some of Tamera’s work on externalized reasoning.
I think betting on language models scaling to ASI would not be crazy.
Investigating how we can influence “reasoning in language” seems promising, stackable, and is plausibly the type of work that can be “stapled on” to future models.
AI Safety via Debate
This just seems like something we can build once we have a really advanced language model. It stacks with everything else, and it seems to have great properties.
If someone were to discover a great regularizer for language models, OSNR turns out to work well and I sort out the issues around strategic redundancy, and externalized reasoning oversight turns out to be super promising, then we could just stack all three.
We could then clone the resulting model and run a debate, or stack on whatever other advances we’ve made by that point. I’m really excited about this sort of modularity, and I guess I’m also pretty optimistic about a few of these techniques having more bite than people may initially guess.