Agency overhang as a proxy for Sharp left turn
I’ve been accepted as a mentor for the next AI Safety Camp. You can apply to work with me and the team.
The deadline for applicants is November 17.
The program will run from January 11 to April 27.
Summary
Core underlying hypothesis: we believe there is a significant agency overhang in modern LLMs, meaning that a model's performance can increase substantially with the introduction of more powerful elicitation/scaffolding methods, without any improvement to the model itself, because prompting and scaffolding techniques are still in their early days. For model evaluations this means that current evaluations systematically undershoot the real level of capabilities and, by extension, the level of risk involved.
We see several important research questions that have to be answered:
Is the core assumption even true? We want to show that one can elicit peak performance using narrow, highly specialised prompts and scaffolds and locally beat general state-of-the-art performance.
How should the overhang be factored into the overall model evaluation procedure?
Is it possible to estimate the real size of the overhang (e.g. by developing an evaluation technique that measures the gap between current SOTA performance and the theoretically possible peak performance)? See the sketch after this list.
How large an increase have existing scaffolding techniques already introduced?
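To make the gap-estimation question concrete, here is a minimal sketch of how such a measurement could look: run the same task set once with a default prompt and once with a narrow, specialised prompt, and report the score difference. Everything here is hypothetical; query_model is a stand-in for whatever LLM API the team ends up using, and the toy task set exists only to make the script runnable.

```python
# Minimal sketch of measuring an "elicitation gap": the score difference between
# a default prompt and a narrow, highly specialised prompt on the same task set.
# `query_model` is a placeholder; a real experiment would call an actual LLM API.
from typing import Callable, List, Tuple

TaskSet = List[Tuple[str, str]]  # (question, expected answer)

TOY_TASKS: TaskSet = [
    ("What is 17 * 24?", "408"),
    ("What is 123 + 456?", "579"),
]

def query_model(prompt: str) -> str:
    """Placeholder model call: simulates a model that only succeeds when the
    prompt asks for step-by-step reasoning. Replace with a real API call."""
    if "step by step" in prompt:
        question = prompt.splitlines()[-1]
        expression = question.replace("What is ", "").rstrip("?")
        return str(eval(expression))  # toy stand-in for a competent answer
    return "I don't know"

def score(build_prompt: Callable[[str], str], tasks: TaskSet) -> float:
    """Fraction of tasks whose expected answer appears in the model's output."""
    hits = sum(1 for q, a in tasks if a in query_model(build_prompt(q)))
    return hits / len(tasks)

default_prompt = lambda q: q
specialised_prompt = lambda q: (
    "You are an expert arithmetician. Work step by step and give the final number.\n" + q
)

gap = score(specialised_prompt, TOY_TASKS) - score(default_prompt, TOY_TASKS)
print(f"Elicitation gap on the toy task set: {gap:.2f}")
```

Aggregating gaps like this across benchmarks and elicitation methods is one possible shape for the error-margin model mentioned in the project plan.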
We will decide which exact paths to pursue later. We expect to remain flexible and shift between paths when presented with new evidence.
The non-summary
This line of research was born as an attempt to tackle the problem known as the Sharp Left Turn.
A Sharp Left Turn (SLT) is a highly probable scenario in which the growth of an LLM's generalisation ability outpaces the growth of alignment measures, rendering those measures ineffectual, which in turn may lead to catastrophic consequences.
Assuming we continue with the transformer + scalable oversight + RLHF paradigm, you can picture an SLT as follows. Think of a model as a state graph. Via fine-tuning we prune the paths that lead towards dangerous states. Generalisation can then be viewed as an increase in the number of paths between any two nodes. In this sense an SLT is an inability to identify and prune new dangerous paths at the same rate as they are being introduced.
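A toy simulation makes this picture concrete. The sketch below is purely illustrative (the graph size, edge probabilities, and per-step rates are arbitrary assumptions, not a model of real training dynamics): "generalisation" adds random edges each step, "fine-tuning" prunes a limited number of edges that point into dangerous states, and because edges are added faster than they are pruned, dangerous states stay reachable.

```python
# Toy illustration of the state-graph picture of an SLT: "generalisation" adds
# new paths faster than "fine-tuning" can prune the ones leading to dangerous
# states. All numbers are arbitrary and chosen so that adding outpaces pruning.
import random

random.seed(0)
N_STATES = 50
DANGEROUS = set(range(45, 50))  # states we would like to make unreachable
edges = {(i, j) for i in range(N_STATES) for j in range(N_STATES)
         if i != j and random.random() < 0.05}

def reachable_from_start(edges):
    """States reachable from state 0 (simple BFS over the edge set)."""
    seen, frontier = {0}, [0]
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

ADDED_PER_STEP = 40   # generalisation: new paths keep appearing
PRUNED_PER_STEP = 2   # fine-tuning: we can only inspect and cut so many

for step in range(10):
    for _ in range(ADDED_PER_STEP):                      # generalisation
        a, b = random.sample(range(N_STATES), 2)
        edges.add((a, b))
    bad_edges = [e for e in edges if e[1] in DANGEROUS]  # fine-tuning
    for e in bad_edges[:PRUNED_PER_STEP]:
        edges.discard(e)
    remaining_bad = sum(1 for e in edges if e[1] in DANGEROUS)
    exposed = DANGEROUS & reachable_from_start(edges)
    print(f"step {step:2d}: edges into dangerous states = {remaining_bad:3d}, "
          f"reachable dangerous states = {len(exposed)}")
```

With these deliberately unfavourable rates, the count of edges into dangerous states grows every step; the point of the toy is only to show the race between pruning and path growth, not to quantify it.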
This is the connection between the SLT and the scaffolding overhang: whichever SLT scenario turns out to be more probable, it will play out in this territory between the states we can address and therefore prune, and the unaddressed peak states.
Many influential safety researchers consider the SLT one of the hard bits of alignment, a critical problem that has to be resolved in order to have a chance at a successful ASI.
There are many ways an SLT may occur; here we are trying to address only one possible route. Victoria Krakovna and researchers from MIRI produced a great analysis of the threat model. Excerpted from Refining the Sharp Left Turn threat model, part 1: claims and mechanisms:
A rapid phase transition happens if there is a capability overhang: the AI system is improving at various skills continuously, but its improvement in many domains is bottlenecked on one specific skill, and at some point it receives some input that makes its existing capabilities much more effective. Here are some ways this can happen:
Analogy to few-shot prompting: the capabilities are already present in the trained artefact. Any alignment technique that goes through gradient updates becomes irrelevant. Putting the artefact into the “right” situation (e.g., giving it a few-shot prompt) reveals its capabilities relevant to this situation. Mechanism: the relevant knowledge and capabilities are installed by some generic pre-training optimisation process.
We have considered many angles from which to approach the problem. Focusing on eliciting peak capabilities and then analysing the resulting leap in capabilities seems like the best approach.
Note: the perspective on the SLT given above is not the one used by MIRI. This is intentional. To the best of my knowledge, their model of the SLT focuses on the shape of the capabilities landscape and on the claim that the properties leading to the highest performance are the same ones leading to treacherous actions. We think this is not a useful operationalisation of the dynamic in the current situation. Instead we aim to (eventually) build a mechanistic model rooted in the current ML paradigm and later build a conceptual bridge between the two.
Theory of change
Successfully demonstrating that there is a significant margin to be gained using only existing methods could change the perspective of the governance sector; in particular, it could dispel the somewhat rosy picture painted by current evaluation measures.
Project plan
We will start with a literature review of the latest elicitation methods.
We will investigate the three current leading hypotheses about which types of methods lead to peak capabilities:
Domain-specific prompts [ref]
Better meta-thinking strategies based on the notion of model organisms, and how to integrate them efficiently [ref]
Prompt generators
The second stage is dedicated to experiments and building a base of precedents
The next stage is about identifying generalisable clusters of precedents, ranked by the increase in performance over default elicitation methods. The goal here is to build a model of the error margin of SOTA evaluation methods.
(Longshot) We will try to sketch the shape of a peak-elicitation pipeline in order to estimate the theoretical limit of current capabilities. Our current best bet is a chain (council) of LLMs specialised in prompting; a rough sketch follows after this list.
Optional track: building a map of which alignment agendas contribute to preventing an SLT.
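For illustration only, here is a hedged sketch of what the "council of LLMs specialised in prompting" mentioned in the longshot item might look like. The function call_llm, the role prompts, and the three-stage structure are assumptions made for the example, not a committed design.

```python
# Rough sketch of a "council" elicitation pipeline: several prompt-specialised
# calls cooperate to build a highly specialised prompt before a solver answers.
# `call_llm` is a placeholder for a real LLM API; the roles are illustrative.

def call_llm(system: str, user: str) -> str:
    """Placeholder. Swap in a real chat-completion call here."""
    return f"[{system[:24]}...] refined: {user[:60]}..."

COUNCIL = [
    "You decompose the task into sub-problems and list them explicitly.",
    "You draft a highly specialised prompt for this exact task, using the decomposition.",
    "You critique the drafted prompt and rewrite it to remove ambiguity.",
]

def elicit(task: str) -> str:
    """Pass the task through the council, then hand the final prompt to a solver."""
    context = task
    for role in COUNCIL:          # each specialist refines the working context
        context = call_llm(system=role, user=context)
    return call_llm(system="You are the solver. Answer the task precisely.", user=context)

print(elicit("Prove that the sum of two even integers is even."))
```

Comparing the solver's score with and without the council on the same benchmark would give one data point for the error-margin model described in the previous stage.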
Backup plan
We have compiled a fairly flexible list of possible approaches to the problem and expect to shift between them when necessary.
Output
The desired shape of the result is a private report, shareable only with trusted researchers and labs.
Minimum Viable Product
The goal is to make a serious attempt at beating SOTA capability results using narrower, highly specialised prompts and scaffolds.
A significant number of such successes would effectively mean that some (many?) current evaluations systematically underestimate the real capabilities of LLMs.
Risks and downsides
Developing new prompting methods may contribute to progress in AI capabilities.
Acknowledgements
This research proposal has been developed in close collaboration with Iulia Levin. Iulia’s contribution has been invaluable. We will continue working on this project together. From the perspective of AISC she is an external member of the team.
Team
Research Lead. Anton Zheltoukhov
~9 years of LW exposure =)
Finished AISC (Positive Attractors team led by Robert Kralisch),
Finished ARENA.
On and off working on a personal conceptual blue-sky-like agenda called Narrative Theory. It has been partially published on LW.
Have 6 years in tech as dev/qa under my belt.
Time commitment: 15-20 hours per week
Roles and skill requirements
Prompt engineer
The main goal for this role is to explore various prompting techniques, develop new ones, and analyse the observations.
Coding experience is a must. Formal ML experience would be great but it is not a deal breaker.
Candidates need a good understanding of how transformers work and familiarity with prompting techniques (e.g. CoT, …).
Interpretability engineer
The main goal for this role is the same as for the Prompt engineer, but the focus is on “invasive” elicitation methods (e.g. activation steering, …).
On top of the Prompt engineer requirements, mech interp experience is required.
Conceptual researcher
The main goal for this role differs from the former ones: it is to try to deconfuse the SLT and develop a mechanistic model of it.
Requirements: strong conceptual thinking and general research skills (preferably in ML), a strong security mindset, and familiarity with the threat model landscape.
Team size
2-4 Prompt engineers
1-3 Interpretability engineers
1-2 Conceptual researchers
Reference set
SLT
A central AI alignment problem: capabilities generalization, and the sharp left turn
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Evals
A Survey on Evaluation of Large Language Models
How evals might (or might not) prevent catastrophic risks from AI
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Discovering Language Model Behaviors with Model-Written Evaluations
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
https://github.com/Hannibal046/Awesome-LLM?tab=readme-ov-file#llm-evaluation
Elicitation methods
The Prompt Report: A Systematic Survey of Prompting Techniques
https://www.lesswrong.com/tag/activation-engineering
https://github.com/snwfdhmp/awesome-gpt-prompt-engineering
One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts
Connecting large language models with evolutionary algorithms yields powerful prompt optimizers
Overhang
The Agency Overhang
https://ai-improving-ai.safe.ai/
Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task