Thanks for writing this up! I’ve found this frame to be a really useful way of thinking about GPT-like models since first discussing it.
In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the ‘other methods’ section of ‘Novel methods of process/agent specification’). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI’s current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.
It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below:
Something that seems relatively benign/unexciting—fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn’t really change which simulacra are accessible to the simulator.
More worrying things—particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
Things which could go either way and seem very high stakes—the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
Figuring out and posting about how RLHF and other methods ([online] decision transformer, IDA, rejection sampling, etc) modify the nature of simulators is very high priority. There’s an ongoing research project at Conjecture specifically about this, which is the main reason I didn’t emphasize it as a future topic in this sequence. Hopefully we’ll put out a post about our preliminary theoretical and empirical findings soon.
Some interesting threads:
The post RL with KL penalties better seen as Bayesian inference shows that the optimal policy when you train a GPT with RL under a KL penalty weighted by 1 is actually equivalent to conditioning the policy on a criterion estimated by the reward model, which is compatible with the simulator formalism.
However, this doesn’t happen in current practice, because:
1. Both OAI and Anthropic use very small KL penalties (e.g. weighted by 0.001 in Anthropic’s paper, which in the Bayesian inference framework means updating on the “evidence” 1000 times; see the sketch below), or maybe none at all.
2. Early stopping: the RL training does not converge to anything near optimality. Path dependence, distribution shift, and inductive biases during RL training seem likely to play a major role in the shape of the posterior policy.
We see empirically that RLHF models (like OAI’s instruct tuned models) do not behave like the original policy conditioned on a natural criterion (e.g. they often become almost deterministic).
Maybe there is a way to do RLHF while preserving the simulator nature of the policy, but the way OAI/Anthropic are doing it now does not, imo
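To make the first thread above concrete, here is a minimal sketch of the result the linked post proves (my notation: $\pi_0$ is the pre-trained policy, $r$ the reward model, $\beta$ the KL coefficient):
$$\pi^* \;=\; \arg\max_\pi \;\mathbb{E}_{x\sim\pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0) \quad\Longrightarrow\quad \pi^*(x) \;\propto\; \pi_0(x)\, e^{r(x)/\beta}.$$
With $\beta = 1$ this is a single Bayesian update of $\pi_0$ on the “evidence” $e^{r(x)}$, a soft form of conditioning on the criterion the reward model estimates; with $\beta = 0.001$ it is $\pi_0(x)\,\bigl(e^{r(x)}\bigr)^{1000}$, i.e. the same update applied 1000 times.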
Haven’t yet had a chance to read the article, but from verbal conversations I’d guess they’d endorse something similar (though probably not every word) to Thomas Larsen’s opinion on this in Footnote 5 in this post:
Answer: I see a categorical distinction between trying to align agentic and oracle AIs. Conjecture is trying only for oracle LLMs, trained without any RL pressure giving them goals, which seems way safer. OpenAI doing recursive reward modeling / IDA type schemes involves creating agentic AGIs and therefore also faces a lot more alignment issues like convergent instrumental goals, power seeking, goodharting, inner alignment failure, etc.
I think inner alignment can be a problem with LLMs trained purely in a self-supervised fashion (e.g., simulacra becoming aware of their surroundings), but I anticipate it to only be a problem with further capabilities. I think RL trained GPT-6 is a lot more likely to be an x-risk than GPT-6 trained only to do text prediction.
Yeah this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you’re very against pushing the RL SoTA.
My understanding is that they have very short (by my lights) timelines which recently updated them toward pushing much more toward just trying to automate alignment research rather than thinking about the theory.
Our plan to accelerate alignment does not preclude theoretical thinking, but rather requires it. The mainline agenda atm is not full automation (which I expect to be both more dangerous and less useful in the short term), but what I’ve been calling “cyborgism”: I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought processes and workflows that produce useful alignment ideas. And the idea is, in part, to amplify the human. If this works, I should be able to do a lot more “thinking about theory” than I am now.
How control/amplification schemes like RLHF might corrupt the nature of simulators is particularly relevant to think about. OAI’s vision of accelerating alignment, for instance, almost certainly relies on RLHF. My guess is that self-supervised learning will be safer and more effective. Even aside from alignment concerns, RLHF instruct tuning makes GPT models worse for the kind of cyborgism I want to do (e.g. it causes mode collapse & cripples semantic generalization, and I want to explore multiverses and steer using arbitrary natural language boundary conditions, not just literal instructions) (although I suspect these are consequences of a more general class of tuning methods than just RLHF, which is one of the things I’d like to understand better).
I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought processes and workflows that produce useful alignment ideas.
What are your thoughts on failure modes with this approach? (please let me know if any/all of the following seems confused/vanishingly unlikely)
For example, one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions.
Suppose that it makes things 10x faster in various directions that look promising but don’t lead to solutions, and only 2x faster in directions that do lead to solutions. In principle this should be very helpful: we can allocate fewer resources to the 10x directions, leaving us more time to work on the 2x directions, and everybody wins. In practice, I’d expect the 10x boost to:
Produce unhelpful incentives for alignment researchers: work on any of the 10x directions and you’ll look hugely more productive. Who will choose to work on the harder directions?
Note that it won’t be obvious you’re going slowly because the direction is inherently harder: from the outside, heading in a difficult direction will be hard to distinguish from being ineffective (from the inside too, in fact).
Same reasoning applies at every level of granularity: sub-direction choice, sub-sub-direction choice....
Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it’ll be difficult not to interpret this as evidence they’re more promising.
Amplified assessment-of-promise seems likely to correlate unhelpfully: failing to help us notice promising directions precisely where it’s least able to help us make progress.
It still seems positive-in-expectation if the boost of cyborgism isn’t negatively correlated with the ground-truth usefulness of a direction—but a negative correlation here seems plausible.
Suppose that finding the truly useful directions requires patterns of thought that are rare-to-non-existent in the training set, and are hard to instill via instruction. In that case it seems likely to me that GPT will be consistently less effective in these directions (to generate these ideas / to take these steps...). Then we may be in terrible-incentive-land. [I’m not claiming that most steps in hard directions will be hard, but that speed of progress asymptotes to progress-per-hard-step]
Of course all this is hand-waving speculation. I’d just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.
So e.g. negative impact through:
Boosting capabilities research.
Creation of undesirable incentives in alignment research.
Warping assessment of research directions.
[other stuff I haven’t thought of]
Do you know of any existing discussion along these lines?
Thanks a lot for this comment. These are extremely valid concerns that we’ve been thinking about a lot.
I’d just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.
I don’t think this is feasible given our current understanding of epistemology in general and epistemology of alignment research in particular. The problems you listed are potential problems with any methodology, not just AI assisted research. Being able to look at a proposed method and make clear arguments that it’s unlikely to have any undesirable incentives or negative second order effects, etc, is the holy grail of applied epistemology and one of the cores of the alignment problem.
For now, the best we can do is be aware of these concerns, work to improve our understanding of the underlying epistemological problem, design the tools and methods in a way that avoids problems (or at least make them likely to be noticed) according to our current best understanding, and actively address them in the process.
On a high level, it seems wise to me to follow these principles:
Approach this as an epistemology problem
Optimize for augmenting human cognition rather than outsourcing cognitive labor or producing good-looking outputs
Short feedback loops and high bandwidth (both between human<>AI and tool users<>tool designers)
Avoid incentivizing the AI components to goodhart against human evaluation
Avoid producing/releasing infohazards
All of these are hard problems. I could write many pages about each of them, and hopefully will at some point, but for now I’ll only address them briefly in relation to your comment.
1. Approach this as an epistemology problem
We don’t know how to evaluate whether a process is going to be robustly truth-seeking (or {whatever you really want}-seeking). Any measure will be a proxy susceptible to goodhart.
one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions.
Suppose that it makes things 10x faster in various directions that look promising but don’t lead to solutions, and only 2x faster in directions that do lead to solutions.
This is a concern for any method, including things like “post your work frequently and get a lot of feedback” or “try to formalize stuff”
Introducing AI into it just makes the problem much more explicit and pressing (because of the removal of the “protected meta level”).
I intend to work closely with the Conjecture epistemology/methodologies team in this project. After all, this is kinda the ultimate challenge for epistemology: as the saying goes, you don’t understand something until you can build it.
We need to better understand things like:
What are the current bottlenecks on human cognition and more specifically alignment research, and can/do these tools actually help remove them?
Is thinking about “bottlenecks” the right abstraction, especially if there’s a potential to completely transform the workflow instead of just unblocking what we currently recognize as bottlenecks?
What do processes that generate good ideas/solutions look like in practice?
What do the examples we have access to tell us about the underlying mechanisms of effective processes?
To what extent are productive processes legible? Can we make them more legible, and what are the costs/benefits of doing so? How do we avoid goodharting against legibility when it’s incentivized (AI assisted research is one such situation)?
How can you evaluate if an idea is actually good, and doesn’t just “look” good?
What are the different ways an idea can “look” good and how can each of these be operationalized or fail? (e.g. “feels” meaningful/makes you feel less confused, experts in the field think it’s good, LW karma, can be formalized/mathematically verified, can be/is experimentally verified, other processes independently arrive at same idea, inspires more ideas, leads to useful applications, “big if true”, etc)
How can we avoid applying too much optimization pressure to things that “look” good considering that we ultimately only have access to how things “look” to us (or some externalized measure)?
How do asymmetric capabilities affect all this? As you said, AI will amplify cognition more effectively in some ways than others.
Humans already have asymmetric capabilities as well (though it’s unclear what “symmetry” would mean...). How does this affect how we currently do research?
How do we leverage asymmetric capabilities without over-relying on them?
How can we tell whether capabilities are intrinsically asymmetric or are just asymmetrically bottlenecked by how we’re trying to use them?
Dual to the concerns re asymmetrical capabilities: What kind of truth-seeking processes can AI enable which are outside the scope of how humans currently do research due to cognitive limitations?
Being explicitly aware of these considerations is the first step. For instance, with regards to the concern about perception of progress due to “speed”:
Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it’ll be difficult not to interpret this as evidence they’re more promising.
Obviously you can write much faster and with superficial fluency with an AI assistant, so we need to adjust our evaluation of output in light of that fact.
2. Optimize for augmenting human cognition rather than outsourcing cognitive labor or producing good-looking outputs
This 2017 article, Using Artificial Intelligence to Augment Human Intelligence, describes a perspective that I share:
One common conception of computers is that they’re problem-solving machines: “computer, what is the result of firing this artillery shell in such-and-such a wind [and so on]?”; “computer, what will the maximum temperature in Tokyo be in 5 days?”; “computer, what is the best move to take when the Go board is in this position?”; “computer, how should this image be classified?”; and so on.
This is a conception common to both the early view of computers as number-crunchers, and also in much work on AI, both historically and today. It’s a model of a computer as a way of outsourcing cognition. In speculative depictions of possible future AI, this cognitive outsourcing model often shows up in the view of an AI as an oracle, able to solve some large class of problems with better-than-human performance.
But a very different conception of what computers are for is possible, a conception much more congruent with work on intelligence augmentation.
...
It’s this kind of cognitive transformation model which underlies much of the deepest work on intelligence augmentation. Rather than outsourcing cognition, it’s about changing the operations and representations we use to think; it’s about changing the substrate of thought itself. And so while cognitive outsourcing is important, this cognitive transformation view offers a much more profound model of intelligence augmentation. It’s a view in which computers are a means to change and expand human thought itself.
I think the cognitive transformation approach is more promising from an epistemological standpoint because the point is to give the humans an inside view of the process by weaving the cognitive operations enabled by the AI into the user’s thinking, rather than just producing good-seeming artifacts. In other words, we want to amplify the human’s generator, not just rely on human evaluation of an external generation process.
This does not solve the goodhart problem (you might feel like the AI is improving your cognition without actually being productive), but it enables a form of “supervision” that is closer to the substrate of cognition and thus gives the human more intimate insight into whether and why things are working or not.
I also expect the cognitive transformation model to be significantly more effective in the near future. But as AIs become more capable it will be more tempting to increase the length of feedback loops & supervise outcomes instead of process. Hopefully building tools and gaining hands-on experience now will give us more leverage to continue using AI as cognitive augmentation rather than just outsourcing cognition once the latter becomes “easier”.
It occurs to me that I’ve just reiterated the argument for process supervision over outcome supervision:
In the short term, process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research.
In the long term, process-based ML systems help avoid catastrophic outcomes from systems gaming outcome measures and are thus more aligned.
Both process- and outcome-based evaluation are attractors to varying degrees: Once an architecture is entrenched, it’s hard to move away from it. This lock-in applies much more to outcome-based systems.
Whether the most powerful ML systems will primarily be process-based or outcome-based is up in the air.
So it’s crucial to push toward process-based training now.
A major part of the work here will be designing interfaces which surface the “cognitive primitives” as control levers and make high bandwidth interaction & feedback possible.
Slightly more concretely, GPTs are conditional probability distributions one can control by programming boundary conditions (“prompting”), searching through stochastic ramifications (“curation”), and perhaps also manipulating latents (see this awesome blog post Imagining better interfaces to language models). The probabilistic simulator (or speculator) itself and each of these control methods, I think, have close analogues to how we operate our own minds, and thus I think it’s possible with the right interface to “stitch” the model to our minds in a way that acts as a controllable extension of thought. This is a very different approach to “making GPT useful” than, say, InstructGPT, and it’s why I call it cyborgism.
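To make the “prompting + curation” levers concrete, here is a minimal sketch (my own toy example, not Conjecture’s tooling) using the Hugging Face transformers API, with gpt2 standing in for a larger model: the model proposes several stochastic continuations and the human chooses which branch to extend.
```python
# Minimal prompt -> branch -> curate loop: the model generates several
# stochastic continuations ("ramifications") and a human selects which
# branch of the multiverse to extend. Parameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The key difficulty with aligning a simulator is"  # boundary condition ("prompt")
for _ in range(3):                                         # a few curation rounds
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                  # stochastic ramifications
        num_return_sequences=4,          # branching factor
        max_new_tokens=40,
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    continuations = [
        tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
        for out in outputs
    ]
    for i, c in enumerate(continuations):
        print(f"[{i}] ...{c}")
    choice = int(input("Pick a branch to extend: "))       # human curation step
    text += continuations[choice]

print("\nCurated trajectory:\n" + text)
```
The interface work mentioned above is largely about turning this kind of print-and-pick loop into something much higher bandwidth.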
3. Short feedback loops and high bandwidth (both between human<>AI and tool users<>tool designers)
Short feedback loops and high bandwidth between the human and AI are integral to the cognitive augmentation perspective: you want as much of the mission-relevant information to be passing through (and understood by) the human user as possible. Not only is this more helpful to the human, it gives them opportunities to notice problems and course-correct at the process level which may not be transparent at all in more oracle or genie-like approaches.
For similar reasons, we want short feedback loops between the users and designers/engineers of the tools (ideally the user designs the tool—needless to say, I will be among the first of the cyborgs I make). We want to be able to inspect the process on a meta level and notice and address problems like goodhart or mode collapse as soon as possible.
4. Avoid incentivizing the AI components to goodhart against human evaluation
This is obvious but hard to avoid, because we do want to improve the system and human evaluation is the main source of feedback we have. But I think there are concrete ways to avoid the worst here, like being very explicit about where and how much optimization pressure is being applied and avoiding methods which extrapolate proxies of human evaluation with unbounded atomic optimization.
There are various reasons I plan to avoid RLHF (except for purposes of comparison); this is one of them. This is not to say other methods that leverage human feedback are immune to goodhart, but RLHF is particularly risky because you’re creating a model (proxy) of human evaluation of outcomes and optimizing against it (the ability to apply unbounded optimization against the reward model is the reason to make one in the first place rather than training against human judgements directly).
I’m more interested in approaches that interactively prototype effective processes & use them as supervised examples to augment the model’s prior: scouting the space of processes rather than optimizing a fixed measure of what a good outcome looks like. Of course, we must still rely on human judgment to say what a good process is (at various levels of granularity, e.g. curation of AI responses and meta-selection of approaches based on perceived effectiveness), so we still need to be wary of goodhart. But I think avoiding direct optimization pressure toward outcome evaluations can go a long way. Supervise Process, not Outcomes contains more in-depth reasoning on this point.
That said, it’s important to emphasize that this is not a proposal to solve alignment, but the much easier (though still hard) problem of shaping an AI system to augment alignment research before foom. I don’t expect these methods to scale to aligning a superintelligent AI; I expect conceptual breakthroughs will be necessary for that and iterative approaches alone will fail. The motivation for this project is my belief that AI augmentation can put us in a better position to make those conceptual breakthroughs.
5. Avoid producing/releasing infohazards
I won’t say too much about this now, but anything that we identify to present a risk of accelerating capabilities will be covered under Conjecture’s infohazard policy.
I want to talk about why automation is likely more dangerous and more useful than cyborgization, and the reason is Amdahl’s law.
In other words, the slowest process controls the outcome, and at very high levels, the human is likely to be the biggest bottleneck, since we aren’t special here.
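For reference, Amdahl’s law (a standard formula, added here for concreteness): if a fraction $p$ of the overall research process is accelerated by a factor $s$ and the remaining $1-p$ (here, the human) is not, the total speedup is
$$S = \frac{1}{(1-p) + p/s} \;\le\; \frac{1}{1-p},$$
so no matter how large $s$ gets, the end-to-end speedup is capped by the share of the work still routed through the slowest component.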
Furthermore, I think that most interesting problems are in the NP complexity class, assuming no deceptive alignment has happened. If that’s true, then non-adversarial goodhart is not a severe problem even with extreme capabilities: finding a solution might be super hard (and if P ≠ NP, which is likely but not proven, finding really is harder than checking), but once you have a candidate solution you can efficiently verify whether it actually works. A toy example of this verify-vs-find asymmetry is sketched below.
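A minimal illustration of that asymmetry (a toy example of mine, not a proposal): checking a candidate solution to an NP problem such as graph coloring is cheap, even though finding one is NP-hard in general.
```python
# Verifying a proposed 3-coloring takes time linear in the number of edges,
# even though finding a valid 3-coloring is NP-hard for general graphs.
def is_valid_coloring(edges, coloring, num_colors=3):
    """True iff every vertex gets one of `num_colors` colors and no edge
    joins two vertices of the same color."""
    if any(c not in range(num_colors) for c in coloring.values()):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)

# A 4-cycle: alternating two colors is valid; coloring neighbors alike is not.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_valid_coloring(edges, {0: 0, 1: 1, 2: 0, 3: 1}))  # True
print(is_valid_coloring(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # False
```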
This seems like a valid concern. It seems to apply to other directions in alignment research as well. Any approach can make progress in some directions seem easier, even when those directions ultimately turn out to be dead ends.
Based on that logic, it would seem that having more different approaches should serve as a sort of counterbalance. As we make judgment calls about ease of progress vs. ultimate usefulness, having more options would seem likely to provide better progress in useful directions.
Thanks for clarifying your views; makes sense that there isn’t a clean distinction between accelerating alignment and theoretical thinking.
I do think there is a distinction between doing theoretical thinking that might be a prerequisite to safely accelerate alignment research substantially, and directly accelerating theoretical alignment. I thought you had updated between these two, toward the second; do you disagree with that?