Caveat that I have only skimmed this. I don’t think your arguments engage with any of my cruxes or provide detailed enough arguments on the key questions for me; perhaps I’m not the main audience. My reactions:
Section 1: Useful tasks, like novel science, are hard—sure
Section 2: Being capable of hard tasks implies approximate consequentialism—I’m not very confident in this, but happy to go with it
Section 3: Hard tasks require learning new things—agree that learning will be important
Section 4: Behavioral training is an imprecise way to specify goals—most of the sub-arguments have me going “shrug, maybe”. I definitely don’t feel confident that any of the mentioned problems will arise in practice. I don’t see why I should believe in an inner/outer shell breakdown of constraints — this section seemed quite speculative.
Section 5: Control of misaligned AIs is difficult—This is probably where I disagree the most. Most of my comments are reiterations of arguments made in Redwood’s recent post, because I think their arguments are stronger than the ones you make here. The main other source for my thinking here is the capability elicitation stuff (1, 2, 3).
“Misaligned goal-directed AI won’t cooperate with humans” Well, what the hell else is it going to do? Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff? Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is, does not know if other AIs are helping the humans (specifically, have leaked higher capability data that is being used for SFT). [Additionally, I think it’s plausible that the correct strategy for such an AI is to just actually help the humans and try to make peaceful deals. Not very confident, but there’s definitely some other solution space here.] I think the claim in this subsection title is possible but unlikely.
“There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities.” Well, no. You either need sufficient labeled data in the domain where we’re trying to elicit capabilities, or good generalization from other domains. The data need not be human generated, it could be generated by other AIs, as long as we trust it.
“It only applies to relatively easy to check research results” seems possible, and worrisome. This is the subsection I think is most right and important in section 5. I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training.
“In order for the AI to do useful research, it needs to pursue the goal (or sub-goal) of doing research. There’s no particular reason for the gradient based training to actually modify the goal representation, rather than building some additional narrow machinery on top.” I agree that we don’t have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that’s not totally asymmetric, it also gives us hope about our ability to modify AI goals.
“It may be hard to efficiently elicit capabilities which require additional (perhaps chain-of-thought) reasoning steps.” I agree that capability elicitation seems non-trivial in some cases, but I’m reasonably optimistic that future work will improve the situation.
Many of the issues in this section are things that, if we’re not being total idiots, it seems we’ll get substantial warning about. e.g., AIs colluding with their AI monitors. That’s definitely a positive, though far from conclusive.
Section 6: Powerful misaligned AI would be bad—I think a couple of the arguments here continue to be legitimate, such as “Unclear that many goals realistically incentivise taking over the universe”, but I’m overall fine accepting this section.
Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn’t really engage with those arguments in a way that sways me much.
Thanks for reading it, it’s good to know exactly where you think the argument is weakest and I appreciate the effort of going through and noting differences.
On section 4:
I definitely don’t feel confident that any of the mentioned problems will arise in practice. I don’t see why I should believe in an inner/outer shell breakdown of constraints — this section seemed quite speculative.
This surprises me, actually; I thought this section was solid conditional on the previous assumptions. I think you shouldn’t think of them as problems that might arise in practice; instead, they should be thought of as reasons why behavioral training under-specifies some parts of the AI algorithm. I’ll give some more quick examples of “outer-shell constraints”, because examples are very common and it seems odd for this to be called speculative: unendorsed habits or addictions in humans; the instinct to breathe in when you’re trying to hold your breath; appending a checker algorithm C to any algorithm A, to double-check the output of A; any biases in a search ordering (as in the AlphaGo example).
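To make the checker example concrete, here’s a toy sketch (purely illustrative; the functions are made up): the constraint lives entirely in an outer wrapper, and nothing about the inner algorithm’s goal representation has to change for the constraint to bind.

```python
# Toy illustration of an "outer-shell constraint" (hypothetical functions):
# the inner algorithm pursues its goal; the outer checker filters its outputs
# without touching the inner goal representation at all.

def solve(problem):
    """Inner algorithm: proposes candidate answers (stand-in for a trained policy)."""
    return [problem * 2 + 1, problem * 2]  # toy candidates

def verify(problem, answer):
    """Outer-shell check applied after the fact: here, only even answers are allowed."""
    return answer % 2 == 0

def constrained_solve(problem):
    # The constraint lives entirely in this outer loop; `solve` is unchanged.
    for candidate in solve(problem):
        if verify(problem, candidate):
            return candidate
    return None  # refuse rather than violate the constraint

print(constrained_solve(3))  # -> 6
```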
On section 5: It’s on my todo list to write a comment responding to some of the specifics of Redwood’s control post. This post wasn’t intended to engage with their detailed arguments, just to give the top-level reasons why we believe control is difficult.
Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?
Yes, approximately, as I believe you and I are capable of doing. The crux here might be “if we had 20 PhD students, and incentivized them both to look like they are doing checkable research and not to make any real useful progress on solving a particular difficult problem, would this be very useful without costing a lot of expert supervision?”. If I remember correctly, this was a crux between myself and Ryan during an in-person discussion. It seems likely to me that the PhD students, trying not to be useful while looking good in checkable ways, will succeed at both. More intelligent agents are even more likely to succeed at both. It does depend on the type of research; I agree that formal theorem-proving would be much easier to incentivize properly. I’m talking about the least-easily-checkable 20% of useful research.
Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is
Seems easy enough to predict given roughly human-scientist-team level of capabilities. And it seems hard to get useful work out of a human-scientist-team without them also working out ways to (maybe imperfectly) distinguish “faked sensory inputs” from real ones, and without leaking any information about what time it is. E.g., we need the AI to build on previous work.
“There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities.” Well, no. You either need sufficient labeled data in the domain where we’re trying to elicit capabilities, or good generalization from other domains.
Nice, this is my other crux with Ryan, which we phrased roughly as “how data efficient can we make training”. My argument against very high data efficiency is that current long-horizon RL isn’t very data-efficient, and intentional exploration hacking seems plausible in this situation. (A bad analogy, but for the sake of describing exploration hacking: someone is training you to solve Rubik’s cubes; you can avoid ever succeeding at solving one in the first place, and use other strategies to mess with their reward signal. This works as long as the trainers don’t know the specific steps required to solve the cube, but even then I think there are some strategies that could work to reduce generalization.)
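To spell out the Rubik’s cube analogy in RL terms, here’s a toy sketch (hypothetical setup, not something from the post): if the policy essentially never samples the rewarded behaviour, an on-policy reward signal gives gradient descent nothing to reinforce.

```python
# Toy sketch of exploration hacking: a policy that (almost) never emits the
# rewarded behaviour gives a REINFORCE-style trainer nothing to reinforce,
# so training cannot elicit the capability unless the trainers can supply
# correct demonstrations themselves (e.g. they know the cube-solving steps).
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([-6.0, 6.0])  # the agent enters training already biased toward "stall"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(1000):
    p = softmax(theta)
    action = rng.choice(2, p=p)      # 0 = genuinely try to solve, 1 = stall
    r = 1.0 if action == 0 else 0.0  # trainer can only reward observed success
    theta += 0.1 * r * (np.eye(2)[action] - p)  # zero reward => zero update

print(softmax(theta))  # still ~[0, 1]: nothing ever pushed the policy toward "try"
```

Supervised fine-tuning on known solution steps sidesteps this, which is why the “trainers don’t know the specific steps” condition matters.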
The data need not be human generated, it could be generated by other AIs, as long as we trust it.
?? This seems to be assuming a solution to the problem.
I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training.
I agree there are numerous directions of hope, but disagree that either of your examples count, if I understand them correctly. “leaning more on generalization” ≈ “leaning more on agency”, which doesn’t seem good to me. Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn’t seem like it would help with any problem that we were focused on in this post.
I agree that we don’t have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that’s not totally asymmetric, it also gives us hope about our ability to modify AI goals.
If we are using okay interpretability tools to understand whether the AI has the goal we intended, and to guide training, then I would consider that a fundamental advance over current standard training techniques. I agree that goals would very likely be hit by some modifications during training, in combination with other changes to other parts of the algorithm. The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.
Many of the issues in this section are things that, if we’re not being total idiots, it seems we’ll get substantial warning about. e.g., AIs colluding with their AI monitors. That’s definitely a positive, though far from conclusive.
I think that there is a lot of room for the evidence to be ambiguous and controversial, and for the obvious problems to look patchable. For this reason I’ve only got a little hope that people will panic at the last minute due to finally seeing the problems and start trying to solve exactly the right problems. On top of this, there’s the pressure of needing to “extract useful work to solve alignment” before someone less cautious builds an unaligned super-intelligence, which could easily lead to people seeing substantial warnings and pressing onward anyway.
Section 6:
I think a couple of the arguments here continue to be legitimate, such as “Unclear that many goals realistically incentivise taking over the universe”, but I’m overall fine accepting this section.
That argument isn’t really what it says on the tin, it’s saying something closer to “maybe taking over the universe is hard/unlikely and other strategies are better for achieving most goals under realistic conditions”. I buy this for many environments and levels of power, but it’s obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that’s the sort of AI we get if it can undergo self-improvement.
Overall I think your comment is somewhat representative of what I see as the dominant cluster of views currently in the alignment community. (Which seems like a very reasonable set of beliefs and I don’t think you’re unreasonable for having them).
Thanks for your response!
I’ll think more about the outer shell stuff; it’s possible that my objection actually arises with the consequentialist assumption, but I’m not sure.
It’s on my todo list to write a comment responding to some of the specifics of Redwood’s control post.
I would be excited to read this / help with a draft.
Yes, approximately, as I believe you and I are capable of doing.
The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It’s pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.
Seems easy enough to predict given roughly human-scientist-team level of capabilities.
One situation I’m imagining here is that we’ve (explicitly) trained our AI on scientific output from January to March, and we’re now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they’re not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard. I think this example is closer to what we’ll get with AIs: rather than the bar being at “human level”, it’s more like “what has been accomplished by the humans+AIs in the outside world, for which I’ve received untrustworthy information”. And every time the AIs do useful work we can add this to our supervised training corpus.
> The data need not be human generated, it could be generated by other AIs, as long as we trust it.
?? This seems to be assuming a solution to the problem.
One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other ‘scalable oversight’), and then you use this research as supervised / labeled data for training GPT-6 to do good work.
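Concretely, the loop I have in mind is something like the sketch below (all names are hypothetical placeholders, not real APIs):

```python
# Rough sketch of the bootstrapping pipeline described above: an earlier,
# possibly untrustworthy model generates research, humans plus oversight tools
# filter it, and only the vetted outputs become supervised data for the next model.

def generate_candidates(model, tasks):
    return [(task, model(task)) for task in tasks]

def passes_oversight(task, output, checkers):
    # e.g. human review assisted by debate / critique models; any failed check rejects
    return all(check(task, output) for check in checkers)

def build_sft_corpus(weak_model, tasks, checkers):
    corpus = []
    for task, output in generate_candidates(weak_model, tasks):
        if passes_oversight(task, output, checkers):
            corpus.append({"prompt": task, "completion": output})
    return corpus

# Hypothetical usage:
# corpus = build_sft_corpus(gpt5, research_tasks, [human_review, debate_check])
# gpt6 = finetune(gpt6_base, corpus)
```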
Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn’t seem like it would help with any problem that we were focused on in this post.
Here’s a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what activations mean. Take this recent paper (which I think is super confused but helps get at the thing): “We find a strong positive relationship between the LLM’s attention to constraint tokens and the factual accuracy of generations”. My interpretation is that they found a probably-causal link between ‘the model is attending to relevant tokens’ and ‘the model gets the answer correct’, because duh, it’s like how when you give somebody a riddle you can tell whether they’re on track based on what parts of the question they’re focused on. While it’s probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it’s probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., get the model to solve tasks it otherwise would have intentionally failed at, or give us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start.
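As a rough sketch of what I mean mechanically (assumed shapes and names, not a worked-out recipe): fit a probe for the feature you care about on labeled cases, then let it contribute a term to the training loss, so the optimization pressure acts on the internal representation and not only on behaviour.

```python
# Hedged sketch: add a probe-based penalty on hidden activations to the usual
# task loss. The probe is assumed to (imperfectly) detect some unwanted
# internal feature, e.g. "this input looks like training/evaluation".
import torch
import torch.nn.functional as F

def probe_logit(activations, probe_w, probe_b):
    # activations: [batch, hidden]; a linear probe fit beforehand on labeled cases
    return activations @ probe_w + probe_b

def training_loss(task_logits, labels, activations, probe_w, probe_b, alpha=0.1):
    task_loss = F.cross_entropy(task_logits, labels)
    # discourage the probed feature from firing during training
    feature_penalty = F.softplus(probe_logit(activations, probe_w, probe_b)).mean()
    return task_loss + alpha * feature_penalty
```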
“leaning more on generalization” ≈ “leaning more on agency”,
That’s not what I meant to say. By leaning more on generalization, I mean leaning more on the data efficiency thing, and also weak-to-strong generalization ideas. I think this is different from “leaning more on agency”.
The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.
I’m not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is “Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans”. And if we could change this AI to have the goal “Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans”, that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of ‘supervision’, thus making it easier to elicit good work from AIs (ideally not just ‘looks good’).
I think that there is a lot of room for the evidence to be ambiguous and controversial
I agree that this is a major potential problem which I think is pretty likely to occur; maybe I’ll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it’s also true that we’ll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk.
I buy this for many environments and levels of power, but it’s obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that’s the sort of AI we get if it can undergo self-improvement.
I don’t feel like I’m talking about AIs which have “taking-over-the-universe in their easily-within-reach options”. I think this is not within reach of the current employees of AGI labs, and the AIs I’m thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes).
I would be excited to read this / help with a draft.
We can meet in person one afternoon and work out some cruxes and write them up?
Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?
Yes, approximately, as I believe you and I are capable of doing. [...PhD student crux]
The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It’s pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.
Are you saying you buy my side of the PhD student crux, but that it isn’t closely analogous because you think we have more fine-grained control over the thoughts or goals of the AI? And in your second sentence my guess is that you are saying a second reason you disagree is that it looks difficult for a research AI to know when to perform well and when to perform less well? I think we might be miscommunicating about this because it doesn’t look like the sort of situation where the “deception” is a relatively difficult task.
I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they’re not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard.
I agree with this. But the research we need isn’t just past the frontier. At some point not that far past the frontier, you get to a situation where the research team can say “given our guesses about mainland researchers and their time constraints, we’d put <1% on them having made discovery X {or explored idea Y, or done experiment Z}. Let’s {delay giving it to them, not work on that direction, insert a subtle false assumption, instead share our 8th best idea, just say we’re stuck, communicate it poorly}”.
I think this example is closer to what we’ll get with AIs: rather than the bar being at “human level”, it’s more like “what has been accomplished by the humans+AIs in the outside world, for which I’ve received untrustworthy information”.
Sure, kinda, but the important part here is that a trapped agentic AI can selectively mislead you about the most important {knowledge, tasks, strategies} that you aren’t capable of doing with just humans. If you’re capable of doing it with humans, or if you’re close to being able to do it with humans, then why not just do that? Seems safer.
And every time the AIs do useful work we can add this to our supervised training corpus.
Only if you spend the human resources to verify it. Are you aiming for 30x research acceleration? Because this by itself might be enough to limit you below that level. (And it also directly provides a way for the AI to tell “what time it is” by sending messages to future iterations, and plausibly also provides a bunch of information about human research capability and capacity.)
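(Illustrative arithmetic, with made-up numbers: if verifying a unit of AI-produced research costs humans a fraction f of the effort it would have taken to produce that unit themselves, then the acceleration you get from fully verified output is capped at roughly 1/f. Even f = 0.1 caps you near 10x, short of 30x, before counting any other human bottlenecks.)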
One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other ‘scalable oversight’), and then you use this research as supervised / labeled data for training GPT-6 to do good work.
Mostly my objection to this is above. On scalable oversight, I buy that it might be possible to get 20% improvements in the accuracy of your evaluations of (adversarially generated!) research, compared to a team of humans trying really hard to evaluate research, given a human-level research AI. This is so far from being enough. Are you predicting much better scalable oversight that scales much further than I am predicting it will? (I’m assuming you’re using “GPT-5/6” to represent “some powerful future consequentialist AI that can do research”, because otherwise we should be having a different argument).
While it’s probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.
It’d be cool if there were a demonstration of manipulating internal activations during training, in a way that fixes non-trivial unwanted OOD behavior. If such techniques existed, and also worked on consequentialist agents (which I expect would be harder), I would consider them very useful tools that would help with building an aligned agent.
While it’s probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.
This is very plausible, but I think you’re stretching the analogy between LMs and future research AIs way too far. “Paying attention to the tokens we want them to” doesn’t add much/any control over the properties we want control over (compared to e.g. changing the input string to remove information we don’t want it paying attention to).
Similarly, I think it’s probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., get the model to solve tasks it otherwise would have intentionally failed at, or give us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start.
I think you might be centrally thinking about an LLM completing a few paragraphs, where I’m thinking of an AI doing a large scale scientific project? I buy that for many concepts X, you can work out the “neural correlates of X” in your AI. But that kind of thing is noisy (and plausibly can be optimized against by the AI) unless you have a deep understanding of what you are measuring. And optimizing against such imperfect metrics obviously wouldn’t do much beyond destroying the metric. I do think research in this direction has a chance of being useful, but mainly by being upstream of much better understanding.
By leaning more on generalization, I mean leaning more on the data efficiency thing
Sorry for misinterpreting you, but this doesn’t clarify what you meant.
also weak-to-strong generalization ideas.
I think I don’t buy the analogy in that paper, and I don’t find the results surprising or relevant (by my current understanding, after skimming it). My understanding of the result is “if you have a great prior, you can use it to overcome some label noise and maybe also label bias”. But I don’t think this is very relevant to extracting useful work from a misaligned agent (which is what we are talking about here), and based on the assumptions they describe, I think they agree? (I just saw appendix G, I’m a fan of it, it’s really valuable that they explained their alignment plan concisely and listed their assumptions).
I could imagine starting with a deceptively aligned AI whose goal is “Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans”. And if we could change this AI to have the goal “Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans”, that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of ‘supervision’, thus making it easier to elicit good work from AIs (ideally not just ‘looks good’).
If we can tell we have such an AI, and we can tell that our random modifications are affecting the goal, and also the change is roughly one that helps us rather than changing many things that might or might not be helpful, this would be a nice situation to be in.
I don’t feel like I’m talking about AIs which have “taking-over-the-universe in their easily-within-reach options”. I think this is not within reach of the current employees of AGI labs, and the AIs I’m thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes).
Section 6 assumes we have failed to control the AI, so it is free of weird/strict constraints, and free to scale itself up, improve itself, etc. So my comment is about an AI that no longer can be assumed to have human-ish capabilities.
There are enough open threads that I think we’re better off continuing this conversation in person. Thanks for your continued engagement.
Seems like you have some coherent, thoughtful views on this topic—are they written up somewhere? Some positive account of what you think alignment success will realistically look like, for example?