For context: Are you part of that alignment effort?
I could easily imagine that changes to the pre-training regime can lead to more robust agents with fewer of the obvious superficial failure modes. Naively, it also makes sense that it moves us into a regime that appears strictly safer than doing unconstrained pretraining and then doing RLHF. I don’t see how it reliably tells us anything about how things generalize to a system that is capable of coherent reasoning that is not necessarily legible to us.
I.e. I don’t see how to update at all away from the (alignment-technical) pessimistic scenario. I could see how it might help move us away from a maximally socially pessimistic scenario, i.e. one where the techniques that we pursue seem to aggressively optimize for deception and try to fix failure modes only after they have already appeared.
For context: Are you part of that alignment effort?
No.
I don’t see how it reliably tells us anything about how things generalize to a system that is capable of coherent reasoning that is not necessarily legible to us.
I think it does generalize pretty straightforwardly, since it attacks core problems of alignment like Goodhart’s law, deceptive alignment, and misaligned power seeking. In the Pretraining from Human Feedback work, they’ve completely or almost completely solved the deceptive alignment problem, solved the most severe versions of Goodhart’s law by recreating Cartesian boundaries that work in the embedded world, and showed that as you give the model more data (which is a kind of capabilities increase), misalignment decreases, which is tentative evidence for a coupling of alignment and capabilities, where increasing capabilities leads to increasing alignment.
It also has a very small capabilities tax.
In particular, this is a huge blow to the pessimistic view of AI alignment, under which such a breakthrough in alignment via empiricism wouldn’t happen, or at least not without radical change, let alone the number of breakthroughs that the Pretraining from Human Feedback work showed.
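For concreteness, here’s a minimal sketch of the conditional-training idea from that line of work as I understand it: score pretraining text with a reward model, tag it with a control token, train the LM on the tagged text as usual, and condition on the “good” token at inference. The token names, threshold, and toy reward function below are my own placeholders, not the paper’s exact setup.

```python
# Minimal sketch of conditional training in the spirit of Pretraining from
# Human Feedback. All names, the threshold, and the reward rule are
# illustrative placeholders.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.0  # assumed cutoff between acceptable and unacceptable text


def reward(segment: str) -> float:
    """Stand-in for a learned reward model / classifier over pretraining text."""
    return -1.0 if "worthless idiot" in segment.lower() else 1.0  # toy rule


def tag(segment: str) -> str:
    """Prepend a control token so the LM learns p(text | token) directly,
    instead of having undesirable behaviour trained away after pretraining."""
    return (GOOD if reward(segment) >= THRESHOLD else BAD) + segment


corpus = [
    "A clear explanation of how binary search works.",
    "You are a worthless idiot.",
]
tagged_corpus = [tag(s) for s in corpus]
# Train a standard language model on tagged_corpus; at sampling time, prefix
# prompts with GOOD so generation comes from the desirable conditional.
```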
Meta: One reason I’m so optimistic is that I believe there’s a serious, pernicious bias toward negativity in the news, so I’m giving negative updates a higher burden of proof, or equivalently lowering the burden of proof for positive updates.
Has anyone tried to point out expected failure modes of that approach (beyond the general “we don’t know what happens when capabilities increase” that I was pointing at)? I’ll admit I don’t understand the details well enough right now to say anything, but it seems worth looking at!
I’m not sure I can follow your Meta-reasoning. I agree that the news is overly focused on current problems, but I don’t really see how that applies to AI alignment (except maybe as far as bias etc. are concerned). Personally, I try to go by who has the most logically inevitable-seeming chains of reasoning.
Has anyone tried to point out expected failure modes of that approach (beyond the general “we don’t know what happens when capabilities increase” that I was pointing at)?
Not right now, though more work is necessary to show that alignment keeps improving as the model improves along capability axes other than data. But that’s likely the only shortcoming of the paper.
Personally, I expect that Pretraining from Human Feedback will generalize to other capabilities and couple capabilities and alignment together.
I’m not sure I can follow your Meta-reasoning. I agree that the news is overly focused on current problems, but I don’t really see how that applies to AI alignment (except maybe as far as bias etc. are concerned). Personally, I try to go by who has the most logically inevitable-seeming chains of reasoning.
While logic and evidence do matter, my point is that there’s a general bias toward the negative view of things, since we’re drawn to negativity and the news serves us up more negative views than positive ones.
This has implications for arguably everything, including X-risk. The major implication is that we should differentially distrust negative updates relative to positive updates, and thus we should expect that things are reliably better than they seem.
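To make the adjustment concrete, here’s a toy Bayesian sketch of what I mean by discounting negative updates; every number below, including the discount factor, is made up purely for illustration.

```python
# Toy illustration of giving negative updates a higher burden of proof when
# the evidence stream itself is negativity-biased. Made-up numbers only.

def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

prior_odds_bad_outcome = 1.0   # start at 1:1 odds for illustration
lr_negative_report = 3.0       # face-value strength of one negative report
negativity_discount = 0.5      # assumed correction for oversampled bad news

naive = posterior_odds(prior_odds_bad_outcome, lr_negative_report)
corrected = posterior_odds(prior_odds_bad_outcome,
                           lr_negative_report * negativity_discount)
print(naive, corrected)  # 3.0 vs. 1.5: the same report should move you less
```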
Here’s the link for the issue of negativity bias:
https://www.vox.com/the-highlight/23596969/bad-news-negativity-bias-media