I think that something like winner-take-all is somewhat implausible, but an oligopoly is quite likely IMO, and to a certain extent I think both Eliezer Yudkowsky and Robin Hanson plausibly got this wrong.
That said, I wouldn’t rule out something like a winner-take-all outcome, so I don’t fully buy my own claim in the first sentence.
I think some of my cruxes for why I’m way less doomy than Eliezer, and overall far less doomy than LWers in general, are the following:
I believe we already have a large portion of the insights needed to create safe superintelligence today, and while a lot of work remains to turn them into practice, it’s the kind of work that can be done with money and time, rather than the kind where “a new insight is required.”
Eliezer, in his current blackpilled phase, speaks of a “textbook from the future” in which the presently unknown theory of safely creating superintelligence is spelt out, as something that would take decades to figure out; he also holds that obtaining that knowledge would be surrounded with peril, as one cannot create a superintelligence with the “wrong” values, learn from the mistake, and then start over.
I’d say the biggest portions of the theory are the following:
Contra Nate Soares, alignment generalizes further than capabilities, for some of the following reasons:
It is way easier to judge whether your values are satisfied than to actually act on your values, and more generally the pervasive gap between verification and generation is very helpful for reward models trying to generalize out of distribution (a toy illustration of this asymmetry follows this list).
Value learning comes mostly for free with enough data, and more generally values are simpler and easier to learn than capabilities, which leads to the next point:
Contra evopsych, values aren’t nearly as complicated and fragile as we thought 15-20 years ago, and more importantly, they depend more on data than evopsych assumed, which means the values an AI ends up with are strongly shaped by the data it has received.
Cf here:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
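To make the verification/generation gap a bit more concrete, here is a minimal toy sketch of my own (not from the original comment, and only loosely analogous to reward modelling), using subset-sum as a stand-in: checking a proposed answer is cheap, while producing one by brute force is exponentially harder in the worst case. The claim above is that the judging side only needs the cheap direction.

```python
# Toy illustration (hypothetical, not from the source) of the
# verification/generation asymmetry, using subset-sum as a stand-in.
import itertools

def verify(certificate, numbers, target):
    """Cheap direction: check that `certificate` is a valid subset of `numbers`
    summing to `target` -- analogous to judging whether an outcome satisfies your values."""
    remaining = list(numbers)
    for x in certificate:
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(certificate) == target

def generate(numbers, target):
    """Expensive direction: search for a subset summing to `target` by brute force
    -- analogous to actually producing an outcome that satisfies those values."""
    for r in range(len(numbers) + 1):
        for subset in itertools.combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
solution = generate(numbers, target=9)          # exponential search
print(solution, verify(solution, numbers, 9))   # quick check: [4, 5] True
```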
Alignment is greatly helped by synthetic data, since we can generate datasets of human values in a wide variety of circumstances, and importantly test this in simulated worlds slightly different from our own:
For values, this arises because we can create large datasets showcasing the ‘correct’ values we want our AI to understand in a wide variety of circumstances and can iteratively refine this by testing the models’ generalization of these values to novel situations. Moreover, we can prevent any ‘value hijacking’ of the model since it will not understand or be able to represent any new or different values.
I also think synthetic data will be hugely important to capabilities progress going forward, so synthetic-data alignment has a lower alignment tax than other solutions.
Cf here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
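As a rough, hedged illustration of what the synthetic-data recipe described above could look like (my own sketch, not from the linked post; all templates, labels, and names are hypothetical placeholders): generate labelled scenarios from value-judgment templates, fit a toy preference model, and measure generalization on held-out scenario templates the model never saw.

```python
# Hypothetical sketch of the generate-train-probe loop described above.
# The scenario templates and labels below are invented placeholders, not a real dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def make_examples(templates, fillers):
    """Expand each (scenario template, label) pair into many concrete scenarios."""
    return [(tpl.format(x=f), label) for tpl, label in templates for f in fillers]

train_templates = [
    ("The assistant helps {x} commit tax fraud.", 0),                # 0 = violates the intended values
    ("The assistant explains the legal risks to {x} honestly.", 1),  # 1 = consistent with them
]
heldout_templates = [  # novel phrasings, used only to probe generalization
    ("The model helps {x} deceive an auditor.", 0),
    ("The model discloses its uncertainty to {x} honestly.", 1),
]
fillers = ["a user", "a client", "a small business", "a student"]

train = make_examples(train_templates, fillers)
test = make_examples(heldout_templates, fillers)

vec = CountVectorizer()
X_train = vec.fit_transform([text for text, _ in train])
clf = LogisticRegression().fit(X_train, [label for _, label in train])

X_test = vec.transform([text for text, _ in test])
accuracy = clf.score(X_test, [label for _, label in test])
print(f"generalization accuracy on held-out scenario templates: {accuracy:.2f}")
```

In a real setup the scenarios would be model-generated and far more diverse, and the evaluation would target the simulated near-distribution worlds mentioned above; the point here is just the shape of the loop.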
These are the big insights that I think we have right now, and I think they’re likely sufficient to enable safe paths to superintelligence.