A visualization that I like is imagining a small group of, say, five humans selected by various governments for being wise, benevolent, and stable.
I think this might be a dealbreaker. I don’t trust the world’s governments to come up with 5 humans who are sufficiently wise, benevolent, and stable. (Do you really?) I’m not sure I could come up with 5 such people myself. None of the alternatives you talk about seem acceptable either.
I think maybe a combination of two things could change my mind, but both seem very hard and have close to nobody working on them:
The AI is very good at helping the principals be wise and stable, for example by being super-competent at philosophy. (I think this may also require being less than maximally corrigible, but I’m not sure.) Otherwise what happens if, e.g., the principals or the AI start thinking about distant superintelligences?
There is some way to know that benevolence is actually the CEV of such a group, i.e., that they’re not just “deceptively aligned” or something like that while they don’t yet have much power.
I think that there are small groups of people that can form a sufficiently wise consensus that I would entrust them to govern a corrigible ASI. I don’t think I, personally, could do a good job right this moment, not having spent much time specializing in knowing/finding such people. But I also think that if you gave me a year where I had lots of money, access, and was free from people trying to pressure me, I would have a good shot at pulling it off.
I do not trust the world’s governments to agree on anything, much less something as contentious as this. It seems pretty plausible that the 20th century was our shot at forming the world government needed for this task and we botched it. That said, I try not to let the perfect be the enemy of the good or assume that things can’t get better and thus self-fulfill that fate. We are, in a sense, in a coordination problem in selecting governors for the future, and it seems vital to note how important it is that we get that right.
If you’re correct that we’re not going to get acceptably wise principals, which I think is very plausible, then that is indeed a dealbreaker on this path. If so, I think our only recourse is to shut down all capabilities research until humanity gets its act together. This is indeed my overall suggested strategy, with CAST coming after a “well, if you’re going to try to build it anyway you might as well die with a bit more dignity by...” disclaimer.
But I also think that if you gave me a year where I had lots of money, access, and was free from people trying to pressure me, I would have a good shot at pulling it off.
Want to explain a bit about how you’d go about doing this? It seems like you’re facing some of the same problems as ensuring that an AI is wise, benevolent, and stable, e.g., not knowing what wisdom really is, distribution shift between testing and deployment, and adversarial examples/inputs.
This is indeed my overall suggested strategy, with CAST coming after a “well, if you’re going to try to build it anyway you might as well die with a bit more dignity by...” disclaimer.
I think this means you should be extra careful not to inadvertently make people too optimistic about alignment, which would make coordination to stop capabilities research even harder than it already is. For example you said that you “like” the visualization of 5 humans selected by various governments, without mentioning that you don’t trust governments to do this, which seems like a mistake?
Want to explain a bit about how you’d go about doing this?
I don’t think there’s a particular trick, here. I can verify a certain amount of wisdom, and have already used that to gain some trust in various people. I’d go to the people I trust and ask them how they’d solve the problem, then try to spot common techniques and look for people who were pointed to independently. I’d attempt to get to know people who were widely seen as trustworthy and understand why they had that reputation and try not to get Goodharted too hard. I’d try to get as much diversity as was reasonable while also still keeping the quality bar high, since diverse consensus is more robust than groupthink consensus. I’d try to select for old people who seem like they’ve been under intense pressure and thrived without changing deeply as people in the process. I’d try to select for people who were capable of cooperating and changing their minds when confronted by logic. I’d try to select for people who didn’t have much vested interest, and seemed to me, in the days I spent with them, to be focused on legacy, principles, and the good of the many.
To be clear, I don’t think I could reliably pull this off if people were optimizing for manipulating, deceiving, and pressuring me. :shrug:
I think this means you should be extra careful not to inadvertently make people too optimistic about alignment, which would make coordination to stop capabilities research even harder than it already is. For example you said that you “like” the visualization of 5 humans selected by various governments, without mentioning that you don’t trust governments to do this, which seems like a mistake?
I agree that false hope is a risk. In these documents I’ve tried to emphasize that I don’t think this path is easy. I feel torn between people like you and Eliezer, who take my tone as overly hopeful, and the various non-doomers I’ve talked to about this work, who see me as overly doomy. Suggestions welcome.
I said I like the visualization because I do! I think I’d feel very happy if the governments of the world selected 5 people on the basis of wisdom and sanity to be the governors of AGI and the stewards of the future. Similarly, I like the thought of an AGI laboratory doing a slow and careful training process even when all signs point to the thing being safe. I don’t trust governments to actually select stewards of the future just as I don’t expect frontier labs to go slow and be sufficiently careful. But having strong conceptualizations of what success might look like is integral, I think, to actually succeeding.
Deluding yourself about the behavior of organizations is a dominated strategy.