It is getting late here, so I will stop after this comment, and look at this again tomorrow (I’m in Germany). Please treat the comment below as not fully thought through.
The problem, from my perspective, is that I don’t think the objective you are trying to approximate is a good objective (in other words, I am not referring to problems related to optimising a proxy; those also exist, but they are not the focus of my current comments). I don’t think it is a good idea to do what an abstract entity called “humanity” wants (and I think this is true from the perspective of essentially any human individual). I think it would be rational for essentially any human individual to strongly oppose the launch of any such “Group AI”. Human individuals and groups are completely different types of things, so I don’t think it should be surprising to learn that doing what a group wants is bad for the individuals in that group. This is a separate issue from problems related to optimising for a proxy.
I give one example of how things can go wrong in the post:
A problem with the most recently published version of CEV
This is of course just one specific example, and it is meant as an introduction to the dangers involved in building an AI that is describable as “doing what a group wants”. Showing that a specific version of CEV would lead to an outcome that is far, far worse than extinction does not, on its own, prove that all versions of CEV are dangerous. I do however think that all versions of CEV are very, very dangerous. And I do think that this specific thought experiment can be used to hint at a more general problem. I also hope that this thought experiment will at least be sufficient to convince most readers that there might exist a deeper problem with the core concept. In other words, I hope it will be sufficient to convince most readers that you might be going after the wrong objective when you are analysing different attempts “to say what CEV is”.
While I’m not actually talking about implementation, perhaps it would be more productive to approach this from the implementation angle. How certain are you that the concept of Boundaries / Membranes provides reliable safety for individuals from a larger group that contains the type of fanatics described in the linked post? Let’s say it turns out that they do not, in fact, reliably provide such safety for individuals. How certain are you then that the first implemented system that relies on Boundaries / Membranes to protect individuals from such groups will in fact result in you being able to try again? I don’t think you can possibly know this with any degree of certainty. (I’m certainly not against safety measures. If anyone attempts to do what you are describing, then I certainly hope that this attempt will involve safety measures.) (I also have nothing against the idea of Boundaries / Membranes.)
An alternative (or parallel) path to trial and error is to try to make progress on the “what alignment target should be aimed at?” question. Consider what you would say to Bob, who wants to build a Suffering Reducing AI (SRAI). He is very uncertain of his definition of “Suffering”, and he is implementing safety systems. He knows that any formal definition of “Suffering” that he can come up with will be a proxy for the actually correct definition of Suffering. If it can be shown that some specific implementation of SRAI would lead to a bad outcome (such as an AI that decides to kill everyone), then Bob will respond that the definition of Suffering must be wrong (and that he has prepared safety systems that will let him try to find a better definition of “Suffering”).
This might certainly end well. Bob’s safety systems might continue to work until Bob realises that the core idea of building any AI that is describable as a SRAI will always lead to an AI that simply kills everyone (in other words: until he realises that he is going after the wrong objective). But I would say that a better alternative is to make enough progress on the “what alignment target should be aimed at?” question that it is possible to explain to Bob that he is, in fact, going after the wrong objective (and is not, in fact, dealing with proxy issues). (In the case of SRAI, such progress has of course been around for a while. I think I remember reading an explanation of the “SRAI issue”, written by Yudkowsky, decades ago. So, to deal with people like Bob, there is no actual need for us to make additional progress. But for people in a world where SRAI is the state of the art in terms of answering the “what alignment target should be aimed at?” question, I would advise them to focus on making further progress on this question.)
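To make the SRAI failure mode concrete, here is a toy sketch of my own (not anything Bob, or Yudkowsky, wrote, and the worlds and numbers are made up): if an optimiser searches over worlds to minimise any non-negative suffering measure, the empty world always scores at least as well as every inhabited world, so it converges on killing everyone regardless of which proxy definition of “Suffering” is plugged in.

```python
# Toy illustration (my own, with made-up worlds): an optimiser that minimises
# any non-negative "suffering" measure rates the empty world at least as highly
# as any inhabited one, no matter which proxy definition is plugged in.

def total_suffering(world, suffering_per_person):
    """Sum a non-negative per-person suffering score over everyone in the world."""
    return sum(suffering_per_person(p) for p in world)

def srai_choice(candidate_worlds, suffering_per_person):
    """Pick the candidate world with minimal total suffering."""
    return min(candidate_worlds, key=lambda w: total_suffering(w, suffering_per_person))

# Three made-up candidate worlds: a flourishing one, a mixed one, an empty one.
flourishing = [{"pain": 0.1}, {"pain": 0.2}]
mixed       = [{"pain": 0.0}, {"pain": 5.0}]
empty       = []

# Whatever non-negative proxy for suffering we try, the empty world scores 0.
for proxy in (lambda p: p["pain"], lambda p: p["pain"] ** 2, lambda p: 1.0):
    print(srai_choice([flourishing, mixed, empty], proxy))  # always [] -- everyone removed
```

The point is that swapping in a better proxy for “Suffering” never changes the conclusion; the problem sits in the objective itself, not in the proxy.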
Alternatively, I could ask what you would say to Bob if he thinks that “reducing Suffering” is “the objectively correct thing to do”, and is convinced that any implementation that leads to bad outcomes (such as an AI that kills everyone) must be a proxy issue. I think that, just as any reasonable definition of “Suffering” implies a SRAI that kills everyone, any reasonable set of definitions of “a Group” implies a Group AI that is bad for human individuals (in expectation, when that Group AI is pointed at billions of humans, from the perspective of essentially any human individual in the set of humans that the Group AI is pointed at, compared to extinction). In other words, a Group AI is bad for human individuals in expectation, in the same sense as a SRAI kills everyone. I’m definitely not saying that this is true for “minds in general”. If Dave is able to reliably see all implications of any AI proposal (or if Dave is invulnerable to a powerful AI that is trying to hurt Dave, or if the minds that the Group AI will be pointed at are known to be “friendly towards Dave” in some formal sense that is fully understood by Dave), then this might not be true for Dave. But I claim that it is true for human individuals.
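To show what I mean by “bad in expectation, compared to extinction”, here is a toy expected-value sketch with made-up numbers (my own illustration, not the argument in the linked post). Extinction is normalised to utility 0 for every individual. Assume that, with some small probability q, the way the Group AI aggregates “what the group wants” ends up enforcing what a fanatical faction wants done to you, and that this outcome is far worse for you than death.

```python
# Toy sketch with made-up numbers (my own illustration). Extinction is
# normalised to utility 0 for each individual. If there is even a small chance
# that "what the group wants" resolves in favour of fanatics who want something
# done to you that you would rate as far worse than death, the expectation for
# a randomly chosen individual can drop below the extinction baseline.

GOOD_UTILITY = 1.0         # assumed value to you of getting the life you want
BAD_UTILITY = -1000.0      # assumed value to you of the fate the fanatics want for you
Q_FANATICS_PREVAIL = 0.01  # assumed chance the aggregation resolves against you

def expected_utility_of_group_ai(good=GOOD_UTILITY, bad=BAD_UTILITY, q=Q_FANATICS_PREVAIL):
    """Expected utility for a random individual, with extinction fixed at 0."""
    return (1.0 - q) * good + q * bad

print(expected_utility_of_group_ai())  # about -9.01: below the extinction baseline of 0
```

The specific numbers are irrelevant; the point is the asymmetry. As long as the downside the fanatics want for you is vastly larger than the upside, even a small probability of the aggregation going against you pushes the expectation below zero.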