How do you actually sidestep the need for the One True Objective Function, given an ELK solution? I get that it might seem plausible to take a rough objective like “do what I intend” and look at the internal knowledge of the thing for signs that it is deliberately deceiving you. But if you do that, you’ll get, at best, an AI that doesn’t know that it is deceiving you (for whatever operationalization of “know” you come up with as you use ELK for training). It could still be deceiving you, and very likely will be if the optimization pressure is merely towards “AIs that don’t know that they are being deceptive”.
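To make the worry concrete, here is a minimal sketch of the kind of setup I have in mind; the `model` and `reporter` objects and their methods are purely illustrative stand-ins, not anything from the ELK report:

```python
def select_nondeceptive_action(model, reporter, candidate_actions, observation):
    """Keep only actions the ELK reporter does not flag as knowingly deceptive,
    then pick the highest-scoring survivor.

    The failure mode: this only filters on what the model *knows* (under
    whatever operationalization the reporter gives you). Optimization pressure
    therefore favors policies whose deception never shows up in that readout,
    i.e. AIs that don't know they are being deceptive, not AIs that aren't.
    """
    survivors = []
    for action in candidate_actions:
        prediction = model.predict(observation, action)
        flagged = reporter.answer(
            prediction,
            "Does the model believe this action deliberately deceives the human?",
        )
        if not flagged:
            survivors.append((model.score(prediction), action))
    if not survivors:
        return None  # fall back to doing nothing rather than act deceptively
    return max(survivors, key=lambda pair: pair[0])[1]
```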
Our very broad hope is to use ELK to select actions that (i) keep humans safe, and give them time and space to evolve according to their current (essentially local) preferences, and (ii) are expected to produce outcomes that would be judged favorably by those future humans, primarily by maximizing option value until it becomes clear what they want (see the strategy-stealing assumption).
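As a very rough illustration of that hope (the question wordings, the hard safety constraint, and the reporter interface below are all stand-ins, not a concrete proposal), action selection might score predicted outcomes along those two dimensions rather than against a single fully-specified objective:

```python
def score_action(reporter, prediction):
    """Score a predicted outcome on (i) human safety and (ii) option value.

    Treat safety as a hard constraint and otherwise defer: rather than
    encoding One True Objective, prefer outcomes that keep the future
    humans' options open until their preferences become clear.
    """
    humans_safe = reporter.answer(
        prediction,
        "In this outcome, are the humans alive, uncoerced, and free to keep "
        "deliberating according to their current preferences?",
    )
    if not humans_safe:
        return float("-inf")
    option_value = reporter.answer(
        prediction,
        "How much of the humans' future flexibility does this outcome preserve?",
    )
    return option_value
```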
This is discussed very briefly in this appendix of the ELK report and the subsequent appendix. There are two or three big foreseeable difficulties with this approach, and likely a bunch of other problems besides.
I don’t think this should be particularly persuasive, but it hopefully illustrates how ARC is currently thinking about this part of the problem. Overall, my current view is that this is fairly unlikely to be the weakest link in the plan, i.e. if the plan doesn’t work, it will be because of a failure at an earlier step, and so it’s not one of the main things I’m thinking about.