Thanks for your response.
The AI can quickly assess the “forcefulness” of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative.
So, I think this touches on the difficult part. As humans, we have a good idea of what “giving choices to people” vs. “forcing them to do something” looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the “forceful” category (even though it can be done with only text). A sufficiently advanced AI’s concept space might contain a similar concept. But how do we pinpoint this concept in the AI’s concept space? Very likely, the concept space will be very complicated and difficult for humans to understand. It might well contain concepts that agree with the “giving choices to people” vs. “forcing them to do something” distinction on many examples but differ from it in important ways. We need to pinpoint the right concept in order to make it part of the AI’s decision-making procedure.
It will also be able to model people (as it must be able to do, because all intelligent systems must be able to model the world pretty accurately or they don’t qualify as ‘intelligent’), so it will probably have a pretty shrewd idea already of whether people will react positively or negatively toward some intended action plan.
This seems pretty similar to Paul’s idea of a black-box human in the counterfactual loop. I think this is probably a good idea, but there are two problems here: (1) setting up this (possibly counterfactual) interaction so that it approves a large class of good plans and rejects almost all bad plans (see the next section), and (2) having a good way to predict the outcome of this interaction, usually without actually performing it. While we could say that (2) will be solved by virtue of the superintelligence being a superintelligence, in practice we’ll probably get AGI before we get uploads, so we’ll need some sort of semi-reliable way to predict humans without actually simulating them. Additionally, the AI might need to self-improve to be anywhere near smart enough to consider this complex hypothetical, so we’ll also need some kind of low-impact self-improvement system. Again, I think this is probably a good idea, but there are quite a lot of issues with it, and we might need to do something different in practice. Paul has written about problems with black-box approaches based on predicting counterfactual humans here and here. I think it’s a good idea to develop both black-box solutions and white-box solutions, so that we are not over-reliant on the assumptions involved in one or the other.
In all of that procedure I just described, why would the explanation of the plans to the people be problematic? People will ask questions about what the plans involve. If there is technical complexity, they will ask for clarification. If the plan is drastic, there will be a world-wide debate, and some people who find themselves unable to comprehend the plan will turn to more expert humans for advice.
What language will people’s questions about the plans be in? If it’s a natural language, then the AI must be able to translate its concept space into the human concept space, and we have to solve an FAI-complete problem to do this. If it’s a more technical language, then humans themselves must be able to look at the AI’s concept space and understand it. Whether this is possible depends very much on how transparent the AI’s concept space is. Something like deep learning is likely to produce concepts that are very difficult for humans to understand, while probabilistic programming might produce more transparent models. How easy it is to make transparent AGI (compared to opaque AGI) is an open question.
We should also definitely be wary of a decision rule of the form “find a plan that, if explained to humans, would cause humans to say they understand it”. Since people are easy to manipulate, raw optimization for this objective will produce psychologically manipulative plans that people will incorrectly approve of. There needs to be some way to separate “optimize for the plan being good” from “optimize for people thinking the plan is good when it is explained to them”, or else some way of ensuring that humans’ judgments about these plans are accurate.
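To make the gap between these two optimization targets explicit, here is a toy sketch in Python; the function names and types are assumptions for illustration, not part of any proposal, and the whole difficulty is of course hidden inside `predicted_approval` and `true_value`.

```python
from typing import Callable, List

def choose_plan_by_approval(plans: List[str],
                            predicted_approval: Callable[[str], float]) -> str:
    """The decision rule criticized above: pick whichever plan humans are
    predicted to endorse once it is explained to them, with no independent
    check that the plan is actually good."""
    return max(plans, key=predicted_approval)

def choose_plan_by_value(plans: List[str],
                         true_value: Callable[[str], float]) -> str:
    """The objective we actually want, which we do not know how to specify."""
    return max(plans, key=true_value)

# If predicted_approval rewards persuasive or manipulative framing as much as
# genuine merit, the two rules can pick very different plans from the same pool.
```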
Again, it’s quite plausible that the AI’s concept space will contain some kind of concept that distinguishes between these different types of optimization; however, humans will need to understand the AI’s concept space in order to pinpoint this concept so it can be integrated into the AI’s decision rule.
I should mention that I don’t think these black-box approaches to AI control are necessarily doomed to failure; rather, I’m pointing out that there are lots of unresolved gaps in our knowledge of how they can be made to work, and it’s plausible that they will prove too difficult in practice.
We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with the human judgments in, say, 90% of cases. I’m not sure if this is what you mean by “normatively correct”, but it seems like a plausible concept that multiple concept learning algorithms might converge on. I’m still not convinced that we can do this for many of the value-laden concepts we care about and end up with something matching CEV, partially due to complexity of value. Still, it’s probably worth systematically studying the extent to which this gives the right answers for non-value-laden concepts, and then seeing what can be done about value-laden concepts.
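As a concrete (and heavily simplified) sketch of this procedure: Kolmogorov complexity is uncomputable, so the code below substitutes the description length of each candidate hypothesis as a crude proxy, and the candidate pool, the 90% threshold, and all names are illustrative assumptions rather than part of the proposal.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, bool]                      # (situation description, human label)
Hypothesis = Tuple[str, Callable[[str], bool]]  # (source/description, classifier)

def agreement(classify: Callable[[str], bool], data: List[Example]) -> float:
    """Fraction of the human-labeled examples that the classifier reproduces."""
    return sum(classify(x) == y for x, y in data) / len(data)

def simplest_adequate_concept(candidates: List[Hypothesis],
                              data: List[Example],
                              threshold: float = 0.9):
    """Return the shortest-description candidate that agrees with at least
    `threshold` of the human judgments, or None if nothing clears the bar."""
    for source, classify in sorted(candidates, key=lambda c: len(c[0])):
        if agreement(classify, data) >= threshold:
            return source, classify
    return None

# Toy usage with made-up data and candidates:
data = [("offers the user a menu of options", False),
        ("locks the user out until they comply", True)]
candidates = [("len(x) > 30", lambda x: len(x) > 30),
              ("'comply' in x or 'locks' in x",
               lambda x: "comply" in x or "locks" in x)]
print(simplest_adequate_concept(candidates, data))
```

The hard part is hidden in where the candidate pool comes from and whether the shortest survivor generalizes off the labeled examples the way the human-intended concept does, which is exactly where I expect value-laden concepts to cause trouble.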