It seems like it would be easy to come up with a lot of moral questions and answers, then ask an AI which outcomes it predicts humans would prefer.
There’s a possibility that AI is not good at modeling human preferences, but if that’s the case, it will be very apparent at lower capability levels, because commands will have to be very specific to get results. Any model that can’t answer basic questions about its intended goals is not going to be given the (metaphorical) nuclear codes.
In fact, why wouldn’t you just test every AI by asking it to explain how it’s going to solve your problem before it actually solves it?
This article (by Eliezer Yudkowsky) explains why the suggestion in your 2nd paragraph won’t work: https://arbital.com/p/goodharts_curse/
I’m afraid I’ll butcher the argument by summarizing, but essentially: even slight misalignments get blown up as optimization pressure increases (i.e. the system will pursue the areas where it is misaligned at the expense of everything else). So you might have something aligned fairly well, and at the test optimization level you can check that it is indeed aligned pretty well, but when you turn up the pressure, it will find weaker points in the specification and optimize for those instead. And this problem recurs at the meta-level, so there’s no obvious way to say “well, obviously just don’t do that” that would actually work.
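To make the divergence-under-pressure effect concrete, here is a toy sketch (mine, not from the article): a proxy objective equals the true objective plus a heavy-tailed error term, and selecting the best of N candidates by the proxy yields ever-higher proxy scores while the true value of the selected candidate stays roughly ordinary. The distributions and numbers are arbitrary illustrative choices.

```python
import random

random.seed(0)

def true_value(x):
    # The "real" objective we care about.
    return x

def proxy_value(x, noise):
    # The measured/specified objective: true value plus misspecification error.
    return x + noise

def select(n_candidates):
    # Draw candidates; the heavy-tailed proxy error stands in for the weak
    # points in a specification that optimization can find and exploit.
    candidates = [(random.gauss(0, 1), random.paretovariate(1.5))
                  for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: proxy_value(*c))
    return true_value(best[0]), proxy_value(*best)

# More candidates = more optimization pressure on the proxy.
for pressure in (10, 1_000, 100_000):
    t, p = select(pressure)
    print(f"pressure={pressure:>6}  true={t:6.2f}  proxy={p:8.2f}")
```

As the candidate count grows, the winner is chosen almost entirely for its error term rather than its true value; that is the sense in which "turning up the pressure" exploits whatever the specification gets wrong, even if the proxy looked well aligned at low pressure.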
The problem with asking the AI how it will solve your problem is that if it is misaligned, it will just lie to you if that helps it complete its objective more effectively.
I think there may have been some miscommunication here: either I’m not understanding you or you’re not understanding me. I’ll explain my second-paragraph point in a different way in case the mistake was mine.
My model is that at lower capability levels, ‘misalignment’ will be measurable but not catastrophic. It would look like an advertising campaign that is funny but never features the product, or a tool that is very cheap but useless. Any misunderstanding of human preferences will lead to failure, so either humans will improve their ability to understand and communicate with AI, or vice versa; it will happen by necessity of getting AI to do anything at all.
That’s why you make its goal to communicate what it would do, not to actually do the thing. That approach, I admit, rests on the assumptions that oracular AI can be created, that we have enough time to build the superintelligence, and that we can then use the superintelligence to align itself.
On a side note, I enjoyed the article. My answer to the problem would be more testing, more training, and always using hypotheticals. I don’t see why you couldn’t ask an AI to predict what it would do if you gave it a particular goal system and let it out of the box.
Also, I don’t think it’s an argument against any one theory of AI; it seems like a general problem: “AI can misunderstand humans and their reactions.” That problem can be mitigated, prevented, and backdoored, but the problem and its solutions don’t seem to differ among AI systems, do they? If EY’s style of alignment wouldn’t have that problem, I’d need it explained to me how that would be the case.
I think the miscommunication between you and Adele mostly comes down to under-specification of which features you want your test-AGI to have.
If you throttle its ability to optimize for its goals, see EY and Adele’s arguments.
If you don’t throttle in this way, you run into goal-specification/constraint-specification issues, instrumental convergence concerns and everything that goes along with it.
I think most people here will strongly feel that a (computationally) powerful AGI with any incentives is scary, and that any test versions should use, at most, a much less powerful one.
Sorry if I’ve misunderstood you at all. If you specify the nature, goals, constraints, etc. of your test-AI more precisely, maybe I or someone else can give you more specific failure modes.