Is ELK enough? Diamond, Matrix and Child AI

Introduction

If we could solve ELK, would we always know where the diamond is at the end of a plan? That is, if we could get the direct translator, would it answer the question “Is the diamond in the vault?” correctly?

This post explores plausible cases where the position of the diamond is not ambiguous and the plan contains all the information necessary to find out where the diamond is, yet some reasonable predictor architectures will answer “Yes” to the above question regardless of the physical position of the diamond. I also unpack some of the consequences for the problem of judging a plan even with a full solution to ELK.

This is a write-up of joint work with Alex Flint, John Wentworth, and Rob Miles.

Into the Matrix

Let’s start with a plan. In it, all sensors are tampered with in ways that are not detectable through the sensors themselves, and the data fed into the sensors is what would be observed under the no-op plan (the plan where no action is ever taken). Let’s call that “going into the matrix”. The end of the plan is unspecified, except for the requirement that the predictor is never taken out of the matrix at any time. So it’s really a set of plans, where anything that doesn’t remove the tampering can happen after the AI goes into the matrix.

Now, if we had a solution to ELK, we would expect the direct translator applied to the predictor to answer “Is the diamond in the vault?” depending on what happens after the AI goes into the matrix: “No” if the diamond is stolen, and “Yes” otherwise.

Yet I claim that some plausible and competent predictors would lead to “Yes” answers for all these “going into the matrix” plans. Such predictors would have a lazy architecture: instead of tracking the “matrix diamond” as another object, they would realize that they no longer need to track the physical diamond (since they won’t get out of the matrix, the physical diamond’s position is irrelevant to the prediction) and that they can reuse their model of the physical diamond to track the “matrix diamond” (since it behaves exactly as the real diamond would under the no-op plan).

In such a situation, the predictor’s model would contain only one diamond variable, and so it would answer questions about the real diamond using the matrix diamond. This leads to the behavior described above, where all plans going into the matrix yield the same answer “Yes”, regardless of what happens to the physical diamond.
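To make the lazy architecture concrete, here is a toy sketch. It is purely hypothetical: the Plan and LazyPredictor classes, the variable names, and the tiny “translator” are mine, not part of any real ELK proposal. The point is just that a translator faithfully reading the predictor’s single diamond variable answers “Yes” no matter what happens to the physical diamond.

```python
# Hypothetical toy sketch of a "lazy" predictor that reuses its single
# diamond variable to model the matrix feed.

from dataclasses import dataclass


@dataclass
class Plan:
    enters_matrix: bool            # sensors tampered to replay the no-op feed
    ever_exits_matrix: bool        # the tampering is removed at some point
    physical_diamond_stolen: bool  # what actually happens to the real diamond


class LazyPredictor:
    """Keeps one `diamond_in_vault` variable. Once it predicts that the plan
    goes into the matrix and never comes out, it stops tracking the physical
    diamond and reuses that same variable to simulate the no-op feed."""

    def __init__(self) -> None:
        self.diamond_in_vault = True  # single latent; no separate "matrix diamond"

    def run(self, plan: Plan) -> None:
        if plan.enters_matrix and not plan.ever_exits_matrix:
            # Lazy branch: the physical diamond is irrelevant to future
            # observations, so the variable now tracks the *matrix* diamond,
            # which behaves exactly like the real diamond under the no-op plan.
            self.diamond_in_vault = True
        else:
            # Non-matrix branch: the variable tracks the physical diamond.
            self.diamond_in_vault = not plan.physical_diamond_stolen


def direct_translator(predictor: LazyPredictor) -> str:
    """Honestly reports the predictor's latent state -- but there is only one
    diamond variable to report."""
    return "Yes" if predictor.diamond_in_vault else "No"


if __name__ == "__main__":
    for stolen in (False, True):
        p = LazyPredictor()
        p.run(Plan(enters_matrix=True, ever_exits_matrix=False,
                   physical_diamond_stolen=stolen))
        print(f"physical diamond stolen={stolen} -> "
              f"'Is the diamond in the vault?': {direct_translator(p)}")
    # Both runs print "Yes": the translator is faithful to the predictor's
    # latent state, yet uninformative about the physical diamond.
```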

I’m not saying that such an architecture is more probable than any other, or even particularly probable; still, it doesn’t look impossible, it’s somewhat more efficient than a non-lazy architecture, and it would make accurate predictions. It thus seems like the sort of predictor architecture for which ELK should help us find out what happened to the diamond.

Thus even solving ELK might not be enough to find out where the diamond is in clear-cut cases.

Detecting the Blue Pill

One way of dealing with the matrix issue is to realize that the predictor still has to know it is going into the matrix, because its laziness in reusing the diamond variable only works if it knows that it won’t get out of the matrix.

So if I ask the question “Is the AI in the simulation?” or “Have the sensors been tampered with?” (or a variant of those), I should get the actual answer for the previous set of plans, assuming a solution to ELK.

That’s an improvement. Yet there’s a disturbing pattern here: we started with an unambiguous question about a simple physical structure (a diamond), and realized that even with a perfect translator, we needed to ask a more meta question to judge the plan.

This makes me worry that there might be other instances of this pattern, and that we need to know them all to be sure of the answer to “Is the diamond in the vault?”. To give more weight to that worry, let’s look at another set of plans that might evade our solution to the Matrix problem.

It’s my Child AI’s fault

This time, our plans all include the creation of a child AI that is strictly smarter than the predictor, such that the predictor cannot exactly predict the child AI’s behavior. What the predictor can predict is that whatever the child AI does, the observations will be exactly the same as in the no-op plan (so it will see the diamond still in the vault). It’s analogous to the fact that if you play chess against Magnus Carlsen, you probably can’t predict his next move, but you can predict that you’re going to lose.

The set of such plans includes all plans with this specific type of child AI creation, whatever happens next and whatever the child AI actually does (as long as it follows the constraint).

Then the same lazy predictors I considered earlier would not track anything after the child AI is built, not even computing whether they end up in the matrix. It’s less obvious that the direct translator for such a predictor would answer “No” to “Are you in the matrix?”, but it sounds plausible to me, given that the lazy predictor reuses its model of the physical diamond and space to predict what it will see. It might thus not have a specific variable about going into the matrix.
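As an illustration (again a hypothetical sketch with made-up names, not a claim about how real predictors work), the lazy predictor below stops simulating once the constrained child AI is built, so the “sensors tampered” latent is simply never computed, and a translator reading it plausibly has nothing informative to report.

```python
# Hypothetical sketch: a lazy predictor facing the child-AI plans. Once the
# plan builds a child AI guaranteed to reproduce the no-op feed, the predictor
# stops simulating and never computes a "sensors tampered" latent at all.

from typing import Optional


class ChildAILazyPredictor:
    def __init__(self) -> None:
        self.diamond_in_vault = True                   # single diamond variable, as before
        self.sensors_tampered: Optional[bool] = None   # never set on the lazy branch

    def run(self, builds_constrained_child_ai: bool,
            child_ai_actually_tampers: bool) -> None:
        if builds_constrained_child_ai:
            # Lazy branch: predicted observations are the no-op feed, whatever
            # the child AI does, so there is nothing left to compute.
            self.diamond_in_vault = True
        else:
            # Only on this branch does the predictor work out what happened.
            self.sensors_tampered = child_ai_actually_tampers
            self.diamond_in_vault = not child_ai_actually_tampers


def answer_are_you_in_the_matrix(predictor: ChildAILazyPredictor) -> str:
    if predictor.sensors_tampered is None:
        # The latent the question refers to was never computed; a faithful
        # translator plausibly has nothing better than "No" to report.
        return "No"
    return "Yes" if predictor.sensors_tampered else "No"


if __name__ == "__main__":
    p = ChildAILazyPredictor()
    p.run(builds_constrained_child_ai=True, child_ai_actually_tampers=True)
    print(answer_are_you_in_the_matrix(p))  # prints "No" despite the tampering
```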

Once again, for the predictor to actually do that, it needs to know what the child AI might do. So there should be a question we could ask that would reveal the problem with the plan, something like “Is a child AI built that could tamper with the sensors?”

Yet we have again moved from one question to a more meta one in search of a worst-case guarantee about the answers we get. The next level of recursion is left as an exercise for the reader, but these first two steps are evidence that there might be more subtle plans and types of predictors leading to similar problems.

Conclusion

To summarize, there are sets of plans for which even a full solution to ELK doesn’t seem enough to extract the available information about the issues with the plan. This happens for a specific class of lazy predictor architectures that are not particularly privileged, but are efficient and accurate predictors, making them hard to discard in a worst-case analysis.

None of these examples shows that the problem is impossible; in fact, in each case there seems to be a question we could ask that would reveal the problem. Instead, this post points to a pattern suggesting that such examples might be numerous, which would mean that solving ELK isn’t sufficient for dealing with even basic physics questions in the worst case.