I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?
If I read Paul’s post correctly, ALBA is supposed to do this in theory—I don’t understand the theory/practice distinction you’re making.
I’m not sure you’ve gotten ALBA quite right here, and I think that causes a problem for your objection. Relevant writeups: most recent and original ALBA.
As I understand it, ALBA proposes the following process:
1. H trains A to choose actions that would get the best immediate feedback from H. A is benign (assuming that H could give not-catastrophic immediate feedback for all actions and that the learning process is robust). H defines the feedback, so A doesn’t make decisions that are more effective at anything than H is; A is just faster.
2. A (and possibly H) is used to define a slow process A+ that makes “better” decisions than A or H would. (Better is in quotes because we don’t have a definition of better; the best anyone knows how to do right now is look at the amplification process and say “yep, that should do better.”) Maybe H uses A as an assistant, maybe a copy of A breaks down a decision problem into parts and hands them off to other copies of A, maybe A makes decisions that guide a much larger cognitive process.
3. The whole loop starts over with A+ used as H.
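The following is just my own minimal sketch of that loop, to make the structure explicit; train, amplify, and the variable names are placeholders I’m introducing, not anything from the ALBA writeups:

```python
def alba(initial_overseer, train, amplify, num_rounds):
    """A rough sketch of the ALBA loop as I read it (all names are my own placeholders).

    train(feedback_source):   learns a fast agent A that picks actions the current
                               overseer would rate highly as *immediate* feedback.
    amplify(agent, overseer): builds a slower but "better" decision-maker A+ out of
                               A (and possibly the overseer), e.g. by letting copies
                               of A decompose problems and hand pieces to each other.
    """
    overseer = initial_overseer  # initially the human H
    for _ in range(num_rounds):
        # Step 1: A only judges as well as the overseer does; it's just faster.
        fast_agent = train(feedback_source=overseer)
        # Step 2: a slow process A+ that should make "better" decisions than A or H.
        amplified_agent = amplify(fast_agent, overseer)
        # Step 3: the whole loop starts over with A+ playing the role of H.
        overseer = amplified_agent
    return overseer
```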
The claim is that step 2 produces a system that is able to give “better” feedback than the human could—feedback that considers more consequences more accurately in more complex decision situations, that has spent more effort introspecting, etc. This should make it able to handle circumstances further and further outside human-ordinary, eventually scaling up to extraordinary circumstances. So, while you say that the best case to hope for is , it seems like ALBA is claiming to do more.
A second objection is that while you call each a “reward function”, each system is only trained to take actions that maximize the very next reward it gets (not the sum of future rewards). This means that each system is only effective at anything insofar as the feedback function it’s maximizing at each step considers the long-term consequences of each action. So, if , we don’t have reason to think that the system will be competent at anything outside of the “normal circumstances + a few exceptions” you describe; all of its planning power comes from , so we should expect it to be basically incompetent where is incompetent.
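To spell out the distinction (the notation here is mine, not from your post): a myopically trained system optimizes only the immediate feedback, whereas a planner optimizes a sum over future consequences.

```latex
% Myopic training: at each step the system just maximizes the very next reward
a_t^{*} = \arg\max_{a} \; r(s_t, a)

% Long-horizon planning: maximize an expected discounted sum of future rewards
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k} \, r(s_{t+k}, a_{t+k}) \right]
```

So any long-horizon competence has to already be encoded in the feedback function the system is myopically optimizing.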
FWIW, this also reminded me of some discussion in Paul’s post on capability amplification, where Paul asks whether we can even define good behavior in some parts of capability-space, e.g.:
The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part H? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.
I’m not sure if that’s relevant to your point, but it seemed like you might be interested.
Discussed briefly in Concrete Problems, FYI: https://arxiv.org/pdf/1606.06565.pdf
This is a neat idea! I’d be interested to hear why you don’t think it’s satisfying from a safety point of view, if you have thoughts on that.
Thanks for writing this, Jessica—I expect to find it helpful when I read it more carefully!
Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / Vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of “figuring out how to think about big / self-like hypotheses”. Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?
Thanks Jessica. This was helpful, and I think I see more what the problem is.
Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis containing a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It’s clearly not possible for the system to hold and update infinitely many hypotheses the way Solomonoff does, and a system would need some kind of logical uncertainty or other magic to evaluate complex or self-referential hypotheses, but it seems like these hypotheses should be “in its class”. Does this make sense, or do you think there is a mistake there?
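To make the contrast a bit more concrete (the first formula is the standard Solomonoff prior; the bounded variant is a toy of my own, not something from the post):

```latex
% Solomonoff induction mixes over all programs at once (uncomputable, and it
% cannot contain a hypothesis as large as the inductor itself):
M(x) = \sum_{p \,:\, x \text{ is a prefix of } U(p)} 2^{-|p|}

% Toy bounded variant: at time t the system holds only a finite hypothesis set
% H_t, but any computable hypothesis h (including an arbitrarily precise
% self-model) can eventually be added to some H_t as evidence accumulates:
M_t(x) = \sum_{h \in \mathcal{H}_t} w_t(h) \, h(x), \qquad \mathcal{H}_t \subseteq \mathcal{H}_{t+1}
```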
Re point 2: I’m not confident that’s an accurate summary; I’m precisely proposing that the agent learn a model of the world containing a model of the agent (approximate or precise). I agree that evaluating this kind of model will require logical uncertainty or similar magic, since it will be expensive and possibly self-referential.
Re point 3: I see what you mean, though for self-modeling the agent being predicted should only be as smart as the agent doing the prediction. It seems like approximation and logical uncertainty are the main ingredients needed here. Are there particular parts of the unbounded problem that are not solved by reflective oracles?
Thanks, Paul—I missed this response earlier, and I think you’ve pointed out some of the major disagreements here.
I agree that there’s something somewhat consequentialist going on during all kinds of complex computation. I’m skeptical that we need better decision theory to do this reliably—are there reasons or intuition-pumps you know of that have a bearing on this?
Thanks Jessica, I think we’re on similar pages—I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
Thanks Jessica—sorry I misunderstood about hijacking. A couple of questions:
- Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
- My feeling is that today’s understanding of planning (if I run this computation, I will get the result, and if I run it again, I’ll get the same one) is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well that you expect to be important for planning computation during prediction?
World-models containing self-models
I agree with paragraphs 1, 2, and 3. To recap, the question we’re discussing is “do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?”
A couple of notes on paragraph 4:
I’m not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don’t include built-in planning faculties.
You are bringing up understandability of an NTM-based human-decision-predictor. I think that’s a fine thing to talk about, but it’s different from the question we were talking about.
You’re also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.
In paragraph 5, you seem to be proposing that to make any competent predictor, we’ll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument (“predicting planners requires planning faculties so that you can emulate the planner” vs “predicting anything requires some amount of prioritization and decision-making”). In these cases, I’m more skeptical that a deep theoretical understanding of decision-making is important, but I’m open to talking about it—it just seems different from the original question.
Overall, I feel like this response is out-of-scope for the current question—does that make sense, or do I seem off-base?
Thanks, Jessica. This argument still doesn’t seem right to me—let me try to explain why.
It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain. This makes me think that an AI won’t need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it’ll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense?
(I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)
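Here’s a toy illustration of what I mean: fit a generic learned model to (situation, decision) pairs from a human planner, with no hand-built planning faculty anywhere in the model class. (sklearn’s MLPClassifier is just a stand-in for “some efficiently learnable model class”, and the data is fabricated for illustration.)

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Pretend these are featurized decision situations and the human's observed choices.
situations = rng.normal(size=(1000, 20))
human_choices = (situations[:, :5].sum(axis=1) > 0).astype(int)

# The predictor is a generic function approximator; nothing in it "plans".
predictor = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
predictor.fit(situations, human_choices)

# It can still approximately predict the planner's behavior in new situations.
new_situations = rng.normal(size=(10, 20))
print(predictor.predict(new_situations))
```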
“Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.”
I’ve had this conversation with Nate before, and I don’t understand why I should think it’s true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?
Very thoughtful post! I was so impressed that I clicked the username to see who it was, only to see the link to your LessWrong profile :)
Request for comments: introductory research guide
Just wanted to mention that watching this panel was one of the things that convinced me to give AI safety research a try :) Thanks for re-posting, it’s a good memory.
To at least try to address your question: one effect could be that there are coordination problems, where many people would be trying to “change the world” in roughly the same direction if they knew that other people would cooperate and work with them. This would result in less of the attention drain you suggest. This seems more like what I’ve experienced.
I’m more worried about people being stupid than mean, but that could be an effect of the bubble of non-mean people I’m part of.
My comment, for the record:
I’m glad to see people critiquing Paul’s work—it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of “benign”, I don’t think humans are benign, so I’m not going to argue with that. Instead, I’ll say what I think about building aligned AIs out of simulated human judgement.
I agree with you that listing and solving problems with such systems until we can’t think of more problems is unsatisfying, and that we should have positive arguments for confidence that we won’t hit unforeseen problems; maybe at some point we need to give up on getting those arguments and do the best we can without them, but it doesn’t feel like we’re at that point yet. I’m guessing the main difference here is that I’m hopeful about producing those arguments and you think it’s not likely to work.
Here’s an example of how an argument might go. It’s sloppy, but I think it shows the flavor that makes me hopeful. Meta-execution preserving a “non-corrupting” invariant (a rough code sketch of the induction follows the list):
i. define a naturally occurring set of queries nQ.
ii. have some reason to think that nq in nQ are very unlikely to cause significant value drift in Som in 1 hour (nq are “non-corrupting”).
iii. let Q be the closure of nQ under “Som spends an hour splitting q into sub-queries”.
iv. have some reason to think that Som’s processing never purposefully converts non-corrupting queries into corrupting ones.
v. have some defense against random noise producing corrupting nq or q.
vi. conclude that all q in Q are non-corrupting, and so the system won’t involve any value-drifted Soms.
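Here is the rough code sketch I mentioned above, just to show the shape of the induction; `split` stands in for “Som spends an hour splitting q into sub-queries” and `is_corrupting` stands in for the property we want to rule out, both placeholders rather than real components:

```python
from collections import deque
from typing import Callable, Hashable, Iterable

def closure_is_non_corrupting(
    natural_queries: Iterable[Hashable],                  # nQ, step (i)
    split: Callable[[Hashable], Iterable[Hashable]],      # step (iii)
    is_corrupting: Callable[[Hashable], bool],
) -> bool:
    """Check the inductive claim on a finite toy instance.

    Base case (ii):      every nq in nQ is non-corrupting.
    Inductive step (iv): splitting a non-corrupting query only yields
                         non-corrupting sub-queries.
    Conclusion (vi):     everything in the closure Q is non-corrupting, so no
                         value-drifted Soms appear anywhere in the tree.
    """
    seen = set()
    frontier = deque(natural_queries)
    while frontier:
        q = frontier.popleft()
        if q in seen:
            continue
        seen.add(q)
        if is_corrupting(q):
            return False  # the invariant fails somewhere in Q
        frontier.extend(split(q))  # close under splitting (toy: assumes Q is finite)
    return True
```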
This kind of system would run sort of like your (2) or Paul’s meta-execution (https://ai-alignment.com/meta-execution-27ba9b34d377).
There are some domains where this argument seems clearly true and Som isn’t just being used as a microprocessor, e.g. Go problems or conjectures to be proven. In these cases it seems like (ii), (iii), and (iv) are true by virtue of the domain—no Go problems are corrupting—and Som’s processing doesn’t contribute to the truth of (iii).
For some other sets Q, it seems like (ii) will be true because of the nature of the domain (e.g. almost no naturally occurring single pages of text are value-corrupting in an hour), (iv) will be true because it would take significant work on Som’s part to convert a non-scary q into a scary q’ and because Som wouldn’t want to do this unless they were already corrupted, and (v) can be made true by using a lot of different “noise seeds” and some kind of voting system to wash out noise-produced corruption.
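For (v) in particular, a crude version of the “many noise seeds plus voting” defense might look like the sketch below; run_som and the seed count are made-up placeholders, with run_som(query, seed) standing in for one hour of Som’s work under a particular noise seed:

```python
from collections import Counter

def denoised_answer(query, run_som, num_seeds=11):
    """Run independently noise-seeded copies of Som on the same query and take a
    majority vote, so corruption caused by rare random noise in any single run
    gets outvoted rather than propagated into the rest of the computation."""
    answers = [run_som(query, seed=s) for s in range(num_seeds)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```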
Obviously this argument is frustratingly informal, and maybe I could become convinced that it can’t be strengthened, but I think I’d mostly be convinced by trying and failing, and it seems reasonably likely to me that we could succeed.
Paul seems to have another kind of argument for another kind of system in mind here (https://ai-alignment.com/aligned-search-366f983742e9), with a sketch of an argument at “I have a rough angle of attack in mind”. Obviously this isn’t an argument yet, but it seems worth looking into.
FWIW, Paul is thinking and writing about the kinds of problems you point out, e.g. in this post (https://ai-alignment.com/security-amplification-f4931419f903), this post (https://ai-alignment.com/reliability-amplification-a96efa115687), or this post (https://ai-alignment.com/implementing-our-considered-judgment-6c715a239b3e, search “virus” on that page). Not sure if his thoughts are helpful to you.
If you’re planning to follow up this post, I’d be most interested in whether you think it’s unlikely to be possible to design a process that we can be confident will avoid Sim drift. I’d also be interested to know if there are other approaches to alignment that seem more promising to you.