At MIRI we tend to take “we’re probably fine” as a strong indication that we’re not fine ;p
(I should remark that I don’t mean to speak “for MIRI” and probably should have phrased myself in a way which avoided generalizing across opinions at MIRI.)
Yeah I have been and continue to be confused by this perspective, at least as an empirical claim (as opposed to a normative one). I get the sense that it’s partly because optimization amplifies and so there is no “probably”, there is only one or the other. I can kinda see that when you assume an arbitrarily powerful AIXI-like superintelligence, but it seems basically wrong when you expect the AI system to apply optimization that’s not ridiculously far above that applied by a human.
I think I would agree with this if you said “optimization that’s at or below human level” rather than “not ridiculously far above”.
Humans can be terrifying. The prospect of a system slightly smarter than any human who has ever lived, with values that are just somewhat wrong, seems not great. In particular, this system could do subtle things resulting in longer-term value shift, influence alignment research to go down particular paths, etc. (I realize the hypothetical scenario has at least some safeguards, so I won’t go into more extreme scenarios like winning at politics hard enough to become world dictator and set the entire future path of humanity, etc. But I find this pretty plausible in a generic “moderately above human” scenario. Society rewards top performers disproportionately for small differences. Being slightly better than any human author could get you not only a fortune in book sales, but a hugely disproportionate influence. So it does seem to me like you’d need to be pretty sure of whatever safeguards you have in place, particularly given the possibility that you mis-estimate capability, and given the possibility that the system will improve its capabilities in ways you may not anticipate.)
But really, mainly, I was making the normative claim. A culture of safety is not one in which “it’s probably fine” is allowed as part of any real argument. Any time someone is tempted to say “it’s probably fine”, it should be replaced with an actual estimate of the probability, or a hopeful statement that combined with other research it could provide high enough confidence (with some specific sketch of what that other research would be), or something along those lines. You cannot build reliable knowledge out of many many “it’s probably fine” arguments; so at best you should carefully count how many you allow yourself.
A relevant empirical claim sitting behind this normative intuition is something like: “without such a culture of safety, humans have a tendency to slide into whatever they can get away with, rather than upholding safety standards”.
This all seems pretty closely related to Eliezer’s writing on security mindset.
Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback.
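To make the model concrete, here is a minimal sketch of the Boltzmann-rationality feedback model as it is standardly formulated in the preference-learning literature (the function name and the example numbers are just illustrative, not from this discussion):

```python
import math

def boltzmann_choice_probs(rewards, beta=1.0):
    """Boltzmann-rational choice model: the human picks option i with probability
    proportional to exp(beta * reward_i). beta -> infinity gives a perfectly
    rational chooser; beta = 0 gives uniformly random feedback."""
    m = max(beta * r for r in rewards)                  # subtract max for numerical stability
    exps = [math.exp(beta * r - m) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Example: a mildly noisy human choosing among options with true rewards 1.0, 0.5, 0.0
print(boltzmann_choice_probs([1.0, 0.5, 0.0], beta=2.0))  # roughly [0.67, 0.24, 0.09]
```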
This seems like a perfect example to me. It works pretty well for current systems, but scaled up, it seems it would ultimately reach dramatically wrong ideas about what humans value.
Yup, totally agree. Make sure to update it as you scale up further.
You said that you don’t think learning human values is a good target for “X”, so I worry that focusing on this will be a bit unfair to your perspective. But it’s also the most straightforward example, and we both seem to agree that it illustrates things we care about here. So I’m just going to lampshade the fact that I’ll use “human values” as an example a lot in what follows.
The point of this is to be able to handle feedback at arbitrary levels, not to require feedback at arbitrary levels.
What’s the point of handling feedback at high levels if we never actually get feedback at those levels?
I think what’s really interesting to me is making sure the system is reasoning at all those levels, because I have an intuition that that’s necessary (to get concepts we care about right). Accepting feedback at all those levels is a proxy. (I want to include it in my list of criteria because I don’t know a better way of operationalising “reasoning at all levels”, and also, because I don’t have a fixed meta-level at which I’d be happy to cap feedback. Capping meta-levels at something like 1K doesn’t seem like it would result in a better research agenda.)
Sort of like Bayes’ Law promises to let you update on anything. You wouldn’t travel back in time and tell Bayes “what’s the point of researching updating on anything, when in fact we only ever need to update on some relatively narrow set of propositions relating to the human senses?” It’s not a perfect analogy, but it gets at part of the point.
My basic claim is that we’ve seen the same sorts of problems occur at multiple meta-levels, and each time, it’s tempting to retreat to another meta-level. I therefore want a theory of (these particular sorts of) meta-levels, because it’s plausible to me that in such a context, we can solve the general problem rather than continue to push it back. Or at least, that it would provide tools to better understand the problem.
There’s a perspective in which “having a fixed maximum meta-level at all” is pretty directly part of the problem. So it’s natural to see if we can design systems which don’t have that property.
From this perspective, it seems like my response to your “incrementally improve loss functions as capability levels rise” perspective should be:
It seems like this would just be a move you’d eventually want to make, anyway.
At some point, you don’t want to keep designing safe policies by hand; you want to optimize them to minimize some loss function.
At some point, you don’t want to keep designing safe loss functions by hand; you want to do value learning.
At some point, you don’t want to keep inventing better and better value-learning loss functions by hand; you want to learn-to-learn.
At some point, you won’t want to keep pushing back meta-levels like this; you’ll want to do it automatically.
From this perspective, I’d just be looking ahead in the curve. Which is pretty much what I think I’m doing anyway.
So although the discussion of MIRI-style security mindset, and of just how approximately right safety concepts need to be, seems relevant, it might not be the crux.
Perhaps another way of framing it: suppose we found out that humans were basically unable to give feedback at level 6 or above. Are you now happy having the same proposal, but limited to depth 5? I get the sense that you wouldn’t be, but I can’t square that with “you only need to be able to handle feedback at high levels but you don’t require such feedback”.
This depends. There are scenarios where this would significantly change my mind.
But let’s suppose humans have trouble with 6 or above just because it’s hard to keep that many meta-levels in working memory. How would my proposal function in this world?
We want the system to extrapolate to the higher levels, figuring out what humans (implicitly) believe. But a consistent (and highly salient) extrapolation is the one which mimics human ineptitude at those higher levels. So we need to be careful about what we mean by “extrapolate”.
What we want the system to do is reason as if it received the feedback we would have given if we had more working memory (to the extent that we endorse “humans with more working memory” as better reasoners). My proposal is that a system should be taught to do exactly this.
This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.
Another way humans indirectly give evidence about higher levels is through their lower-level behavior. To some extent, we can infer, from a human applying a specific form of reasoning, that the human reflectively endorses that style of reasoning. This idea can be used to transfer information about level N to some information about level N+1. But the system should learn caution about this inference, by observing cases where it fails (cases where humans habitually reason in a particular way, but don’t endorse doing so), as well as by direct instruction.
Even if humans only ever provide feedback about finitely many meta levels, the idea is for the system to generalize to other levels.
I don’t super see how this happens but I could imagine it does. (And if it did it would answer my question above.) I feel like I would benefit from concrete examples with specific questions and their answers.
A lot of ideas apply to many meta-levels; EG, the above heuristic is an example of something which generalizes to many meta-levels. (It is true in general that you can make a probabilistic inference about level N+1 by supposing level-N activity is probably endorsed; and, factors influencing the accuracy of this heuristic probably generalize across levels. Applying this to human behavior might only get us examples at a few meta-levels, but the principle should also be applied to idealized humans, EG the model of humans with more working memory. So it can continue to bear fruit at many meta-levels, even when actual human feedback is not available.)
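To make the “probabilistic inference about level N+1” point concrete, here is a tiny numeric sketch (the probabilities are made up for illustration): observing that a human habitually reasons in a particular way raises, but does not settle, the probability that they reflectively endorse reasoning that way.

```python
# Illustrative numbers only: "the human reasons this way, so they probably
# endorse reasoning this way", treated as a defeasible Bayesian update.
p_endorse = 0.5                      # prior that the style is reflectively endorsed
p_use_given_endorse = 0.9            # endorsed styles are usually the ones used
p_use_given_not_endorse = 0.3        # ...but habit can produce unendorsed reasoning too

posterior = (p_use_given_endorse * p_endorse) / (
    p_use_given_endorse * p_endorse + p_use_given_not_endorse * (1 - p_endorse)
)
print(f"P(endorse | observed use) = {posterior:.2f}")   # 0.75: evidence, not proof
```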
Importantly, process-level feedback usually applies directly to all meta-levels. This isn’t a matter of generalization of feedback to multiple levels, but rather, direct feedback about reasoning which applies at all levels.
For example, humans might give feedback about how to do sensible probabilistic reasoning. This information could be useful to the system at many meta-levels. For example, it might end up forming a general heuristic that its value functions (at every meta-level) should be expectation functions which quantify uncertainty about important factors. (Or it could be taught this particular idea directly.)
More importantly, anti-daemon ideas would apply at every meta-level. Every meta-level would include inner-alignment checks as a heavily weighted part of its evaluation function, and at all levels, proposal distributions should heavily avoid problematic parts of the search space.
Okay, that makes sense. It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it’s particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.
In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it’s all part of what’s available “at the start”). In practice that would be intractable, but that’s the ideal which practical implementations should aim to approximate.
So, yes, it’s a problem, but it’s one that implementations should aim to mitigate.
(Part of how one might aim to mitigate this is to teach the system that it’s a good idea to try to approximate this ideal. But then it’s particularly important to introduce this idea early in training, to avoid the failure mode you mention; so the point stands.)
For example, Learning to summarize from human feedback does use Boltzmann rationality, but could finetune GPT-3 to e.g. interpret human instructions pragmatically. This interpretation system can apply “at all levels”, in the same way that human brains can apply similar heuristics “at all levels”.
(There are still issues with just applying the learning from human preferences approach, but they seem to be much more about “did the neural net really learn the intended concept” / inner alignment, rather than “the neural net learned what to do at level 1 but not at any of the higher levels”.)
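For concreteness, the Boltzmann-rationality assumption in that line of work shows up as a pairwise-comparison likelihood: the reward model is trained so that the human’s chosen output gets higher reward under a logistic (Boltzmann) choice model. A minimal sketch, not the paper’s actual code (beta is usually folded into the reward scale):

```python
import math

def comparison_logprob(reward_chosen, reward_rejected, beta=1.0):
    """Log-probability that a Boltzmann-rational labeler prefers `chosen` over
    `rejected`: log sigmoid(beta * (r_chosen - r_rejected)). Reward-model
    training maximizes this over a dataset of human comparisons."""
    z = beta * (reward_chosen - reward_rejected)
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))   # numerically stable log sigmoid(z)

# Example: the reward model currently rates the chosen summary 0.8 higher than the rejected one
print(comparison_logprob(1.3, 0.5))   # about -0.371, i.e. P(prefer chosen) is about 0.69
```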
It seems to me like you’re trying to illustrate something like “Abram’s proposal doesn’t get at the bottlenecks”.
I think it’s pretty plausible that this agenda doesn’t yield significant fruit with respect to several important alignment problems, and instead (at best) yields a scheme which would depend on other solutions to those particular problems.
It’s also plausible that those solutions would, themselves, be sufficient for alignment, rendering this research direction extraneous.
In particular, it’s plausible to me that iterated amplification schemes (including Paul’s schemes and mine) require a high level of meta-competence to get started, such that achieving that initial level of competence already requires a method of aligning superhuman intelligence, making anything else unnecessary. (This was one of Eliezer’s critiques of iterated amplification.)
However, the world you describe, in which alignment tech remains imperfect (wrt scaling) for a long time, but we can align successively more intelligent agents with successively refined tools, is not one of those worlds. In that world, it is possible to make incrementally more capable agents incrementally more perfectly aligned, until that point at which we have something smart (and aligned) enough to serve as the base case for an iterated amplification scheme. In that world, the scheme I describe could be just one of the levels of alignment tech which end up useful at some point.
One example of how to do this is to use X = “revert to a safe baseline policy outside of <whitelist>”, and enlarge the whitelist over time. In this case “failing to scale” is “our AI system couldn’t solve the task because our whitelist hobbled it too much”.
I’m curious how you see whitelisting working.
It feels like your beliefs about what kind of methods might work for “merely way-better-than-human” systems are a big difference between you and me, which might be worth discussing more, although I don’t know if it’s very central to everything else we’re discussing.
I think what’s really interesting to me is making sure the system is reasoning at all those levels, because I have an intuition that that’s necessary (to get concepts we care about right).
I’m super on board with this desideratum, and agree that it would not be a good move to change it to some fixed number of levels. I also agree that from a conceptual standpoint many ideas are “about all the levels”.
My questions / comments are about the implementation proposed in this post. I thought that you were identifying “levels of reasoning” with “depth in the idealized recursive QAS tree”; if that’s the case I don’t see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)
I’m pretty sure I’m just failing to understand some fact about the particular implementation, or what you mean by “levels of reasoning”, or its relation to the idealized recursive QAS tree.
This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.
I would argue this is also true of learning from human preferences (comparisons), amplification, and debate; not sure if you would disagree. I agree straight human imitation wouldn’t do this.
In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it’s all part of what’s available “at the start”).
Huh? I thought the point was that your initial feedback can help you interpret later feedback. So maybe you start with Boltzmann rationality, and then you get some feedback from humans, and now you realize that you should interpret all future feedback pragmatically.
It seems like you have to choose one of two options:
Order of feedback does matter, in which case bad early feedback can lock you in to a bad outcome
Order of feedback doesn’t matter, in which case you can’t improve your interpretation of feedback over time (at least, not in a consistent way)
(This seems true more generally for any system that aims to learn at all the levels, not just for the implementation proposed in this post.)
It seems to me like you’re trying to illustrate something like “Abram’s proposal doesn’t get at the bottlenecks”.
I think it’s more like “I’m not clear on the benefit of this proposal over (say) learning from comparisons”. I’m not asking about bottlenecks; I’m asking about what the improvement is.
I’m curious how you see whitelisting working.
The same way I see any other X working: we explicitly train the neural net to satisfy X through human feedback (perhaps using amplification, debate, learning the prior, etc). For a whitelist, we might be able to do something slightly different: we train a classifier to say whether the situation is or isn’t in our whitelist, and then only query the agent when it is in our whitelist (otherwise reverting to a safe baseline). The classifier and agent share most of their weights.
Then we also do a bunch of stuff to verify that the neural net actually satisfies X (perhaps adversarial training, testing, interpretability, etc). In the whitelisting case, we’d be doing this on the classifier, if that’s the route we went down.
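To make sure we’re picturing the same thing, here is a minimal sketch of that setup (my rendering; `shared_encoder`, `whitelist_head`, `policy_head`, and `safe_baseline` are hypothetical placeholders supplied by the caller, not a real API):

```python
def act(observation, shared_encoder, whitelist_head, policy_head, safe_baseline,
        threshold=0.5):
    """Query the agent only inside the whitelist; otherwise revert to the safe baseline.
    The whitelist classifier and the agent share most of their weights via `shared_encoder`."""
    features = shared_encoder(observation)
    p_in_whitelist = whitelist_head(features)      # classifier trained from human feedback
    if p_in_whitelist >= threshold:
        return policy_head(features)               # the agent acts inside the whitelist
    return safe_baseline(observation)              # safe fallback outside the whitelist
```

Enlarging the whitelist over time would then correspond to retraining (and re-verifying) the classifier as capabilities grow.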
It feels like your beliefs about what kind of methods might work for “merely way-better-than-human” systems are a big difference between you and me, which might be worth discussing more, although I don’t know if it’s very central to everything else we’re discussing.
My questions / comments are about the implementation proposed in this post. I thought that you were identifying “levels of reasoning” with “depth in the idealized recursive QAS tree”; if that’s the case I don’t see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)
I’m pretty sure I’m just failing to understand some fact about the particular implementation, or what you mean by “levels of reasoning”, or its relation to the idealized recursive QAS tree.
OK. Looking back, the post really doesn’t address this, so I can understand why you’re confused.
My basic argument for cross-level generalization is that a QAS has to be represented compactly while being prepared to answer questions at any level; so, it has to generalize across levels. But there are also other effects.
So, suppose I give the system feedback about some specific 3rd-level judgement. The way I imagine this happening is that the feedback gets added to a big dataset. Evaluating QASs on this dataset is part of how the initial value function, Hv, does its thing. Hv should also prefer QASs which produce value functions which are pretty similar to Hv, so that this property is approximately preserved as the system gets amplified. So, a few things happen:
The feedback is added to the dataset, so it is used to judge the next generation of QASs (really, the next learned distribution over QASs), which will therefore avoid doing poorly on this 3rd-level judgement.
This creates some cross-level generalization, because the QASs which perform poorly on this probably do so for reasons not isolated to 3rd-level judgments. In NN terms, there are shared hidden neurons which serve multiple different levels. In algorithmic information theory, there is mutual information between levels, so programs which do well will share information across levels rather than represent them all separately.
The feedback is also used as an example of how to judge (ie the fourth-level skill which would be able to generate the specific 3rd-level feedback). This also constrains the next generation of QASs, and so similarly has a cross-level generalization effect, due to shared information in the QAS representation (eg multi-level neurons, bits of code relevant across multiple levels, etc).
Similarly, this provides indirect evidence about 5th level, 6th level, etc because just as the 4th level needs to be such that it could have generated the 3rd-level feedback, the 5th level needs to be such that it would approve of such 4th levels, the 6th level needs to approve of 5th levels with that property, and so on.
So, as you can see, feedback at one level propagates information to all the other levels along many pathways.
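Here is a toy rendering of those pathways (my own illustration, not code from the post; `qas_answer` and `qas_judge` stand in for the relevant components of a single, compactly represented QAS):

```python
def hv_score(qas_answer, qas_judge, feedback_pool):
    """Schematic Hv evaluation: one piece of level-3 feedback constrains the QAS
    along more than one pathway.
    qas_answer(level, question) -> answer; qas_judge(level, answer) -> bool."""
    score = 0.0
    for level, question, approved_answer in feedback_pool:   # e.g. level = 3
        # Pathway 1: the QAS's own answer at that level should match the feedback.
        if qas_answer(level, question) == approved_answer:
            score += 1.0
        # Pathway 2: the feedback is also an example of how to judge, so the
        # level-4 component should endorse the approved answer (and, implicitly,
        # level 5 should endorse level-4 judges with this property, and so on).
        if qas_judge(level + 1, approved_answer):
            score += 1.0
    return score
```

Because both terms are scored against one shared representation, doing well on them pushes toward cross-level generalization rather than storing each level separately.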
>This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.
I would argue this is also true of learning from human preferences (comparisons), amplification, and debate; not sure if you would disagree. I agree straight human imitation wouldn’t do this.
I would argue it’s not true of the first; learning human preferences will fail to account for ways humans agree human preference judgments are error-prone (eg the story about how judges judge more harshly right before lunch).
As for iterated amplification, it definitely has this property “in spirit” (ie if everything works well), but whether particular versions have this property is another question. Specifically, it’s possible to ask “how should I answer questions like this?” and such meta-questions, to try to get debiasing information before coming up with a strategy to answer a question. However, it’s up to the human in the box to come up with these strategies, and you can’t go meta too much without going into an infinite loop. And the human in the box also has to have a good strategy for searching for this kind of meta-info.
>In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it’s all part of what’s available “at the start”).
Huh? I thought the point was that your initial feedback can help you interpret later feedback. So maybe you start with Boltzmann rationality, and then you get some feedback from humans, and now you realize that you should interpret all future feedback pragmatically.
Every piece of feedback gets put into the same big pool which helps define Hv, the initial (“human”) value function. Subsequent value functions also look at this big pool. So, in the hypothetical where we re-start the training every time we give feedback:
First, the initial naive interpretation is used, on every piece of feedback ever. This helps define D1, the first learned distribution on QASs.
Then, D1 uses its new, slightly refined interpretation of all the feedback to form new judgments of QAS quality, which help define D2.
We keep iterating like this, getting better interpretations of feedback, which we use to generate even better interpretations. We do this until we reach some stopping point, which might depend on safety concerns (eg stopping while we’re still confident it has not drifted too much).
We then interact with the resulting system, generating more feedback for a while, until we have produced enough feedback that we want to re-start the process again.
This procedure ensures that the system doesn’t outright ignore any feedback due to overconfidence (because all feedback is used by Hv every restart), while also ensuring that the most sophisticated model is (eventually) used to interpret feedback. The result (if you iterate to convergence) is a fixed-point where the distribution Dn would reproduce itself, so in a significant sense, the end result is as if you used the most sophisticated feedback-interpretation model from the beginning. At the same time, what you actually use at the beginning is the naive feedback interpretation model, which gives us the guarantee that EG if you stomp out a self-aggrandizing mental pattern (which would pointedly ignore feedback against itself), it actually gets stomped out.
That’s the ideal I’d shoot for.
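In rough pseudocode (my rendering of the procedure described above; `naive_interpret`, `fit_distribution`, `refine_interpret`, and `interact` are hypothetical stand-ins for the corresponding pieces of the proposal):

```python
def train_with_restarts(feedback_pool, naive_interpret, fit_distribution,
                        refine_interpret, interact, n_restarts=3, n_inner=10):
    """Each restart re-learns everything from the full feedback pool, starting
    from the naive interpretation and iterating toward a fixed point."""
    dist = None
    for _ in range(n_restarts):
        interpret = naive_interpret                            # Hv's naive reading of feedback
        for _ in range(n_inner):                               # D1, D2, ... until a stopping point
            dist = fit_distribution(interpret(feedback_pool))  # uses ALL feedback so far
            interpret = refine_interpret(dist)                 # the new distribution reads feedback better
        feedback_pool = feedback_pool + interact(dist)         # gather new feedback, then restart
    return dist
```

The order-independence comes from the outer loop always re-reading the whole pool from scratch; the improved interpretation comes from the inner loop.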
Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there’s a good chance if I reread the post and all the comments I’d object again / get confused somehow). One thing though:
Every piece of feedback gets put into the same big pool which helps define Hv, the initial (“human”) value function. [...]
Okay, I think with this elaboration I stand by what I originally said:
It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it’s particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
Okay, I think with this elaboration I stand by what I originally said
You mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)?
Because I think this is pretty solidly wrong for the system that restarts.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
All feedback so far determines the new D1 when the system restarts training.
(Again, I’m not saying it’s feasible to restart training all the time, I’m just using it as a proof-of-concept to show that we’re not fundamentally forced to make a trade-off between (a) order independence and (b) using the best model to interpret feedback.)
I continue not to understand this, but it seems like such a simple question that there must just be some deeper misunderstanding of the exact proposal we’re now debating. It seems not particularly worth it to find this misunderstanding; I don’t think it will really teach us anything conceptually new.
(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)
Responding first to the general approach to good-enough alignment:
I think I would agree with this if you said “optimization that’s at or below human level” rather than “not ridiculously far above”.
Humans can be terrifying. The prospect of a system slightly smarter than any human who has ever lived, with values that are just somewhat wrong, seems not great.
Less important response: If by “not great” you mean “existentially risky”, then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.
My real objection: Your claim is about what happens after you’ve already failed, in some sense—you’re starting from the assumption that you’ve deployed a misaligned agent. From my perspective, you need to start from a story in which we’re designing an AI system that will eventually have, let’s say, “5x the intelligence of a human”, whatever that means, but we get to train that system however we want. We can inspect its thought patterns, spend lots of time evaluating its decisions, test what it would do in hypothetical situations, use earlier iterations of the tool to help understand later iterations, etc. My claim is that whatever bad optimization “sneaks through” this design process is probably not going to have much impact on the agent’s performance, or we would have already caught it.
Possibly related: I don’t like thinking of this in terms of how “wrong” the values are, because that doesn’t allow you to make distinctions about whether behaviors have already been seen at training or not.
But really, mainly, I was making the normative claim. A culture of safety is not one in which “it’s probably fine” is allowed as part of any real argument. Any time someone is tempted to say “it’s probably fine”, it should be replaced with an actual estimate of the probability, or a hopeful statement that combined with other research it could provide high enough confidence (with some specific sketch of what that other research would be), or something along those lines. You cannot build reliable knowledge out of many many “it’s probably fine” arguments; so at best you should carefully count how many you allow yourself.
A relevant empirical claim sitting behind this normative intuition is something like: “without such a culture of safety, humans have a tendency to slide into whatever they can get away with, rather than upholding safety standards”.
If your claim is just that “we’re probably fine” is not enough evidence for an argument, I certainly agree with that. That was an offhand remark in an opinion in a newsletter where words are at a premium; I obviously hope to do better than that in reality.
This all seems pretty closely related to Eliezer’s writing on security mindset.
Some thoughts here:
I am unconvinced that we need a solution that satisfies a security-mindset perspective, rather than one that satisfies an ordinary-paranoia perspective. (A crucial point here is that the goal is not to build adversarial optimizers in the first place, rather than defending against adversarial optimization.) As far as I can tell the argument for this claim is… a few fictional parables? (Readers: Before I get flooded with examples of failures where security mindset could have helped, let me note that I will probably not be convinced by this unless you can also account for the selection bias in those examples.)
I don’t really see why the ML-based approaches don’t satisfy the requirement of being based on security mindset. (I agree “we’re probably fine” does not satisfy that requirement.) Note that there isn’t a solution that is maximally security-mindset-y, the way I understand the phrase (while still building superintelligent systems). A simple argument: we always have to specify something (code if nothing else); that something could be misspecified. So here I’m just claiming that ML-based approaches seem like they can be “sufficiently” security-mindset-y.
I might be completely misunderstanding the point Eliezer is trying to make, because it’s stated as a metaphor / parable instead of just stating the thing directly (and a clear and obvious disanalogy is that we are dealing with the construction of optimizers, rather than the construction of artifacts that must function in the presence of optimization).
This seems like a pretty big disagreement, which I don’t expect to properly address with this comment. However, it seems a shame not to try to make any progress on it, so here are some remarks.
Less important response: If by “not great” you mean “existentially risky”, then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.
My answer to this would be, mainly because they weren’t living in times as risky as ours; for example, they were not born and raised in a literal AGI lab (which the hypothetical system would be).
My real objection: Your claim is about what happens after you’ve already failed, in some sense—you’re starting from the assumption that you’ve deployed a misaligned agent. From my perspective, you need to start from a story in which we’re designing an AI system that will eventually have, let’s say, “5x the intelligence of a human”, whatever that means, but we get to train that system however we want.
The scenario we were discussing was one where robustness to scale is ignored as a criterion, so my concern is that the system turns out more intelligent than expected, and hence, tools like EG asking earlier iterations of the same system to help examine the cognition may fail. If you’re pretty confident that your alignment strategy is sufficient for 5x human, then you have to be pretty confident that the system is indeed 5x human. This can be difficult due to the difference between task performance and the intelligence of inner optimisers. For example, GPT-3 can mimic humans moderately well (very impressive by today’s standards, obviously, but moderately well in the grand scope of things). However, it can mimic a variety of humans, in a way that’s in some sense much better than any one human. This makes it obvious that GPT-3 is smarter than it lets on; it’s “playing dumb”. Presumably this is what led Ajeya to predict that GPT-3 can offer better medical advice than any doctor (if only we could get it to stop playing dumb).
So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.
If I was convinced of that, my next question would be how we can be that confident that we have 5x-appropriate tools. Part of why I tend toward “robustness to scale” is that it seems difficult to make strong scale-dependent arguments, except at the scales we can empirically investigate (so not very useful for scaling up to 5x human, until the point at which we can safely experiment at that level, at which point we must have solved the safety problem at that level in other ways). But OTOH you’re right that it’s hard to make strong scale-independent arguments, too. So this isn’t as important to the crux.
Possibly related: I don’t like thinking of this in terms of how “wrong” the values are, because that doesn’t allow you to make distinctions about whether behaviors have already been seen at training or not.
Right, I agree that it’s a potentially misleading framing, particularly in a context where we’re already discussing stuff like process-level feedback.
So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.
This makes sense, though I probably shouldn’t have used “5x” as my number—it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like “we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn’t depend significantly on the current compute / capacity / data”.