For example, lots of discussion of IRL and value learning seems to presuppose that we’re writing code that tells the AGI specifically how to model a human. To pick a random example, in Vanessa Kosoy’s 2018 research agenda, the “demonstration” and “learning by teaching” ideas seem to rely on being able to do that; I don’t see how we could possibly do those things if the whole world-model is a bunch of unlabeled patterns in patterns in patterns in sensory input etc.
We can at least try doing those things by just having specific channels through which human actions enter the system. For example, maybe it’s enough to focus on what the human posts on Facebook, so the AI just needs to look at that. The problem with this is, it leaves us open to attack vectors in which the channel in question is hijacked. On the other hand, even if we had a robust way to point to the human brain, we would still have attack vectors in which the human themself gets “hacked” somehow.
In principle, I can imagine solving these problems by somehow having a robust definition of “unhacked human”, which is what you’re going for, I think. But there might be a different type of solution in which we just avoid entering “corrupt” states in which the content of the channel diverges from what we intended. For example, this might be achievable by quantilizing imitation.
Thanks!!! After reading your comment and thinking about it more, here’s where I’m at:
Your “demonstration” thing was described as “The [AI] observes a human pursuing eir values and deduces the values from the behavior.”
When I read that, I was visualizing a robot and a human standing in a room, and the human is cooking, and the robot is watching the human and figuring out what the human is trying to do. And I was thinking that there needs to be some extra story for how that works, assuming that the robot has come to understand the world by building a giant unlabeled Bayes net world-model, and that it processes new visual inputs by slotting them into that model. (And that’s my normal assumption, since that’s how I think the neocortex works, and therefore that’s a plausible way that people might build AGI, and it’s the one I’m mainly focused on.)
So as the robot is watching the human soak lentils, the thing going on in its head is: “Pattern 957823, and Pattern 5672928, and Pattern 657192, and…”. In order to have the robot assign a special status to the human’s deliberate actions, we would need to find “the human’s deliberate actions” somewhere in the unlabeled world-model, i.e. solve a symbol-grounding problem, and doing so reliably is not straightforward.
However, maybe I was visualizing the wrong thing, with the robot and human in the room. Maybe I should have instead been visualizing a human using a computer via its keyboard. Then the AI can have a special input channel for the keystrokes that the human types. And every single one of those keystrokes is automatically treated as “the human’s deliberate action”. This seems to avoid the symbol-grounding problem I mentioned above. And if there’s a special input channel, we can use supervised learning to build a probabilistic model of that input channel. (I definitely think this step is compatible with the neocortical algorithm.) So now we have a human policy, i.e., what the AI thinks the human would do next, in any real or imagined circumstance, at least in terms of which keystrokes they would type. I’m still a bit hazy on what happens next in the plan, i.e., getting from that probabilistic model to the more abstract “what the human wants”. At least in general. And a big part of that is, again, symbol-grounding: as soon as we step away from the concrete predictions coming out of the “human keystroke probabilistic model”, we’re up in the land of “World-model Pattern #8673028” etc. where we can’t really do anything useful. (I do see how the rest of the plan could work along these lines, where we install a second special human-to-AI information channel where the human says how things are going, and the AI builds a predictive model of that too, and then we predict-and-quantilize from the human policy.†)
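To make the “special input channel plus supervised learning” step concrete, here is a minimal sketch. It assumes a toy bigram model with add-one smoothing, which is far simpler than anything the discussion has in mind; the class name `KeystrokeModel` and its methods are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


class KeystrokeModel:
    """Toy bigram model of a keystroke channel (illustrative assumption only)."""

    def __init__(self, alphabet):
        self.alphabet = sorted(set(alphabet))
        # counts[prev][nxt] = how often key `nxt` followed key `prev`
        self.counts = defaultdict(Counter)

    def fit(self, keystrokes):
        """Supervised learning: each key is the label for the key before it."""
        for prev, nxt in zip(keystrokes, keystrokes[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Return P(next key | previous key) with add-one smoothing."""
        row = self.counts[prev]
        total = sum(row.values()) + len(self.alphabet)
        return {k: (row[k] + 1) / total for k in self.alphabet}


model = KeystrokeModel("abc")
model.fit("abababac")
probs = model.predict("a")  # distribution over the next keystroke after "a"
```

The point of the sketch is only that every symbol on the channel is a labeled training target by construction, so no symbol-grounding is needed to identify “the human’s action”. Everything beyond the channel (what the keystrokes are about) is still unlabeled.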
It’s still worth noting that I, Steve, personally can be standing in a room with another human H2, watching them cook, and I can figure out what H2 is trying to do. And if H2 is someone I really admire, I will automatically start wanting to do the things that H2 is trying to do. So human social instincts do seem to have a way forward through the symbol-grounding path above, and not through the special-input-channel path, and I continue to think that this symbol-grounding method has something to do with empathetic simulation, but I’m hazy on the details, and I continue to think that it would be very good to understand better how exactly it works.
†This does seem to be a nice end-to-end story by the way. So have we solved the alignment problem? No… You mention channel corruption as a concern, and it is, but I’m even more concerned about this kind of design hitting a capabilities wall dramatically earlier than unsafe AGIs would. Specifically, I think it’s important that an AGI be able to do things like “come up with a new way to conceptualize the alignment problem”, and I think doing those things requires goal-seeking-RL-type exploration (e.g. exploring different possible mathematical formalizations or whatever) within a space of mental “actions” none of which it has ever seen a human take. I don’t think that this kind of AGI approach would be able to do that, but I could be wrong. That’s another reason that I’m hoping something good will come out of the symbol-grounding path informed by how human social instincts work.
I’m still a bit hazy on what happens next in the plan—i.e., getting from that probabilistic model to the more abstract “what the human wants”.
Well, one thing you could try is using the AIT definition of goal-directedness to go from the policy to the utility function. However, in general it might require knowledge of the human’s counterfactual behavior which the AI doesn’t have. Maybe there are some natural assumptions under which it is possible, but it’s not clear.
It’s still worth noting that I, Steve, personally can be standing in a room with another human H2, watching them cook, and I can figure out what H2 is trying to do.
I feel the appeal of this intuition, but on the other hand, it might be a much easier problem since both of you are humans doing fairly “normal” human things. It is less obvious you would be able to watch something completely alien and unambiguously figure out what it’s trying to do.
....I’m even more concerned about this kind of design hitting a capabilities wall dramatically earlier than unsafe AGIs would.
To first approximation, it is enough for the AI to be more capable than us, since, whatever different solution we might come up with, an AI which is more capable than us would come up with a solution at least as good. Quantilizing from an imitation baseline seems like it should achieve that, since the baseline is “as capable as us” and arguably quantilization would produce significant improvement over that.
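For readers who want the mechanics of “quantilizing from an imitation baseline”, here is a minimal sampling-based sketch. It assumes the usual informal description of a q-quantilizer (sample from the base policy, keep the top q fraction by some utility estimate, pick uniformly among those); it is not the formal construction from the research agenda. The names `quantilize` and `human_policy` and the toy utility are hypothetical.

```python
import random


def quantilize(base_sampler, utility, q, n_samples, rng):
    """Draw n_samples actions from the base policy, keep the top q fraction
    by utility, and return one of the kept actions uniformly at random."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    keep = max(1, int(q * n_samples))
    return rng.choice(samples[:keep])


# Hypothetical stand-ins: the "human" picks an integer 0..9 uniformly,
# and the proxy utility simply prefers larger numbers.
def human_policy(rng):
    return rng.randrange(10)


rng = random.Random(0)
picks = [
    quantilize(human_policy, lambda a: a, q=0.1, n_samples=200, rng=rng)
    for _ in range(50)
]
```

Every returned action is one the base policy actually produced, which is the sense in which the result stays close to imitation; the improvement over the baseline comes only from selecting within that set.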
Specifically, I think it’s important that an AGI be able to do things like “come up with a new way to conceptualize the alignment problem”, and I think doing those things requires goal-seeking-RL-type exploration (e.g. exploring different possible mathematical formalizations or whatever) within a space of mental “actions” none of which it has ever seen a human take.
Instead of “actions the AI has seen a human take”, a better way to think about it is “actions the AI can confidently predict a human could take (with sufficient probability)”.
Thanks again for your very helpful response! I thought about the quantilization thing more, let me try again.
As background, to a first approximation, let’s say 5 times per second I (a human) “think a thought”. That involves a pair of two things:
(Possibly) update my world-model
(Possibly) take an action—in this case, type a key at the keyboard
Of these two things, the first one is especially important, because that’s where things get “figured out”. (Imagine staring into space while thinking about something.)
OK, now back to the AI. I can broadly imagine two strategies for a quantilization approach:
Build a model of the human policy from a superior epistemic vantage point: So here we give the AI its own world-model that needn’t have anything to do with the human’s, and likewise allow the AI to update its world-model in a way that needn’t have anything to do with how the human updates their world model. Then the AI leverages its superior world-model in the course of learning and quantilizing the human policy (maybe just the action part of the policy, or maybe both the actions and the world-model-updates, it doesn’t matter for the moment).
Straightforward human imitation: Here, we try to get to a place where the AI is learning about the world and figuring things out in a (quantilized) human-like way. So we want the AI to sample from the human policy for “taking an action”, and we want the AI to sample from the human policy for “updating the world-model”. And the AI doesn’t know anything about the world beyond what it learns through those quantilized-human-like world-model updates.
Start with the first one. If the AI is going to get to a superior epistemic vantage point, then it needs to “figure things out” about the world and concepts and so on, and as I said before, I think “figuring things out” requires goal-seeking-RL-type exploration (e.g. exploring different possible mathematical formalizations or whatever) within a space of mental “actions”. So we still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it. And since this is not a human-imitating system, we can’t fall back on that. So this doesn’t seem like we made much progress on the problem.
For the second one, well, I think I’m kinda more excited about this one.
Naively, it does seem hard though. Recall that in this approach we need to imitate both aspects of the human policy—plausibly-human actions, and plausibly-human world-model-updates. This seems hard, because the AI only sees the human’s actions, not its world-model updates. Can it infer the latter? I’m a bit pessimistic here, at least by default. Well, I’m optimistic that you can infer an underlying world-model from actions—based on e.g. GPT-3. But here, we’re not merely hoping to learn a snapshot of the human model, but also to learn all the human’s model-update steps. Intuitively, even when a human is talking to another human, it’s awfully hard to communicate the sequence of thoughts that led you to come up with an idea. Heck, it’s hard enough to understand how I myself figured something out, when it was in my own head five seconds ago. Another way to think about it is, you need a lot of data to constrain a world-model snapshot. So to constrain a world-model change, you presumably need a lot of data before the change, and a lot of data after the change. But “a lot of data” involves an extended period of time, which means there are thousands of sequential world-model changes all piled on top of each other, so it’s not a clean comparison.
A couple things that might help are (A) Giving the human a Kernel Flow or whatever and letting the AI access the data, and (B) Helping the inductive bias by running the AI on the same type of world-model data structure and inference algorithm as the human, and having it edit the model to get towards a place where its model and thought process exactly match the human’s.
So that’s neat. But we don’t have a superior epistemic vantage point anymore. So how do we quantilize? I figure, we can use some form of amplification: most simply, run the model at superhuman speeds so that it can “think longer” than the human on a given task. Or roll out different possible trains of thought in parallel, and rank how well they turn out. Or something. But I feel like once we’re doing all that stuff, we can just throw out the quantilization part of the story, and instead our safety story can be that we’re starting with a deeply-human-like model and not straying too far from it, so hopefully it will remain well-behaved. That was my (non-quantilization) story here.
Sorry if I’m still confused; I’m very interested in your take, if you’re not sick of this discussion yet. :)
I gave a formal mathematical definition of (idealized) HDTL, so the answer to your question should probably be contained there. But I’m not entirely sure what it is since I don’t entirely understand the question.
The AI has a “superior epistemic vantage point” in the sense that, the prior ζ is richer than the prior that humans have. But, why do we “still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it”? The objective is fully specified.
A possible interpretation of your argument: a powerful AI would have to do something like TRL and access to the “envelope” computer can be unsafe in itself, because of possible side effects. That’s truly a serious problem! Essentially, it’s non-Cartesian daemons.
Atm I don’t have an extremely good solution to non-Cartesian daemons. Homomorphic cryptography can arguably solve it, but there’s large overhead. Possibly we can make do with some kind of obfuscation instead. Another vague idea I have is, make the AI avoid running computations which have side-effects predictable by the AI. In any case, more work is needed.
Recall that in this approach we need to imitate both aspects of the human policy—plausibly-human actions, and plausibly-human world-model-updates. This seems hard, because the AI only sees the human’s actions, not its world-model updates.
I don’t see why it is especially hard; it seems just like any system with unobservable degrees of freedom, which covers just about anything in the real world. So I would expect an AI with transformative capability to be able to do it. But maybe I’m just misunderstanding what you mean by this “approach number 2”. Perhaps you’re saying that it’s not enough to accurately predict the human actions, we need to have accurate pointers to particular gears inside the model. But I don’t think we do (maybe it’s because I’m following approach number 1).
why do we “still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it”? The objective is fully specified…
Thanks, that was a helpful comment. I think we’re making progress, or at least I’m learning a lot here. :)
I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous.
So for example, let’s talk about some useful non-obvious concept, like “informational entropy”. And let’s suppose that our AI cannot learn the concept of “informational entropy” from humans, because we’re in an alternate universe where humans haven’t yet invented the concept of informational entropy. (Or replace “informational entropy” with “some important not-yet-discovered concept in AI alignment”.)
In that case, I see three possibilities.
First, the AI never winds up “knowing about” informational entropy or anything equivalent to it, and consequently makes worse predictions about various domains (human scientific and technological progress, the performance of certain algorithms and communications protocols, etc.)
Second (I think this is your model?): the AI’s prior has a combinatorial explosion with every possible way of conceptualizing the world, of which an astronomically small proportion are actually correct and useful. With enough data, the AI settles into a useful conceptualization of the world, including some sub-network in its latent space that’s equivalent to informational entropy. In other words: it “discovers” informational entropy by dumb process of elimination.
Third (this is my model): we get a prior by running a “prior-building AI”. This prior-building AI has “agency”; it “actively” learns how the world works, by directing its attention etc. It has curiosity and instrumental reasoning and planning and so on, and it gradually learns instrumentally-useful metacognitive strategies, like a habit of noticing and attending to important and unexplained and suggestive patterns, and good intuitions around how to find useful new concepts, etc. At some point it notices some interesting and relevant patterns, attends to them, and after a few minutes of trial-and-error exploration it eventually invents the concept of informational entropy. This new concept (and its web of implications) then gets incorporated into the AI’s new “priors” going forward, allowing the AI to make better predictions and formulate better plans in the future, and to discover yet more predictively-useful concepts, etc. OK, now we let this “prior-building AI” run and run, building an ever-better “prior” (a.k.a. “world-model”). And then at some point we can turn this AI off, and export this “prior” into some other AI algorithm. (Alternatively, we could also more simply just have one AI which is both the “prior-building AI” and the AI that does, um, whatever we want our AIs to do.)
It seems pretty clear to me that the third approach is way more dangerous than the second. In particular, the third one is explicitly doing instrumental planning and metacognition, which seem like the same kinds of activities that could lead to the idea of seizing control of the off-switch etc.
However, my hypothesis is that the third approach can get us to human-level intelligence (or what I was calling a “superior epistemic vantage point”) in practice, and that the other approaches can’t.
So, I was thinking about the third approach—and that’s why I said “we still have the whole AGI alignment / control problem” (i.e., aligning and controlling the “prior-building AI”). Does that help?
I think the confusion here comes from mixing algorithms with desiderata. HDTL is not an algorithm, it is a type of desideratum that an algorithm can satisfy. “the AI’s prior has a combinatorial explosion” is true but “dumb process of elimination” is false. A powerful AI has to have a very rich space of hypotheses it can learn. But this doesn’t mean this space of hypotheses is explicitly stored in its memory or anything of the sort (which would be infeasible). It only means that the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally (which might correspond to refinement in the infra-Bayesian sense).
My thesis here is that if the AI satisfies a (carefully fleshed out in much more detail) version of the HDTL desideratum, then it is safe and capable. How to make an efficient algorithm that satisfies such a desideratum is another question, but that’s a question from a somewhat different domain: specifically the domain of developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms. I see the latter effort as to first approximation orthogonal to the effort of finding good formal desiderata for safe TAI (and, it also receives plenty of attention from outside the existential safety community).
In the grandparent comment I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a “prior-building AI” with “agency” that is “trying” to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential “prior-building AI” that doesn’t also incidentally “try” to seize control of its off-switch.
Then in the parent comment you replied (IIUC) that if this is a problem at all, it’s not the problem you’re trying to solve (i.e. “finding good formal desiderata for safe TAI”), but a different problem (i.e. “developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms”), and my problem is “to a first approximation orthogonal” to your problem, and my problem “receives plenty of attention from outside the existential safety community”.
If so, my responses would be:
Obviously the problem of “make an agential “prior-building AI” that doesn’t try to seize control of its off-switch” is being worked on almost exclusively by x-risk people. :-P
I suspect that the problem doesn’t decompose the way you imply; instead I think that if we develop techniques for building a safe agential “prior-building AI”, we would find that similar techniques enable us to build a safe non-manipulative-question-answering AI / oracle AI / helper AI / whatever.
Even if that’s not true, I would still say that if we can make a safe agential “prior-building AI” that gets to human-level predictive ability and beyond, then we’ve solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those “learning algorithms with strong formal guarantees” will turn out not to exist, at least not at human-level capabilities.
I guess, when I read “learning algorithms with strong formal guarantees”, I’m imagining something like multi-armed bandit algorithms that have regret bounds. But I’m having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don’t know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
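As a reference point for “multi-armed bandit algorithms that have regret bounds”, here is a minimal sketch of UCB1 on Bernoulli arms, a standard example whose expected regret is known to grow only logarithmically in the horizon. The arm means, horizon, and function name `ucb1` are hypothetical choices for illustration, and the simulation measures pseudo-regret on one run rather than proving anything.

```python
import math
import random


def ucb1(arm_means, horizon, rng):
    """Run UCB1 on Bernoulli arms; return (pull counts, pseudo-regret)."""
    n_arms = len(arm_means)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once to initialise the estimates
        else:
            # empirical mean plus an exploration bonus that shrinks with pulls
            arm = max(
                range(n_arms),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    best = max(arm_means)
    # pseudo-regret: expected reward lost by not always pulling the best arm
    regret = sum(n * (best - m) for n, m in zip(pulls, arm_means))
    return pulls, regret


pulls, regret = ucb1([0.2, 0.5, 0.8], horizon=5000, rng=random.Random(0))
```

Note what makes the guarantee possible here: the hypothesis space (a fixed number of arms with fixed means) is specified in advance. The worry in the paragraph above is about settings where no such advance specification of the “concepts” is available.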
Obviously the problem of “make an agential “prior-building AI” that doesn’t try to seize control of its off-switch” is being worked on almost exclusively by x-risk people.
Umm, obviously I did not claim it isn’t. I just decomposed the original problem in a different way that didn’t single out this part.
...if we can make a safe agential “prior-building AI” that gets to human-level predictive ability and beyond, then we’ve solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.
Maybe? I’m not quite sure what you mean by “prior building AI” and whether it’s even possible to apply a “microscope” to something superhuman, or that this approach is easier than other approaches, but I’m not necessarily ruling it out.
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those “learning algorithms with strong formal guarantees” will turn out not to exist, at least not at human-level capabilities.
That’s where our major disagreement is, I think. I see human brains as evidence such algorithms exist and deep learning as additional evidence. We know that powerful learning algorithms exist. We know that no single algorithm can learn everything (no free lunch). What we need is a mathematical description of the space of hypotheses these algorithms are good at, and associated performance bounds. The enormous generality of these algorithms suggests that there probably is such a simple description.
...I’m having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don’t know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
I don’t understand your argument here. When I prove a theorem that “for all x: P(x)”, I don’t need to be able to imagine every possible value of x. That’s the power of abstraction. To give a different example, the programmers of AlphaGo could not possibly anticipate all the strategies it came up with or all the life and death patterns it discovered. That wasn’t a problem for them either.
You wrote earlier: “the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally”.
My claim is that good-enough algorithms for “adding more and more detail incrementally” will also incidentally (by default) be algorithms that seize control of their off-switches.
And the reason I put a lot of weight on this claim is that I think the best algorithms for “adding more and more detail incrementally” may be algorithms that are (loosely speaking) “trying” to understand and/or predict things, including via metacognition and instrumental reasoning.
OK, then the way I’m currently imagining you responding to that would be:
My model of Vanessa: We’re hopefully gonna find a learning algorithm with a provable regret bound (or something like that). Since seizing control of the off-switch would be very bad according to the objective function and thus violate the regret bound, and since we proved the regret bound, we conclude that the learning algorithm won’t seize control of the off-switch.
(If that’s not the kind of argument you have in mind, oops sorry!)
Otherwise: I feel like that’s akin to putting “the AGI will be safe” as a desideratum, which pushes “solve AGI safety” onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That’s perfectly fine, and indeed precisely defining “safe” is very useful. It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem. (Also, if we divide the problem this way, then “we can’t find a provably-safe AGI design” would be re-cast as “no human-level learning algorithms satisfy the desiderata”.)
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like “I can prove that AlphaGo plays go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren’t very good at playing Go.” If you weren’t thinking of performance guarantees, what “formal guarantees” are you thinking of?
(For what little it’s worth, I’d be a bit surprised if we get a safety guarantee via a performance guarantee. It strikes me as more promising to reason about safety directly—e.g. “this algorithm won’t seize control of the off-switch because blah blah incentives blah blah mesa-optimizers blah blah”.)
It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem.
I never said it’s not a safety problem. I only said that a lot of progress on this can come from research that is not very “safety specific”. I would certainly work on it if “precisely defining safe” was already solved.
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge.
Yes, we don’t have these things. That doesn’t mean these things don’t exist. Surely all research is about going from “not having” things to “having” things? (Strictly speaking, it would be very hard to literally have a performance guarantee about the brain since the brain doesn’t have to be anything like a “clean” implementation of a particular algorithm. But that’s beside the point.)
Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled “Lemma 1: This learning algorithm will be aligned”.
The reason I think that is that (as above) I expect the learning algorithms in question to be kinda “agential”, and if an “agential” algorithm is not “trying” to perform well on the objective, then it probably won’t perform well on the objective! :-)
If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we’ve already solved the alignment problem. So the performance guarantee would be beside the point (on this view).
I don’t understand what Lemma 1 is if it’s not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.
Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as “goals”, and then makes plans to advance those “goals”. (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is “aligned” if the things flagged as “goals” do in fact correspond to maximizing the objective function (e.g. “predict the human’s outputs”), or at least it’s as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.
Making that definition better and rigorous would be tricky because it’s hard to talk rigorously about symbol-grounding, but maybe it’s not impossible. And if so, I would say that this is a definition of “aligned” which looks nothing like a performance guarantee.
OK, hmmm, after some thought, I guess it’s possible that this definition of “aligned” would be equivalent to a performance-centric claim along the lines of “asymptotically, performance goes up not down”. But I’m not sure that it’s exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:
We prove that the algorithm is aligned (in the above sense) via “direct reasoning about alignment” (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee.
We prove that the algorithm satisfies the asymptotic performance guarantee via “direct reasoning about performance”, and then a corollary of that proof would be that the algorithm is aligned (in the above sense).
I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it’s solvable at all, it would only be solvable with the help of various specific “alignment-promoting algorithm features”, and we won’t be able to prove that those features work except by “direct reasoning about alignment”.
The way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the “inner MDP” of a particular “metastate” effectively function as actions w.r.t. the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states[1]. Hopefully it is also possible to efficiently learn this type of hypothesis.
I don’t think that anywhere there we will need a lemma saying that the algorithm picks “aligned” goals.
[1] For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on “child” MDPs to actions of the “parent” MDP that are compatible with the transition kernel.
We can at least try doing those things by just having specific channels through which human actions enter the system. For example, maybe it’s enough to focus on what the human posts on Facebook, so the AI just needs to look at that. The problem with this is, it leaves us open to attack vectors in which the channel in question is hijacked. On the other hand, even if we had a robust way to point to the human brain, we would still have attack vectors in which the human themself gets “hacked” somehow.
In principle, I can imagine solving these problems by somehow having a robust definition of “unhacked human”, which is what you’re going for, I think. But there might be a different type of solution in which we just avoid entering “corrupt” states in which the content of the channel diverges from what we intended. For example, this might be achievable by quantilizing imitation.
Thanks!!! After reading your comment and thinking about it more, here’s where I’m at:
Your “demonstration” thing was described as “The [AI] observes a human pursuing eir values and deduces the values from the behavior.”
When I read that, I was visualizing a robot and a human standing in a room, and the human is cooking, and the robot is watching the human and figuring out what the human is trying to do. And I was thinking that there needs to be some extra story for how that works, assuming that the robot has come to understand the world by building a giant unlabeled Bayes net world-model, and that it processes new visual inputs by slotting them into that model. (And that’s my normal assumption, since that’s how I think the neocortex works, and therefore that’s a plausible way that people might build AGI, and it’s the one I’m mainly focused on.)
So as the robot is watching the human soak lentils, the thing going on in its head is: “Pattern 957823, and Pattern 5672928, and Pattern 657192, and…”. In order to have the robot assign a special status to the human’s deliberate actions, we would need to find “the human’s deliberate actions” somewhere in the unlabeled world-model, i.e. solve a symbol-grounding problem, and doing so reliably is not straightforward.
However, maybe I was visualizing the wrong thing, with the robot and human in the room. Maybe I should have instead been visualizing a human using a computer via its keyboard. Then the AI can have a special input channel for the keystrokes that the human types. And every single one of those keystrokes is automatically treated as “the human’s deliberate action”. This seems to avoid the symbol-grounding problem I mentioned above. And if there’s a special input channel, we can use supervised learning to build a probabilistic model of that input channel. (I definitely think this step is compatible with the neocortical algorithm.) So now we have a human policy, i.e., what the AI thinks the human would do next, in any real or imagined circumstance, at least in terms of which keystrokes they would type. I’m still a bit hazy on what happens next in the plan, i.e., getting from that probabilistic model to the more abstract “what the human wants”. At least in general. And a big part of that is, again, symbol-grounding: as soon as we step away from the concrete predictions coming out of the “human keystroke probabilistic model”, we’re up in the land of “World-model Pattern #8673028” etc. where we can’t really do anything useful. (I do see how the rest of the plan could work along these lines, where we install a second special human-to-AI information channel where the human says how things are going, and the AI builds a predictive model of that too, and then we predict-and-quantilize from the human policy.†)
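To make the “supervised learning on a special input channel” step concrete, here is a minimal sketch, not a claim about how any real system would do it. It assumes the keystroke channel is just a list of logged strings and fits a smoothed n-gram model over them; the class name `KeystrokeModel`, the `order` and `alpha` parameters, and the example logs are all hypothetical.

```python
from collections import Counter, defaultdict


class KeystrokeModel:
    """Toy n-gram model of a dedicated keystroke channel (illustrative only)."""

    def __init__(self, order=2, alpha=1.0):
        # order: how many previous keys form the context (assumed >= 1)
        # alpha: add-alpha smoothing so unseen keys keep nonzero probability
        self.order = order
        self.alpha = alpha
        self.counts = defaultdict(Counter)
        self.alphabet = set()

    def fit(self, logs):
        """Count which key follows each context across all logged sessions."""
        for log in logs:
            self.alphabet.update(log)
            padded = "^" * self.order + log  # "^" marks the start of a session
            for i in range(self.order, len(padded)):
                context = padded[i - self.order:i]
                self.counts[context][padded[i]] += 1
        return self

    def predict(self, history):
        """Return P(next key | last `order` keys of history) as a dict."""
        context = ("^" * self.order + history)[-self.order:]
        seen = self.counts.get(context, Counter())
        total = sum(seen.values()) + self.alpha * len(self.alphabet)
        return {k: (seen[k] + self.alpha) / total for k in sorted(self.alphabet)}


if __name__ == "__main__":
    model = KeystrokeModel(order=2).fit(["ls -la", "ls -l", "ls"])
    print(model.predict("l"))
```

The output of `predict` is exactly the kind of concrete object the paragraph above calls a human policy over keystrokes. Nothing in it says what the human wants, which is where the haziness starts.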
It’s still worth noting that I, Steve, personally can be standing in a room with another human H2, watching them cook, and I can figure out what H2 is trying to do. And if H2 is someone I really admire, I will automatically start wanting to do the things that H2 is trying to do. So human social instincts do seem to have a way forward through the symbol-grounding path above, and not through the special-input-channel path, and I continue to think that this symbol-grounding method has something to do with empathetic simulation, but I’m hazy on the details, and I continue to think that it would be very good to understand better how exactly it works.
†This does seem to be a nice end-to-end story by the way. So have we solved the alignment problem? No… You mention channel corruption as a concern, and it is, but I’m even more concerned about this kind of design hitting a capabilities wall dramatically earlier than unsafe AGIs would. Specifically, I think it’s important that an AGI be able to do things like “come up with a new way to conceptualize the alignment problem”, and I think doing those things requires goal-seeking-RL-type exploration (e.g. exploring different possible mathematical formalizations or whatever) within a space of mental “actions” none of which it has ever seen a human take. I don’t think that this kind of AGI approach would be able to do that, but I could be wrong. That’s another reason that I’m hoping something good will come out of the symbol-grounding path informed by how human social instincts work.
Well, one thing you could try is using the AIT definition of goal-directedness to go from the policy to the utility function. However, in general it might require knowledge of the human’s counterfactual behavior which the AI doesn’t have. Maybe there are some natural assumptions under which it is possible, but it’s not clear.
I feel the appeal of this intuition, but on the other hand, it might be a much easier problem since both of you are humans doing fairly “normal” human things. It is less obvious you would be able to watch something completely alien and unambiguously figure out what it’s trying to do.
To first approximation, it is enough for the AI to be more capable than us, since, whatever different solution we might come up with, an AI which is more capable than us would come up with a solution at least as good. Quantilizing from an imitation baseline seems like it should achieve that, since the baseline is “as capable as us” and arguably quantilization would produce significant improvement over that.
Instead of “actions the AI has seen a human take”, a better way to think about it is “actions the AI can confidently predict a human could take (with sufficient probability)”.
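For readers who haven't seen quantilization spelled out, here is a rough sample-based sketch of the idea: draw many candidate actions from the imitation baseline, keep the top fraction q by estimated utility, and pick uniformly among those. The function name `quantilize`, the toy `base_sampler`, and the toy `utility` are assumptions for illustration; in the discussion above the baseline would be the learned human policy.

```python
import random


def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Approximate a q-quantilizer by sampling.

    base_sampler(rng) draws one action from the baseline (e.g. an imitation
    policy); utility(action) scores it. We keep the best q fraction of the
    sampled actions and choose uniformly among them, so the result never
    leaves the support of the baseline.
    """
    rng = rng or random.Random()
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[:max(1, int(q * n_samples))]
    return rng.choice(top)


if __name__ == "__main__":
    # Hypothetical baseline: actions are integers 0..99, all equally likely.
    def baseline(rng):
        return rng.randrange(100)

    # Hypothetical utility: larger is better.
    print(quantilize(baseline, utility=lambda a: a, q=0.1, rng=random.Random(0)))
```

With q = 1 this is plain imitation, and as q shrinks it approaches straight maximization over whatever the baseline can produce. So the choice of q is where the trade-off between capability gain and staying close to human-like behavior lives.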
Thanks again for your very helpful response! I thought about the quantilization thing more, let me try again.
As background, to a first approximation, let’s say 5 times per second I (a human) “think a thought”. That involves two things:
(Possibly) update my world-model
(Possibly) take an action—in this case, type a key at the keyboard
Of these two things, the first one is especially important, because that’s where things get “figured out”. (Imagine staring into space while thinking about something.)
OK, now back to the AI. I can broadly imagine two strategies for a quantilization approach:
Build a model of the human policy from a superior epistemic vantage point: So here we give the AI its own world-model that needn’t have anything to do with the human’s, and likewise allow the AI to update its world-model in a way that needn’t have anything to do with how the human updates their world model. Then the AI leverages its superior world-model in the course of learning and quantilizing the human policy (maybe just the action part of the policy, or maybe both the actions and the world-model-updates, it doesn’t matter for the moment).
Straightforward human imitation: Here, we try to get to a place where the AI is learning about the world and figuring things out in a (quantilized) human-like way. So we want the AI to sample from the human policy for “taking an action”, and we want the AI to sample from the human policy for “updating the world-model”. And the AI doesn’t know anything about the world beyond what it learns through those quantilized-human-like world-model updates.
Start with the first one. If the AI is going to get to a superior epistemic vantage point, then it needs to “figure things out” about the world and concepts and so on, and as I said before, I think “figuring things out” requires goal-seeking-RL-type exploration (e.g. exploring different possible mathematical formalizations or whatever) within a space of mental “actions”. So we still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it. And since this is not a human-imitating system, we can’t fall back on that. So this doesn’t seem like we made much progress on the problem.
For the second one, well, I think I’m kinda more excited about this one.
Naively, it does seem hard though. Recall that in this approach we need to imitate both aspects of the human policy—plausibly-human actions, and plausibly-human world-model-updates. This seems hard, because the AI only sees the human’s actions, not its world-model updates. Can it infer the latter? I’m a bit pessimistic here, at least by default. Well, I’m optimistic that you can infer an underlying world-model from actions—based on e.g. GPT-3. But here, we’re not merely hoping to learn a snapshot of the human model, but also to learn all the human’s model-update steps. Intuitively, even when a human is talking to another human, it’s awfully hard to communicate the sequence of thoughts that led you to come up with an idea. Heck, it’s hard enough to understand how I myself figured something out, when it was in my own head five seconds ago. Another way to think about it is, you need a lot of data to constrain a world-model snapshot. So to constrain a world-model change, you presumably need a lot of data before the change, and a lot of data after the change. But “a lot of data” involves an extended period of time, which means there are thousands of sequential world-model changes all piled on top of each other, so it’s not a clean comparison.
A couple things that might help are (A) Giving the human a Kernel Flow or whatever and letting the AI access the data, and (B) Helping the inductive bias by running the AI on the same type of world-model data structure and inference algorithm as the human, and having it edit the model to get towards a place where its model and thought process exactly matches the human.
I’m weakly pessimistic that (A) would make much difference. I think (B) could help a lot, indeed I’m (weakly) hopeful that it could actually successfully converge towards the human thought process. And conveniently I also find (B) very technologically plausible.
So that’s neat. But we don’t have a superior epistemic vantage point anymore. So how do we quantilize? I figure, we can use some form of amplification: most simply, run the model at superhuman speeds so that it can “think longer” than the human on a given task. Or roll out different possible trains of thought in parallel, and rank how well they turn out. Or something. But I feel like once we’re doing all that stuff, we can just throw out the quantilization part of the story, and instead our safety story can be that we’re starting with a deeply-human-like model and not straying too far from it, so hopefully it will remain well-behaved. That was my (non-quantilization) story here.
Sorry if I’m still confused; I’m very interested in your take, if you’re not sick of this discussion yet. :)
I gave a formal mathematical definition of (idealized) HDTL, so the answer to your question should probably be contained there. But I’m not entirely sure what it is since I don’t entirely understand the question.
The AI has a “superior epistemic vantage point” in the sense that, the prior ζ is richer than the prior that humans have. But, why do we “still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it”? The objective is fully specified.
A possible interpretation of your argument: a powerful AI would have to do something like TRL and access to the “envelope” computer can be unsafe in itself, because of possible side effects. That’s truly a serious problem! Essentially, it’s non-Cartesian daemons.
Atm I don’t have an extremely good solution to non-Cartesian daemons. Homomorphic cryptography can arguably solve it, but there’s large overhead. Possibly we can make do with some kind of obfuscation instead. Another vague idea I have is, make the AI avoid running computations which have side-effects predictable by the AI. In any case, more work is needed.
I don’t see why it is especially hard, it seems just like any system with unobservable degrees of freedom, which covers just about anything in the real world. So I would expect an AI with transformative capability to be able to do it. But maybe I’m just misunderstanding what you mean by this “approach number 2”. Perhaps you’re saying that it’s not enough to accurately predict the human actions, we need to have accurate pointers to particular gears inside the model. But I don’t think we do (maybe it’s because I’m following approach number 1).
Thanks, that was a helpful comment. I think we’re making progress, or at least I’m learning a lot here. :)
I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous.
So for example, let’s talk about some useful non-obvious concept, like “informational entropy”. And let’s suppose that our AI cannot learn the concept of “informational entropy” from humans, because we’re in an alternate universe where humans haven’t yet invented the concept of informational entropy. (Or replace “informational entropy” with “some important not-yet-discovered concept in AI alignment”.)
In that case, I see three possibilities.
First, the AI never winds up “knowing about” informational entropy or anything equivalent to it, and consequently makes worse predictions about various domains (human scientific and technological progress, the performance of certain algorithms and communications protocols, etc.)
Second (I think this is your model?): the AI’s prior has a combinatorial explosion with every possible way of conceptualizing the world, of which an astronomically small proportion are actually correct and useful. With enough data, the AI settles into a useful conceptualization of the world, including some sub-network in its latent space that’s equivalent to informational entropy. In other words: it “discovers” informational entropy by dumb process of elimination.
Third (this is my model): we get a prior by running a “prior-building AI”. This prior-building AI has “agency”; it “actively” learns how the world works, by directing its attention etc. It has curiosity and instrumental reasoning and planning and so on, and it gradually learns instrumentally-useful metacognitive strategies, like a habit of noticing and attending to important and unexplained and suggestive patterns, and good intuitions around how to find useful new concepts, etc. At some point it notices some interesting and relevant patterns, attends to them, and after a few minutes of trial-and-error exploration it eventually invents the concept of informational entropy. This new concept (and its web of implications) then gets incorporated into the AI’s new “priors” going forward, allowing the AI to make better predictions and formulate better plans in the future, and to discover yet more predictively-useful concepts, etc. OK, now we let this “prior-building AI” run and run, building an ever-better “prior” (a.k.a. “world-model”). And then at some point we can turn this AI off, and export this “prior” into some other AI algorithm. (Alternatively, we could also more simply just have one AI which is both the “prior-building AI” and the AI that does, um, whatever we want our AIs to do.)
It seems pretty clear to me that the third approach is way more dangerous than the second. In particular, the third one is explicitly doing instrumental planning and metacognition, which seem like the same kinds of activities that could lead to the idea of seizing control of the off-switch etc.
However, my hypothesis is that the third approach can get us to human-level intelligence (or what I was calling a “superior epistemic vantage point”) in practice, and that the other approaches can’t.
So, I was thinking about the third approach—and that’s why I said “we still have the whole AGI alignment / control problem” (i.e., aligning and controlling the “prior-building AI”). Does that help?
I think the confusion here comes from mixing algorithms with desiderata. HDTL is not an algorithm, it is a type of desideratum that an algorithm can satisfy. “the AI’s prior has a combinatorial explosion” is true but “dumb process of elimination” is false. A powerful AI has to have a very rich space of hypotheses it can learn. But this doesn’t mean this space of hypotheses is explicitly stored in its memory or anything of the sort (which would be infeasible). It only means that the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally (which might correspond to refinement in the infra-Bayesian sense).
My thesis here is that if the AI satisfies a (carefully fleshed out in much more detail) version of the HDTL desideratum, then it is safe and capable. How to make an efficient algorithm that satisfies such a desideratum is another question, but that’s a question from a somewhat different domain: specifically the domain of developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms. I see the latter effort as to first approximation orthogonal to the effort of finding good formal desiderata for safe TAI (and, it also receives plenty of attention from outside the existential safety community).
Thanks!! Here’s where I’m at right now.
In the grandparent comment I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a “prior-building AI” with “agency” that is “trying” to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential “prior-building AI” that doesn’t also incidentally “try” to seize control of its off-switch.
Then in the parent comment you replied (IIUC) that if this is a problem at all, it’s not the problem you’re trying to solve (i.e. “finding good formal desiderata for safe TAI”), but a different problem (i.e. “developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms”), and my problem is “to a first approximation orthogonal” to your problem, and my problem “receives plenty of attention from outside the existential safety community”.
If so, my responses would be:
Obviously the problem of “make an agential “prior-building AI” that doesn’t try to seize control of its off-switch” is being worked on almost exclusively by x-risk people. :-P
I suspect that the problem doesn’t decompose the way you imply; instead I think that if we develop techniques for building a safe agential “prior-building AI”, we would find that similar techniques enable us to build a safe non-manipulative-question-answering AI / oracle AI / helper AI / whatever.
Even if that’s not true, I would still say that if we can make a safe agential “prior-building AI” that gets to human-level predictive ability and beyond, then we’ve solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those “learning algorithms with strong formal guarantees” will turn out not to exist, at least not at human-level capabilities.
I guess, when I read “learning algorithms with strong formal guarantees”, I’m imagining something like multi-armed bandit algorithms that have regret bounds. But I’m having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don’t know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
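As a reference point for what “a bandit algorithm with a regret bound” looks like, here is a small sketch of the standard UCB1 rule on Bernoulli arms, tracking pseudo-regret against the best arm. The arm means, horizon, and function name `ucb1` are made up for the example. Note that the set of arms is fixed and known up front, which is the feature that is hard to carry over to open-ended concept discovery.

```python
import math
import random


def ucb1(arm_means, horizon, rng):
    """Run UCB1 on Bernoulli arms; return (pseudo-regret, pulls per arm).

    arm_means are the true success probabilities, used only to simulate
    rewards and to measure regret, never by the decision rule itself.
    """
    n_arms = len(arm_means)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # empirical mean plus an exploration bonus that shrinks with pulls
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        regret += best - arm_means[arm]
    return regret, pulls


if __name__ == "__main__":
    total_regret, pulls = ucb1([0.2, 0.5, 0.8], horizon=5000, rng=random.Random(0))
    print(total_regret, pulls)
```

The known guarantee for UCB1 is that expected regret grows only logarithmically in the horizon, with constants depending on the gaps between arm means. The proof never needs to enumerate what the algorithm will do on any particular run, which is relevant to the “for all x: P(x)” point raised in the reply below.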
Umm, obviously I did not claim it isn’t. I just decomposed the original problem in a different way that didn’t single out this part.
Maybe? I’m not quite sure what you mean by “prior building AI” and whether it’s even possible to apply a “microscope” to something superhuman, or that this approach is easier than other approaches, but I’m not necessarily ruling it out.
That’s where our major disagreement is, I think. I see human brains as evidence such algorithms exist and deep learning as additional evidence. We know that powerful learning algorithms exist. We know that no algorithm can learn anything (no free lunch). What we need is a mathematical description of the space of hypotheses these algorithms are good at, and associated performance bounds. The enormous generality of these algorithms suggests that there probably is such a simple description.
I don’t understand your argument here. When I prove a theorem that “for all x: P(x)”, I don’t need to be able to imagine every possible value of x. That’s the power of abstraction. To give a different example, the programmers of AlphaGo could not possibly anticipate all the strategies it came up with or all the life and death patterns it discovered. That wasn’t a problem for them either.
Hmmm, OK, let me try again.
You wrote earlier: “the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally”.
My claim is that good-enough algorithms for “adding more and more detail incrementally” will also incidentally (by default) be algorithms that seize control of their off-switches.
And the reason I put a lot of weight on this claim is that I think the best algorithms for “adding more and more detail incrementally” may be algorithms that are (loosely speaking) “trying” to understand and/or predict things, including via metacognition and instrumental reasoning.
OK, then the way I’m currently imagining you responding to that would be:
(If that’s not the kind of argument you have in mind, oops sorry!)
Otherwise: I feel like that’s akin to putting “the AGI will be safe” as a desideratum, which pushes “solve AGI safety” onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That’s perfectly fine, and indeed precisely defining “safe” is very useful. It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem. (Also, if we divide the problem this way, then “we can’t find a provably-safe AGI design” would be re-cast as “no human-level learning algorithms satisfy the desiderata”.)
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like “I can prove that AlphaGo plays go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren’t very good at playing Go.” If you weren’t thinking of performance guarantees, what “formal guarantees” are you thinking of?
(For what little it’s worth, I’d be a bit surprised if we get a safety guarantee via a performance guarantee. It strikes me as more promising to reason about safety directly—e.g. “this algorithm won’t seize control of the off-switch because blah blah incentives blah blah mesa-optimizers blah blah”.)
Sorry if I’m still misunderstanding. :)
I never said it’s not a safety problem. I only said that a lot of progress on this can come from research that is not very “safety specific”. I would certainly work on it if “precisely defining safe” was already solved.
Yes, we don’t have these things. That doesn’t mean these things don’t exist. Surely all research is about going from “not having” things to “having” things? (Strictly speaking, it would be very hard to literally have a performance guarantee about the brain since the brain doesn’t have to be anything like a “clean” implementation of a particular algorithm. But that’s beside the point.)
Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled “Lemma 1: This learning algorithm will be aligned”.
The reason I think that is that (as above) I expect the learning algorithms in question to be kinda “agential”, and if an “agential” algorithm is not “trying” to perform well on the objective, then it probably won’t perform well on the objective! :-)
If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we’ve already solved the alignment problem. So the performance guarantee would be beside the point (on this view).
I don’t understand what Lemma 1 is if it’s not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.
Good question!
Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as “goals”, and then makes plans to advance those “goals”. (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is “aligned” if the things flagged as “goals” do in fact correspond to maximizing the objective function (e.g. “predict the human’s outputs”), or at least it’s as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.
Making that definition better and rigorous would be tricky because it’s hard to talk rigorously about symbol-grounding, but maybe it’s not impossible. And if so, I would say that this is a definition of “aligned” which looks nothing like a performance guarantee.
OK, hmmm, after some thought, I guess it’s possible that this definition of “aligned” would be equivalent to a performance-centric claim along the lines of “asymptotically, performance goes up not down”. But I’m not sure that it’s exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:
We prove that the algorithm is aligned (in the above sense) via “direct reasoning about alignment” (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee.
We prove that the algorithm satisfies the asymptotic performance guarantee via “direct reasoning about performance”, and then a corollary of that proof would be that the algorithm is aligned (in the above sense).
I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it’s solvable at all, it would only be solvable with the help of various specific “alignment-promoting algorithm features”, and we won’t be able to prove that those features work except by “direct reasoning about alignment”.
The way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the “inner MDP” of a particular “metastate” effectively function as actions w.r.t. the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states[1]. Hopefully it is also possible to efficiently learn this type of hypothesis.
I don’t think that anywhere there we will need a lemma saying that the algorithm picks “aligned” goals.
[1] For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on “child” MDPs to actions of the “parent” MDP that are compatible with the transition kernel.
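A toy illustration of the “inner policies act as higher-level actions” idea, under loud simplifying assumptions: one metastate with two inner states, two fixed inner policies given directly as transition matrices, a small exit probability `eps` that only applies in inner state 1, and the quasi-stationary distribution approximated by the ordinary stationary distribution of the inner chain (reasonable only when `eps` is tiny). All numbers and names here are hypothetical.

```python
def stationary(P, iters=1000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def macro_exit_rate(P_inner, exit_prob, eps):
    """Approximate per-step probability of leaving the metastate.

    The fast inner dynamics are assumed to settle to their stationary
    distribution before any slow exit happens, so the exit rate is the
    stationary average of the per-state exit probabilities, scaled by eps.
    """
    pi = stationary(P_inner)
    return eps * sum(p * e for p, e in zip(pi, exit_prob))


# Two inner policies, each inducing a different fast chain on inner states {0, 1}.
P_A = [[0.9, 0.1], [0.5, 0.5]]  # policy A spends most of its time in state 0
P_B = [[0.5, 0.5], [0.1, 0.9]]  # policy B spends most of its time in state 1
EXIT = [0.0, 1.0]               # the slow exit is only possible from state 1
EPS = 0.01                      # timescale separation between the two levels

if __name__ == "__main__":
    print(macro_exit_rate(P_A, EXIT, EPS), macro_exit_rate(P_B, EXIT, EPS))
```

From the point of view of the higher level, choosing policy A versus policy B is just choosing between two different slow transition rates out of the metastate. The planner at that level never has to look at the individual inner states, which is where the hoped-for savings in time complexity would come from.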
Thanks! I’m still thinking about this, but quick question: when you say “AIT definition of goal-directedness”, what does “AIT” mean?
Algorithmic Information Theory