Obviously the problem of “make an agential ‘prior-building AI’ that doesn’t try to seize control of its off-switch” is being worked on almost exclusively by x-risk people.
Umm, obviously I did not claim it isn’t. I just decomposed the original problem in a different way that didn’t single out this part.
...if we can make a safe agential “prior-building AI” that gets to human-level predictive ability and beyond, then we’ve solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.
Maybe? I’m not quite sure what you mean by “prior-building AI” and whether it’s even possible to apply a “microscope” to something superhuman, or that this approach is easier than other approaches, but I’m not necessarily ruling it out.
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those “learning algorithms with strong formal guarantees” will turn out not to exist, at least not at human-level capabilities.
That’s where our major disagreement is, I think. I see human brains as evidence such algorithms exist and deep learning as additional evidence. We know that powerful learning algorithms exist. We know that no algorithm can learn everything (no free lunch). What we need is a mathematical description of the space of hypotheses these algorithms are good at, and associated performance bounds. The enormous generality of these algorithms suggests that there probably is such a simple description.
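Schematically, by “performance bounds” I mean something like a regret bound: for every environment $\mu$ in the hypothesis class $\mathcal{H}$,

$$\mathrm{Regret}_T(\mathrm{alg},\mu) \;:=\; \max_{\pi}\,\mathbb{E}^{\pi}_{\mu}\!\left[\sum_{t=1}^{T} r_t\right] - \mathbb{E}^{\mathrm{alg}}_{\mu}\!\left[\sum_{t=1}^{T} r_t\right] \;=\; o(T),$$

with constants depending on some complexity measure of $\mathcal{H}$. (The exact shape here is only for illustration.) No free lunch means $\mathcal{H}$ can’t be “all environments”; the open problem is to find an $\mathcal{H}$ that has a simple description, is rich enough to cover the hypotheses human-level learners actually exploit, and still admits such a bound.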
...I’m having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don’t know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
I don’t understand your argument here. When I prove a theorem that “for all x: P(x)”, I don’t need to be able to imagine every possible value of x. That’s the power of abstraction. To give a different example, the programmers of AlphaGo could not possibly anticipate all the strategies it came up with or all the life and death patterns it discovered. That wasn’t a problem for them either.
Hmmm, OK, let me try again.

You wrote earlier: “the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally”.
My claim is that good-enough algorithms for “adding more and more detail incrementally” will also incidentally (by default) be algorithms that seize control of their off-switches.
And the reason I put a lot of weight on this claim is that I think the best algorithms for “adding more and more detail incrementally” may be algorithms that are (loosely speaking) “trying” to understand and/or predict things, including via metacognition and instrumental reasoning.
OK, then the way I’m currently imagining you responding to that would be:
My model of Vanessa: We’re hopefully gonna find a learning algorithm with a provable regret bound (or something like that). Since seizing control of the off-switch would be very bad according to the objective function and thus violate the regret bound, and since we proved the regret bound, we conclude that the learning algorithm won’t seize control of the off-switch.
(If that’s not the kind of argument you have in mind, oops sorry!)
Otherwise: I feel like that’s akin to putting “the AGI will be safe” as a desideratum, which pushes “solve AGI safety” onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That’s perfectly fine, and indeed precisely defining “safe” is very useful. It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem. (Also, if we divide the problem this way, then “we can’t find a provably-safe AGI design” would be re-cast as “no human-level learning algorithms satisfy the desiderata”.)
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like “I can prove that AlphaGo plays Go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren’t very good at playing Go.” If you weren’t thinking of performance guarantees, what “formal guarantees” are you thinking of?
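In schematic form, the imagined argument is

$$\Big(\mathrm{Perf}(\mathrm{alg}) \ge p\Big) \;\wedge\; \Big(\forall \pi:\ \mathrm{bad}(\pi) \Rightarrow \mathrm{Perf}(\pi) < p\Big) \;\Longrightarrow\; \neg\,\mathrm{bad}(\mathrm{alg}),$$

i.e. the safety conclusion comes from the performance guarantee plus a separate claim that the bad behavior (wireheading, seizing the off-switch, etc.) is necessarily low-performance.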
(For what little it’s worth, I’d be a bit surprised if we get a safety guarantee via a performance guarantee. It strikes me as more promising to reason about safety directly—e.g. “this algorithm won’t seize control of the off-switch because blah blah incentives blah blah mesa-optimizers blah blah”.)

Sorry if I’m still misunderstanding. :)
It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem.
I never said it’s not a safety problem. I only said that a lot of progress on this can come from research that is not very “safety specific”. I would certainly work on it if “precisely defining safe” was already solved.
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge.
Yes, we don’t have these things. That doesn’t mean these things don’t exist. Surely all research is about going from “not having” things to “having” things? (Strictly speaking, it would be very hard to literally have a performance guarantee about the brain since the brain doesn’t have to be anything like a “clean” implementation of a particular algorithm. But that’s beside the point.)
Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled “Lemma 1: This learning algorithm will be aligned”.
The reason I think that is that (as above) I expect the learning algorithms in question to be kinda “agential”, and if an “agential” algorithm is not “trying” to perform well on the objective, then it probably won’t perform well on the objective! :-)
If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we’ve already solved the alignment problem. So the performance guarantee would be beside the point (on this view).
I don’t understand what Lemma 1 is if it’s not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.
Good question!

Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as “goals”, and then makes plans to advance those “goals”. (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is “aligned” if the things flagged as “goals” do in fact correspond to maximizing the objective function (e.g. “predict the human’s outputs”), or at least it’s as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.
Making that definition better and more rigorous would be tricky because it’s hard to talk rigorously about symbol-grounding, but maybe it’s not impossible. And if so, I would say that this is a definition of “aligned” which looks nothing like a performance guarantee.
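For concreteness, one very rough way to start writing it down: let $M_t$ be the world-model at time $t$, let $g_t \in M_t$ be the thing flagged as a “goal”, let $U$ be the objective function, and let $G_t$ be the (hard-to-formalize) symbol-grounding map from elements of $M_t$ to the real-world things they refer to. Then “aligned” would be roughly

$$\forall t:\;\; g_t \in \operatorname*{arg\,max}_{x \in M_t}\ \mathrm{match}\big(G_t(x),\ \text{maximizing } U\big),$$

where “match” is some (also hard-to-formalize) measure of how well a referent corresponds to maximizing $U$, and the “$\forall t$” is the part about it staying true as the world-model gets refined.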
OK, hmmm, after some thought, I guess it’s possible that this definition of “aligned” would be equivalent to a performance-centric claim along the lines of “asymptotically, performance goes up not down”. But I’m not sure that it’s exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:
1. We prove that the algorithm is aligned (in the above sense) via “direct reasoning about alignment” (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee.
2. We prove that the algorithm satisfies the asymptotic performance guarantee via “direct reasoning about performance”, and then a corollary of that proof would be that the algorithm is aligned (in the above sense).
I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it’s solvable at all, it would only be solvable with the help of various specific “alignment-promoting algorithm features”, and we won’t be able to prove that those features work except by “direct reasoning about alignment”.
The way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the “inner MDP” of a particular “metastate” effectively function as actions w.r.t. the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states[1]. Hopefully it is also possible to efficiently learn this type of hypothesis.
I don’t think that anywhere in this we will need a lemma saying that the algorithm picks “aligned” goals.
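As a toy numerical illustration of the “quasi-stationary distributions effectively function as actions” point (just a two-level sketch with made-up numbers, not the actual construction):

```python
import numpy as np

# Toy two-level hierarchical MDP: states are the leaves of a depth-2 tree.
# Dynamics inside a "metastate" (subtree) mix fast, while jumps between
# metastates are rare, so the fast dynamics settle into a quasi-stationary
# distribution long before any slow jump. From the top level's point of view,
# choosing an inner policy is effectively choosing that distribution.

rng = np.random.default_rng(0)
n_inner = 4  # leaves per metastate

def inner_kernel(bias):
    """Fast transition kernel inside one metastate, shaped by an inner 'policy'."""
    P = rng.random((n_inner, n_inner)) + bias
    return P / P.sum(axis=1, keepdims=True)

def quasi_stationary(P, iters=1000):
    """Approximate the stationary distribution of the fast inner dynamics."""
    d = np.full(n_inner, 1.0 / n_inner)
    for _ in range(iters):
        d = d @ P
    return d

# Two different inner policies induce two different quasi-stationary
# distributions; those distributions are what act as "actions" w.r.t. the
# higher level of the hierarchy.
for name, bias in [("inner policy A", 5 * np.eye(n_inner)),
                   ("inner policy B", np.ones((n_inner, n_inner)))]:
    print(name, "->", np.round(quasi_stationary(inner_kernel(bias)), 3))
```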
[1] For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on “child” MDPs to actions of the “parent” MDP that are compatible with the transition kernel.