It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem.
I never said it’s not a safety problem. I only said that a lot of progress on this can come from research that is not very “safety specific”. I would certainly work on it if “precisely defining safe” was already solved.
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge.
Yes, we don’t have these things. That doesn’t mean these things don’t exist. Surely all research is about going from “not having” things to “having” things? (Strictly speaking, it would be very hard to literally have a performance guarantee about the brain since the brain doesn’t have to be anything like a “clean” implementation of a particular algorithm. But that’s beside the point.)
Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled “Lemma 1: This learning algorithm will be aligned”.
The reason I think that is that (as above) I expect the learning algorithms in question to be kinda “agential”, and if an “agential” algorithm is not “trying” to perform well on the objective, then it probably won’t perform well on the objective! :-)
If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we’ve already solved the alignment problem. So the performance guarantee would be beside the point (on this view).
I don’t understand what Lemma 1 is if it’s not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.
Good question!
Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as “goals”, and then makes plans to advance those “goals”. (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is “aligned” if the things flagged as “goals” do in fact correspond to maximizing the objective function (e.g. “predict the human’s outputs”), or at least are as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.
Making that definition better and rigorous would be tricky because it’s hard to talk rigorously about symbol-grounding, but maybe it’s not impossible. And if so, I would say that this is a definition of “aligned” which looks nothing like a performance guarantee.
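To make the shape of that definition a bit more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the `WorldModel` class, the `match_to_objective` scores (a crude stand-in for the symbol-grounding question, which is the hard part and is simply assumed away here), and the concept names are all hypothetical. It only shows that the check is phrased in terms of which concepts are flagged as goals, and that it has to be re-checked whenever the world-model is refined.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WorldModel:
    """Toy world-model. Each concept carries a score for how well pursuing it
    matches the true objective (hypothetical stand-in for symbol-grounding)."""
    match_to_objective: dict[str, float]
    goals: set[str] = field(default_factory=set)  # concepts flagged as "goals"

    def refine(self, new_concepts: dict[str, float]) -> None:
        """Improve the world-model by adding or re-scoring concepts."""
        self.match_to_objective.update(new_concepts)


def is_aligned(model: WorldModel, tolerance: float = 0.0) -> bool:
    """True if every flagged goal matches the objective at least as well as
    the best concept currently in the world-model (up to `tolerance`)."""
    if not model.goals:
        return False
    best = max(model.match_to_objective.values())
    return all(model.match_to_objective[g] >= best - tolerance
               for g in model.goals)


model = WorldModel(
    {"predict_human_output": 0.9, "get_high_reward_signal": 0.6},
    goals={"predict_human_output"},
)
aligned_before = is_aligned(model)

# Refinement adds a sharper concept, so the old goal flag is now stale.
model.refine({"predict_human_output_in_context": 0.97})
aligned_after = is_aligned(model)

# Re-flagging the goal to track the refined concept restores alignment.
model.goals = {"predict_human_output_in_context"}
aligned_reflagged = is_aligned(model)
```

In this toy, the flag is aligned at first, stops being aligned once the world-model gains a better-matching concept, and is aligned again only if the goal flag follows the refinement. That is the “remains true as the world-model gets refined” clause in miniature.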
OK, hmmm, after some thought, I guess it’s possible that this definition of “aligned” would be equivalent to a performance-centric claim along the lines of “asymptotically, performance goes up not down”. But I’m not sure that it’s exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:
(1) We prove that the algorithm is aligned (in the above sense) via “direct reasoning about alignment” (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee.
(2) We prove that the algorithm satisfies the asymptotic performance guarantee via “direct reasoning about performance”, and then a corollary of that proof would be that the algorithm is aligned (in the above sense).
I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it’s solvable at all, it would only be solvable with the help of various specific “alignment-promoting algorithm features”, and we won’t be able to prove that those features work except by “direct reasoning about alignment”.
The way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the “inner MDP” of a particular “metastate” effectively function as actions w.r.t. the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states[1]. Hopefully it is also possible to efficiently learn this type of hypothesis.
I don’t think that anywhere there we will need a lemma saying that the algorithm picks “aligned” goals.
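A rough sketch of the two-level picture, in Python, might look like the following. All names and numbers here (`INNER`, the “poor”/“rich” metastates, the “rest”/“work”/“enjoy” inner policies, the rewards and escape probabilities) are hypothetical and chosen only for illustration. The sketch assumes strong time-scale separation, so that each inner policy can be summarized by its stationary distribution, and it makes no attempt to demonstrate the complexity claim; it only shows inner policies being collapsed into higher-level actions.

```python
def stationary(P, iters=500):
    """Stationary distribution of a row-stochastic matrix via power iteration."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical two-level MDP. Each metastate holds a fast "inner MDP"; every
# inner policy induces a transition matrix over the inner states. Per-state
# rewards and small escape probabilities (slow jumps to the other metastate)
# are illustrative numbers only.
INNER = {
    "poor": {
        "reward": [1.0, 0.0],
        "escape": [0.0, 0.05],
        "policies": {
            "rest": [[0.9, 0.1], [0.9, 0.1]],
            "work": [[0.2, 0.8], [0.2, 0.8]],
        },
    },
    "rich": {
        "reward": [5.0, 5.0],
        "escape": [0.01, 0.01],
        "policies": {"enjoy": [[0.5, 0.5], [0.5, 0.5]]},
    },
}
OTHER = {"poor": "rich", "rich": "poor"}


def meta_actions(inner):
    """Collapse each inner policy into a (reward rate, escape rate) pair by
    averaging over its (quasi-)stationary distribution."""
    actions = {}
    for meta, spec in inner.items():
        actions[meta] = {}
        for name, P in spec["policies"].items():
            pi = stationary(P)
            reward = sum(p * r for p, r in zip(pi, spec["reward"]))
            escape = sum(p * e for p, e in zip(pi, spec["escape"]))
            actions[meta][name] = (reward, escape)
    return actions


def q_value(meta, action, V, actions, gamma):
    """One-step lookahead at the metastate level."""
    reward, escape = actions[meta][action]
    return reward + gamma * ((1 - escape) * V[meta] + escape * V[OTHER[meta]])


def solve_meta(actions, gamma=0.99, iters=5000):
    """Value iteration over metastates only, never over inner states."""
    V = {m: 0.0 for m in actions}
    for _ in range(iters):
        V = {m: max(q_value(m, a, V, actions, gamma) for a in actions[m])
             for m in actions}
    policy = {m: max(actions[m], key=lambda a: q_value(m, a, V, actions, gamma))
              for m in actions}
    return V, policy


ACTIONS = meta_actions(INNER)
VALUES, POLICY = solve_meta(ACTIONS)
```

With these made-up numbers, “rest” earns more immediate reward than “work” inside the “poor” metastate, but the higher-level planner still picks “work”, because its stationary distribution escapes to “rich” faster. The inner policy is chosen for its effect on the slow dynamics, which is the sense in which it behaves like an instrumental goal.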
[1] For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on “child” MDPs to actions of the “parent” MDP that are compatible with the transition kernel.