why do we “still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it”? The objective is fully specified…
Thanks, that was a helpful comment. I think we’re making progress, or at least I’m learning a lot here. :)
I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous.
So for example, let’s talk about some useful non-obvious concept, like “informational entropy”. And let’s suppose that our AI cannot learn the concept of “informational entropy” from humans, because we’re in an alternate universe where humans haven’t yet invented the concept of informational entropy. (Or replace “informational entropy” with “some important not-yet-discovered concept in AI alignment”.)
In that case, I see three possibilities.
First, the AI never winds up “knowing about” informational entropy or anything equivalent to it, and consequently makes worse predictions about various domains (human scientific and technological progress, the performance of certain algorithms and communications protocols, etc.)
Second (I think this is your model?): the AI’s prior has a combinatorial explosion with every possible way of conceptualizing the world, of which an astronomically small proportion are actually correct and useful. With enough data, the AI settles into a useful conceptualization of the world, including some sub-network in its latent space that’s equivalent to informational entropy. In other words: it “discovers” informational entropy by dumb process of elimination.
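(To make the “dumb process of elimination” picture concrete, here’s a toy sketch in Python with a deliberately tiny hypothesis class; the target “concept” and all details are made up for illustration. At realistic scale, explicitly enumerating the hypothesis space like this is of course infeasible, which is part of what’s at issue below:)

```python
import itertools

# Toy "process of elimination": enumerate every hypothesis (here, every
# boolean function on 3 input bits -- already 2^(2^3) = 256 of them),
# start from a uniform prior, and eliminate hypotheses inconsistent with
# observed data. Real hypothesis spaces explode combinatorially, which
# is why explicit enumeration cannot scale.

inputs = list(itertools.product([0, 1], repeat=3))
# A hypothesis is one output bit per possible input: 256 hypotheses total.
hypotheses = list(itertools.product([0, 1], repeat=len(inputs)))

def true_concept(x):
    # The "useful concept" the learner must rediscover: parity of the bits.
    return x[0] ^ x[1] ^ x[2]

posterior = {h: 1.0 / len(hypotheses) for h in hypotheses}  # uniform prior
for x in inputs:  # observe labeled data, eliminate inconsistent hypotheses
    y = true_concept(x)
    for h in list(posterior):
        if h[inputs.index(x)] != y:
            del posterior[h]

survivors = list(posterior)
print(len(survivors))  # prints 1: only the parity concept survives
```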
Third (this is my model): we get a prior by running a “prior-building AI”. This prior-building AI has “agency”; it “actively” learns how the world works, by directing its attention etc. It has curiosity and instrumental reasoning and planning and so on, and it gradually learns instrumentally-useful metacognitive strategies, like a habit of noticing and attending to important and unexplained and suggestive patterns, and good intuitions around how to find useful new concepts, etc. At some point it notices some interesting and relevant patterns, attends to them, and after a few minutes of trial-and-error exploration it eventually invents the concept of informational entropy. This new concept (and its web of implications) then gets incorporated into the AI’s new “priors” going forward, allowing the AI to make better predictions and formulate better plans in the future, and to discover yet more predictively-useful concepts, etc. OK, now we let this “prior-building AI” run and run, building an ever-better “prior” (a.k.a. “world-model”). And then at some point we can turn this AI off, and export this “prior” into some other AI algorithm. (Alternatively, we could also more simply just have one AI which is both the “prior-building AI” and the AI that does, um, whatever we want our AIs to do.)
It seems pretty clear to me that the third approach is way more dangerous than the second. In particular, the third one explicitly does instrumental planning and metacognition, which seem like the same kinds of activities that could lead to the idea of seizing control of the off-switch etc.
However, my hypothesis is that the third approach can get us to human-level intelligence (or what I was calling a “superior epistemic vantage point”) in practice, and that the other approaches can’t.
So, I was thinking about the third approach—and that’s why I said “we still have the whole AGI alignment / control problem” (i.e., aligning and controlling the “prior-building AI”). Does that help?
I think the confusion here comes from mixing algorithms with desiderata. HDTL is not an algorithm, it is a type of desideratum that an algorithm can satisfy. “The AI’s prior has a combinatorial explosion” is true, but “dumb process of elimination” is false. A powerful AI has to have a very rich space of hypotheses it can learn. But this doesn’t mean this space of hypotheses is explicitly stored in its memory or anything of the sort (which would be infeasible). It only means that the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally (which might correspond to refinement in the infra-Bayesian sense).
My thesis here is that if the AI satisfies a (carefully fleshed out in much more detail) version of the HDTL desideratum, then it is safe and capable. How to make an efficient algorithm that satisfies such a desideratum is another question, but that’s a question from a somewhat different domain: specifically the domain of developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms. I see the latter effort as, to a first approximation, orthogonal to the effort of finding good formal desiderata for safe TAI (and it also receives plenty of attention from outside the existential safety community).
In the grandparent comment I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a “prior-building AI” with “agency” that is “trying” to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential “prior-building AI” that doesn’t also incidentally “try” to seize control of its off-switch.
Then in the parent comment you replied (IIUC) that if this is a problem at all, it’s not the problem you’re trying to solve (i.e. “finding good formal desiderata for safe TAI”), but a different problem (i.e. “developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms”), and my problem is “to a first approximation orthogonal” to your problem, and my problem “receives plenty of attention from outside the existential safety community”.
If so, my responses would be:
Obviously the problem of “make an agential “prior-building AI” that doesn’t try to seize control of its off-switch” is being worked on almost exclusively by x-risk people. :-P
I suspect that the problem doesn’t decompose the way you imply; instead I think that if we develop techniques for building a safe agential “prior-building AI”, we would find that similar techniques enable us to build a safe non-manipulative-question-answering AI / oracle AI / helper AI / whatever.
Even if that’s not true, I would still say that if we can make a safe agential “prior-building AI” that gets to human-level predictive ability and beyond, then we’ve solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those “learning algorithms with strong formal guarantees” will turn out not to exist, at least not at human-level capabilities.
I guess, when I read “learning algorithms with strong formal guarantees”, I’m imagining something like multi-armed bandit algorithms that have regret bounds. But I’m having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don’t know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
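(For concreteness, the kind of regret-bounded algorithm I have in mind is something like the textbook UCB1 rule, sketched below in Python. The arm payoff probabilities are made up for illustration; the standard result is that UCB1’s expected regret grows only logarithmically in the horizon:)

```python
import math
import random

# UCB1 for a Bernoulli multi-armed bandit: a standard example of a
# learning algorithm with a provable O(log T) regret bound.
def ucb1(means, horizon, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(means)   # pulls per arm
    sums = [0.0] * len(means)   # total reward per arm
    for t in range(1, horizon + 1):
        if t <= len(means):
            arm = t - 1  # play each arm once to initialize
        else:
            # pick the arm maximizing empirical mean + exploration bonus
            arm = max(range(len(means)),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# With a big gap between arms, the suboptimal arm gets pulled rarely.
counts = ucb1([0.9, 0.5], horizon=5000)
```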
Obviously the problem of “make an agential “prior-building AI” that doesn’t try to seize control of its off-switch” is being worked on almost exclusively by x-risk people.
Umm, obviously I did not claim it isn’t. I just decomposed the original problem in a different way that didn’t single out this part.
...if we can make a safe agential “prior-building AI” that gets to human-level predictive ability and beyond, then we’ve solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.
Maybe? I’m not quite sure what you mean by “prior-building AI”, whether it’s even possible to apply a “microscope” to something superhuman, or whether this approach is easier than other approaches, but I’m not necessarily ruling it out.
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those “learning algorithms with strong formal guarantees” will turn out not to exist, at least not at human-level capabilities.
That’s where our major disagreement is, I think. I see human brains as evidence that such algorithms exist, and deep learning as additional evidence. We know that powerful learning algorithms exist. We also know that no algorithm can learn every hypothesis (no free lunch). What we need is a mathematical description of the space of hypotheses these algorithms are good at learning, and associated performance bounds. The enormous generality of these algorithms suggests that there probably is such a simple description.
...I’m having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don’t know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
I don’t understand your argument here. When I prove a theorem that “for all x: P(x)”, I don’t need to be able to imagine every possible value of x. That’s the power of abstraction. To give a different example, the programmers of AlphaGo could not possibly anticipate all the strategies it came up with or all the life-and-death patterns it discovered. That wasn’t a problem for them either.
You wrote earlier: “the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally”.
My claim is that good-enough algorithms for “adding more and more detail incrementally” will also incidentally (by default) be algorithms that seize control of their off-switches.
And the reason I put a lot of weight on this claim is that I think the best algorithms for “adding more and more detail incrementally” may be algorithms that are (loosely speaking) “trying” to understand and/or predict things, including via metacognition and instrumental reasoning.
OK, then the way I’m currently imagining you responding to that would be:
My model of Vanessa: We’re hopefully gonna find a learning algorithm with a provable regret bound (or something like that). Since seizing control of the off-switch would be very bad according to the objective function and thus violate the regret bound, and since we proved the regret bound, we conclude that the learning algorithm won’t seize control of the off-switch.
(If that’s not the kind of argument you have in mind, oops sorry!)
Otherwise: I feel like that’s akin to putting “the AGI will be safe” as a desideratum, which pushes “solve AGI safety” onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That’s perfectly fine, and indeed precisely defining “safe” is very useful. It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem. (Also, if we divide the problem this way, then “we can’t find a provably-safe AGI design” would be re-cast as “no human-level learning algorithms satisfy the desiderata”.)
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like “I can prove that AlphaGo plays Go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren’t very good at playing Go.” If you weren’t thinking of performance guarantees, what “formal guarantees” are you thinking of?
(For what little it’s worth, I’d be a bit surprised if we get a safety guarantee via a performance guarantee. It strikes me as more promising to reason about safety directly—e.g. “this algorithm won’t seize control of the off-switch because blah blah incentives blah blah mesa-optimizers blah blah”.)
It’s only a problem if we also claim that the “find a learning algorithm that satisfies the desiderata” part is not an AGI safety problem.
I never said it’s not a safety problem. I only said that a lot of progress on this can come from research that is not very “safety-specific”. I would certainly work on it if “precisely defining safe” were already solved.
That’s also where I was coming from when I expressed skepticism about “strong formal guarantees”. We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge.
Yes, we don’t have these things. That doesn’t mean these things don’t exist. Surely all research is about going from “not having” things to “having” things? (Strictly speaking, it would be very hard to literally have a performance guarantee about the brain, since the brain doesn’t have to be anything like a “clean” implementation of a particular algorithm. But that’s beside the point.)
Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled “Lemma 1: This learning algorithm will be aligned”.
The reason I think that is that (as above) I expect the learning algorithms in question to be kinda “agential”, and if an “agential” algorithm is not “trying” to perform well on the objective, then it probably won’t perform well on the objective! :-)
If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we’ve already solved the alignment problem. So the performance guarantee would be beside the point (on this view).
I don’t understand what Lemma 1 is if it’s not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.
Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as “goals”, and then makes plans to advance those “goals”. (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is “aligned” if the things flagged as “goals” do in fact correspond to maximizing the objective function (e.g. “predict the human’s outputs”), or are at least as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.
Making that definition better and rigorous would be tricky because it’s hard to talk rigorously about symbol-grounding, but maybe it’s not impossible. And if so, I would say that this is a definition of “aligned” which looks nothing like a performance guarantee.
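(Here’s a minimal toy sketch of that definition, just to pin down the shape of the claim. Everything in it is hypothetical, including the crude token-overlap stand-in for a real symbol-grounding measure:)

```python
# Toy version of the "flagged goals" picture: a world-model is a set of
# candidate features, one of which the algorithm flags as its "goal".
# The (hypothetical) alignment condition: the flagged feature matches the
# true objective at least as well as anything else in the world-model,
# and this remains true after the world-model is refined.

def similarity(feature, objective):
    # crude stand-in for a symbol-grounding measure: shared-word count
    return len(set(feature.split()) & set(objective.split()))

def is_aligned(world_model, flagged_goal, objective):
    best = max(similarity(f, objective) for f in world_model)
    return similarity(flagged_goal, objective) == best

objective = "predict the human outputs"
world_model = ["predict the human outputs",
               "maximize reward signal",
               "model the human"]
assert is_aligned(world_model, "predict the human outputs", objective)

# The condition must survive refinement of the world-model:
refined = world_model + ["predict the human outputs accurately"]
assert is_aligned(refined, "predict the human outputs", objective)
```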
OK, hmmm, after some thought, I guess it’s possible that this definition of “aligned” would be equivalent to a performance-centric claim along the lines of “asymptotically, performance goes up not down”. But I’m not sure that it’s exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:
We prove that the algorithm is aligned (in the above sense) via “direct reasoning about alignment” (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee.
We prove that the algorithm satisfies the asymptotic performance guarantee via “direct reasoning about performance”, and then a corollary of that proof would be that the algorithm is aligned (in the above sense).
I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it’s solvable at all, it would only be solvable with the help of various specific “alignment-promoting algorithm features”, and we won’t be able to prove that those features work except by “direct reasoning about alignment”.
The way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the “inner MDP” of a particular “metastate” effectively function as actions w.r.t. the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states[1]. Hopefully it is also possible to efficiently learn this type of hypothesis.
I don’t think that anywhere in there we will need a lemma saying that the algorithm picks “aligned” goals.
[1] For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on “child” MDPs to actions of the “parent” MDP that are compatible with the transition kernel.
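(A minimal numerical sketch of the quasi-stationary picture, with two inner states per metastate and made-up numbers: the stationary distribution induced by the inner policy determines the effective slow transition rate to the other metastate, so the inner policy functions as an action one level up:)

```python
import numpy as np

# Inner dynamics within a "metastate" mix fast; each inner state also has
# a small exit probability to the other metastate (slow timescale). The
# quasi-stationary distribution induced by an inner policy sets the
# effective metastate-to-metastate transition rate.

def stationary(P):
    # stationary distribution of a stochastic matrix P (left eigenvector
    # for eigenvalue 1, normalized to sum to 1)
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

# Inner chains of metastate A under two hypothetical inner policies:
# policy 1 lingers in inner state 0, policy 2 lingers in inner state 1.
P_policy1 = np.array([[0.95, 0.05],
                      [0.50, 0.50]])
P_policy2 = np.array([[0.50, 0.50],
                      [0.05, 0.95]])

# Exit probability to metastate B from each inner state (made up).
exit_prob = np.array([0.001, 0.010])

rates = []
for P in (P_policy1, P_policy2):
    pi = stationary(P)
    rates.append(float(pi @ exit_prob))  # effective A -> B rate

# Policy 1 stays mostly in the low-exit state, so it "chooses" a much
# slower effective transition to B than policy 2 does.
```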