In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that’s more or less equivalent to some standard data structure that’s used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to “value”?
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
Representations of value would include things like “In situations with character {x} the user mostly cares about {y}, but that might change if you were able to influence any of {z}” where z includes things like “influence {the amount of morally relevant experience} in a way that is {significant}”, where the {}’s refer to large subtrees, that encapsulate all of the facts you would currently use in assessing whether something affects the amount of morally relevant conscious experience, all of the conditions under which you would change your views about that, etc.
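As a rough illustration of the "template with large subtrees" idea above, here is a minimal Python sketch. The `Node` class, its `render` and `size` methods, and the slot names `amount` and `degree` are all hypothetical and chosen only for this example; nothing here is taken from an actual meta-execution implementation. The slots {x} and {y} are deliberately left unfilled because the comment does not say what they contain.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Matches a simple named slot such as {x} or {amount}.
SLOT = re.compile(r"\{(\w+)\}")


@dataclass
class Node:
    """A message template whose {slots} point at further subtrees (hypothetical)."""
    template: str
    slots: Dict[str, "Node"] = field(default_factory=dict)

    def render(self, depth: int) -> str:
        """Expand filled slots down to `depth` levels; anything deeper stays as {name}."""
        if depth <= 0:
            return self.template

        def expand(match: "re.Match[str]") -> str:
            child = self.slots.get(match.group(1))
            if child is None:
                return match.group(0)  # unfilled slot: leave it untouched
            return "{" + child.render(depth - 1) + "}"

        # Single pass over the original template, so expanded text is never rescanned.
        return SLOT.sub(expand, self.template)

    def size(self) -> int:
        """Number of nodes in this subtree, counting this node."""
        return 1 + sum(child.size() for child in self.slots.values())


# In a real representation each leaf would itself be a large subtree of facts
# and conditions; short phrases stand in for them here.
z = Node(
    "influence {amount} in a way that is {degree}",
    {
        "amount": Node("the amount of morally relevant experience"),
        "degree": Node("significant"),
    },
)
value = Node(
    "In situations with character {x} the user mostly cares about {y}, "
    "but that might change if you were able to influence any of {z}",
    {"z": z},
)

print(value.render(0))  # top-level template only
print(value.render(2))  # {z} expanded two levels deep
```

The point of the `depth` argument is that a reader of the top-level message only sees pointers to the subtrees and has to ask for more detail explicitly, which is one way to read "the {}’s refer to large subtrees".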
What about the other part of my question, the case of just one “aligned” H, doing the same thing that the unaligned AI would do?
If I implement a long computation, that computation can be unaligned even if I am aligned, for exactly the same reason.
You’re saying that alignment by itself isn’t preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?
We could either strengthen the inductive invariant (as you’ve suggested) or change the structure of amplification.
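The two options can be written as an induction sketch. The notation is an assumption made for illustration and does not appear in the thread: $A_n$ is the agent after $n$ rounds, $\mathrm{Amplify}$ is the amplification step, $\mathrm{Amplify}'$ is a restructured version of it, and distillation is omitted for brevity.

```latex
% The problem: alignment alone is not an inductive invariant.
\mathrm{Aligned}(H) \wedge \mathrm{Aligned}(A_n)
  \;\not\Rightarrow\; \mathrm{Aligned}\big(\mathrm{Amplify}(H, A_n)\big)

% Option 1: strengthen the invariant with some currently unknown property X.
\mathrm{Aligned}(A_n) \wedge X(A_n)
  \;\Rightarrow\; \mathrm{Aligned}(A_{n+1}) \wedge X(A_{n+1}),
  \qquad A_{n+1} = \mathrm{Amplify}(H, A_n)

% Option 2: keep the invariant and change the structure of amplification.
\mathrm{Aligned}(A_n)
  \;\Rightarrow\; \mathrm{Aligned}\big(\mathrm{Amplify}'(H, A_n)\big)
```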
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
I guess that’s because in the case of “value”, our standard theory of value (i.e., expected utility theory) is missing things that your approach to alignment needs, e.g., ways to represent instrumental values, uncertainty about values, how views can change in response to conditions, how to update these things based on new information, and how to make use of such representations to make decisions. Do you think (1) we need to develop such a theory so H can have it in mind while training IDA, (2) meta-execution can develop such a theory prior to trying to learn/represent the user’s values, or (3) we don’t need such a theory and meta-execution can just use improvised representations of value and improvise how to update them and make decisions based on them? From your next paragraph as well as a previous comment (‘My scheme doesn’t have any explicit dependence on “human values,” or even involve the AI working with an explicit representation of what it values’) I guess you’re thinking (3)?
If meta-execution does start out doing (3), I imagine at some point it ought to think something like “I can’t just blindly trust these improvisations. I need to develop a theory of value to figure out if I’m doing the right things, or if I should switch to doing something more trustworthy, like use representations of value that are more amenable to theoretical analysis.” So in general it seems that in the short run we need IDA to learn object level skills from H, that H either knows (in the case of linguistics) or can improvise (in the case of value learning), but in the long run we need it to reinvent these skills from first principles plus external data.
If this is what you’re thinking, I think you should talk about it explicitly at some point, because it implies that we need to think differently about how IDA works in the short run and how it works in the long run, and can’t just extrapolate its performance from one regime to the other. It suggests that we may run into a problem where IDA isn’t doing the right thing in the short run (for example the improvised value learning it learned doesn’t actually work very well) and we can’t afford to wait until the long run kicks in. Alternatively IDA may work well in the short run but run into trouble when it tries to reinvent skills it previously learned from H.
I’m imagining (3). In this respect, our AI is in a similar position to ours. We have informal representations of value etc.; over time we expect to make those more formal, and in the interim we do as well as we can. Similar things happen in many domains. I don’t think there is a particular qualitative change between short and long run.
I’m not sure it makes sense to talk about “qualitative change”. (It seems hard to define what that means.) I’d put it as, there is a risk of the AI doing much worse than humans with regard to value learning in the short run, and independently there’s another risk of the AI doing much worse than humans with regard to value learning in the long run.
In the short run the risk comes from the fact that IDA will likely lack a couple of things we have. We have a prior on human values that we partly inherited and partly learned over our lifetimes. We also have a value learning algorithm which we don’t understand (because for example it involves changing synapse weights in response to experience, and we don’t know how that works). The value learning scheme we improvise for IDA may not work nearly as well as what we natively have.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
Again my point is that these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other, which is not clear from your previous writings.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
We aren’t committed to using IDA to solve this long-run problem, IDA is free to admit ignorance or delegate to a different process.
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common. I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on. (Even having this conversation involves abilities as well as alignment, but at that point we are getting into “easy” abilities where I don’t have significant concerns.)
these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML. For example, maybe we don’t have a good explicit understanding of (or good training set for):
Negotiating treaties, identifying win-win compromises, making trades.
Making laws, governing.
Anticipating problems 5 years out.
Identifying which people might do something really dangerous.
Solving the AI alignment problem.
Do you see the two risks you mentioned as two distinguished risks, different in kind from the others?
We aren’t committed to using IDA to solve this long-run problem, IDA is free to admit ignorance or delegate to a different process.
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that it should override IDA via corrigibility.)
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common.
Yes, I think it would be helpful to make it clearer what the strategic landscape will look like, under the assumption that IDA works out more or less the way you hope. That wasn’t very clear to me, hence my line of thinking/questioning in this thread.
I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on.
Hmm, this supposes that IDA knows the limits of its own abilities, but it’s not clear how an overseer who improvises a value learning scheme for example is supposed to know what its limits are, given the lack of theory behind it.
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML.
I guess it’s not one independent risk per human ability, but one per AI substitute for human ability. For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well. For example if large-scale consequentialism works well then that would solve making laws, governing, and anticipating problems 5 years out, so those aren’t really independent risks.
Value learning and metaphilosophy are distinguished as human abilities since they each need their own AI substitutes (and therefore constitute independent risks), and also they’re necessary for two of the main AI substitutes (namely consequentialism and applying metaphilosophy) to work, so the impact of not being competent in them seems especially high.
(The above two paragraphs may be unclear/confusing/wrong since they are fresh thinking prompted by your question. Also I’m not sure I addressed what you’re asking about because I’m not sure what your motivation for the question was.)
For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well.
I don’t see why this is the case. Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that it should override IDA via corrigibility.)
I agree that “recognizing when you are wrong” may itself be a hard problem. But I don’t think you should predict a simple systematic error like being overconfident. I’m not quite sure what long-term error you have in mind, but overall it seems like if the short-term behavior works out then the long-term behavior isn’t that concerning (since reasonable short-term behavior needs to be sophisticated enough to e.g. avoid catastrophic overconfidence).
Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
By “work well” I meant that the AI doesn’t take too long to rederive human heuristics (or equally good ones) compared to the speed of other intellectual progress. That seems hopeful because for a lot of those abilities there’s no reason to expect that human evolution would have optimized for them extra hard relative to other abilities (e.g., making laws for a large society is not something that would have been useful in the ancestral environment). To the extent that’s not true (perhaps for deal making, for example) that does seem like an independent risk.
I also think with value learning, the improvised value learning may not converge to what a human would do (or to what a human would/should converge to), so it’s also not the same situation in that regard.
I’m not quite sure what long-term error you have in mind
For example the AI makes changes to its value learning scheme that worsen it over time, or fails to find improvements that it can be confident in, or makes the value learning better but too slowly (relative to other intellectual progress), or fails to converge to what the right value learning algorithm is, and it fails to realize that it’s doing these things or doesn’t know how to correct them.