to be as corrigible as it rationally, Bayesianly should be
I can’t parse this as a meaningful statement. Corrigibility is about alignment, not a degree of how rational a being is.
Let me try rephrasing that. It accepts proposed updates to its Bayesian model of the world, including the part that specifies its current best estimate of a probability distribution over what utility function (or other model) it ought to use to represent the human values it’s trying to optimize, to the extent that a rational Bayesian should when presented with evidence (where you saying “Please shut down!” is also evidence, though perhaps not very strong evidence).
So the AI can be corrected, but that input channel goes through its Bayesian reasoning engine just like everything else; it is not direct write access to its utility-function distribution. It cannot be freely, arbitrarily ‘corrected’ to anything you want: you actually need to persuade it, with evidence, that it was previously incorrect and should change its mind. As a consequence, if in fact you’re wrong and it’s right about the nature of human values, and it has better evidence for this than you do, then in the ensuing discussion it can tell you so, and the resulting Bayesian update to its internal probability distribution from that conversation will be small.
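To make the mechanics concrete, here is a minimal toy sketch of the kind of update I mean. The two candidate utility-function hypotheses, their prior weights, and the likelihoods assigned to a “Please shut down!” request are all invented for illustration, not part of any real design:

```python
# Toy sketch: Bayesian updating over a (hypothetical) set of candidate utility functions.
# All hypothesis names and numbers below are illustrative assumptions.

def bayes_update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """Posterior over hypotheses, given the likelihood each assigns to the observed evidence."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Current belief about which candidate utility function best represents human values.
prior = {"operator_is_right": 0.4, "ai_current_estimate": 0.6}

# "Please shut down!" treated as evidence: somewhat more likely in worlds where the
# operator's view is correct, but not overwhelmingly so.
shutdown_request_likelihood = {"operator_is_right": 0.8, "ai_current_estimate": 0.5}

posterior = bayes_update(prior, shutdown_request_likelihood)
print(posterior)  # roughly {'operator_is_right': 0.52, 'ai_current_estimate': 0.48}
```

The point of the example is only that the size of the shift is set by how strong the evidence is, not by who is issuing the correction.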
This approach to the problem of corrigibility requires, in order to function, that your AI is a competent Bayesian. So yes, it requires it to be a rational being.
It should presumably also start off somewhat aligned, with its initial Bayesian priors about human values set reasonably well. (One possible source for those might be an LLM, since it encapsulates a lot of information about humans.) These obviously need to be good enough that our value learner starts off inside the “basin of attraction” to human values. Its terminal goal is “optimize human values (whatever those are)”: while that immediately gives it an instrumental goal of learning more about human values, preloading it with a pretty good first approximation of these, held at an appropriate degree of uncertainty, avoids a lot of the more sophomoric failure modes, like not knowing what a human is or what the word “values” means. Since human values are complex and fragile, I would assume that this set of initial-prior data needs to be very large (probably at least gigabytes, if not terabytes or petabytes).
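As a rough illustration of what “an appropriate degree of uncertainty” might mean in practice, here is a small hypothetical sketch: raw plausibility scores for candidate value models (wherever they come from; an LLM-derived scoring is just my assumption here) get converted into a smoothed prior so that nothing starts at exactly zero or one:

```python
import math

# Hypothetical sketch: turn raw plausibility scores for candidate value models into a
# smoothed initial prior. The scores, temperature, and floor are illustrative assumptions.

def initial_prior(raw_scores: dict[str, float], temperature: float = 2.0,
                  floor: float = 1e-6) -> dict[str, float]:
    """Softmax with a temperature plus a probability floor, then renormalize."""
    exps = {h: math.exp(s / temperature) for h, s in raw_scores.items()}
    total = sum(exps.values())
    floored = {h: max(e / total, floor) for h, e in exps.items()}
    norm = sum(floored.values())
    return {h: p / norm for h, p in floored.items()}

scores = {"value_model_A": 3.1, "value_model_B": 1.4, "value_model_C": -2.0}
print(initial_prior(scores))  # a decent first guess, held with explicit residual uncertainty
```

In a real system the hypothesis space would of course be vastly larger and richer than three named models, which is part of why I expect the initial-prior data itself to be large.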
we have zero chance to build competent value learner on first try
You are managing to sound like you have a Bayesian prior of one that a probability is zero. Presumably you actually meant “I strongly suspect that we have a negligibly small chance to build a competent value learner on our first try”. Then I completely agree.
I’m rather curious what I said that made you think I was advocating creating a first prototype value learner and just setting it free, without any other alignment measures?
As an alignment strategy, value learning has the unusual property that it works pretty badly until your AGI starts to become superhuman, and only then does it start to work better than the alternatives. So you presumably need to combine it with something else to bridge the gap around human-level capability, where an AGI is powerful enough to do harm but not yet capable or rational enough to do a good job of value learning.
I would suggest building your first Bayesian reasoner inside a rather strong cryptographic box, applying other alignment measures to it, and giving it much simpler first problems than value learning. Once you are sure it’s good at Bayesianism, doesn’t suffer from any obvious flaws such as ever assigning a prior of zero or one to anything, and can actually demonstrably do a wide variety of STEM projects, then I’d let it try some value learning — still inside a strong box. Iterate until you’re convinced it’s working well, then have other people double-check.
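For the “never assigns a prior of zero or one” check specifically, I imagine something in the spirit of the following toy test (my illustration only, not a real test suite): feed the reasoner a long run of one-sided evidence and confirm its probabilities stay strictly inside (0, 1):

```python
# Toy check, for illustration: repeatedly update on one-sided evidence and verify that
# no hypothesis is ever driven to an exact probability of 0 or 1.

def bayes_update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

belief = {"H1": 0.5, "H2": 0.5}
biased_evidence = {"H1": 0.9, "H2": 0.1}  # every observation favours H1

for _ in range(100):
    belief = bayes_update(belief, biased_evidence)
    assert all(0.0 < p < 1.0 for p in belief.values()), "reasoner assigned a 0/1 probability"

print(belief)  # H1 near (but never exactly) 1, H2 near (but never exactly) 0
```

(A naive floating-point implementation will eventually underflow and fail exactly this kind of test, which is part of why I’d want to check it.)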
However, once it is ready, you are eventually going to need to let it out of the box. At that point, letting out anything other than a Bayesian value learner is, IMO, likely to be a fatal mistake, because it won’t yet have finished learning human values (if that’s even possible). A partially-aligned value learner should have a basin of attraction to alignment; I don’t know of anything else with that desirable property. For that to happen, we need it to be rational, Bayesian, and ‘corrigible’ in my sense of the word: if you think it’s wrong, you can hold a rational discussion with it and expect it to do a Bayesian update if you show it evidence. However, this is an opinion of mine, not a mathematical proof.