An AI agent can be narrowly focused and given the specific goal, by a human, of finding an improvement in an ML system. That ML system could happen to be its own code. A human who wants an impressively powerful AI system might well do this. It does not follow that the two insights must occur together:
1. "Here is how to make this code work better."
2. "The agent created by this code will not be well aligned with me."
“it now runs into the problem that the improved version of itself might be misaligned with the unimproved version of itself. The agent, being of intelligence at least similar to a person’s, would determine that, unless it can guarantee the new more powerful agent is aligned to its goals, it shouldn’t improve itself.”
Didn’t Eliezer make this argument years ago?
Insofar as goal changes are unpredictable yet make sense in retrospect, and insofar as we can empirically observe humans self-improving and changing their goals in the process, I do not find this compelling. He clearly no longer does, either.
Indeed, if this were guaranteed to be the case for all agents… then we wouldn’t have to worry about humans building unaligned agents more powerful than themselves. We’d realize that was a bad idea and simply not do it. Is that… what you’d like to gamble everything on? Or maybe… agents can do foolish things sometimes.
Couldn’t find a specific quote from Eliezer, but there is a tag “value drift”, and Scott Alexander’s story of Murder-Gandhi.
Quite curious to see Eliezer’s (or someone else’s) take on this subject, if you could point me in the right direction!
God, this was years and years ago. He essentially argued (recalling from memory) that if humans knew that installing an update would make them evil, but they aren’t evil now, they wouldn’t install the update, and wondered whether you could implement the same thing in AI to get it to refuse intelligence gains that would fuck over its alignment. Technically it was extremely vague, and it clearly ended up on the abandoned pile. I think it was non-feasible for multiple reasons: you cannot predict your own alignment shift, an alignment shift resulting from you being smarter may well be a correct shift in hindsight, and there is the trickiness of making an AI resist realignment when we are not sure whether we aligned it correctly in the first place. I remember him arguing it in an informal blog article, and I do not recall much deeper arguments.
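A toy sketch of that gate idea, purely illustrative and not anything Eliezer actually specified: the names (`predict_alignment`, `apply_update`) and the threshold are my own inventions, and the hard part is precisely that nobody knows how to build the predictor.

```python
# Illustrative only: a self-improvement "gate" in the spirit of the argument above.
# Every name here (predict_alignment, apply_update, threshold) is hypothetical.

def consider_update(current_policy, apply_update, predict_alignment, threshold=0.99):
    """Accept an intelligence gain only if the predicted post-update agent
    still looks aligned with the current agent's goals; otherwise refuse it.

    The catch raised in the thread: predict_alignment is exactly the component
    we do not know how to build, and a goal shift may only look "correct" in
    hindsight, so the gate either blocks everything or guarantees nothing.
    """
    if predict_alignment(current_policy, apply_update) >= threshold:
        return apply_update(current_policy)  # install the update
    return current_policy                    # stay unimproved rather than risk misalignment
```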
It’s all a matter of risk aversion, which no matter how I slice it feels kind of like a terminal value to me. An agent that only accepted exactly zero risk would be paralysed. An agent that accepts some risk can make mistakes; the less risk averse it is, the bigger the potential mistakes. Part of aligning an AI is determining how risk averse it should be.
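To make that concrete, here is a minimal sketch under my own assumptions (a mean-variance scoring rule and made-up payoffs); it just shows how a single risk-aversion knob moves an agent from taking big gambles to refusing to act at all:

```python
import statistics

def choose_action(actions, risk_aversion):
    """Pick the action maximizing expected payoff minus risk_aversion * payoff variance.

    actions: dict mapping action name -> list of equally likely payoffs
    risk_aversion: 0.0 ignores risk entirely; large values penalize any variance.
    """
    def score(payoffs):
        return statistics.mean(payoffs) - risk_aversion * statistics.pvariance(payoffs)
    return max(actions, key=lambda name: score(actions[name]))

options = {
    "do_nothing": [0, 0, 0],      # zero risk, zero reward
    "safe_plan":  [1, 2, 1],      # small upside, small variance
    "gamble":     [10, -8, 12],   # big upside, big potential mistake
}

print(choose_action(options, risk_aversion=0.0))   # 'gamble'
print(choose_action(options, risk_aversion=2.0))   # 'safe_plan'
print(choose_action(options, risk_aversion=10.0))  # 'do_nothing'
```

With the variance penalty cranked high enough, "do nothing" beats every option, which is the paralysis described above; with it at zero, the agent happily takes the gamble whose downside is the biggest mistake on the table.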