Thanks for writing this. I think it’s good to have discussions around these sorts of ideas.
Please, though, let’s not give up on “value alignment,” or, rather, conscience guard-railing, where the artificial conscience is in line with human values.
Sometimes when enough intelligent people declare something’s too hard to even try at, it becomes a self-fulfilling prophecy: most people may give up on it and then of course it’s never achieved. We do want to be realistic, I think, but still put in effort in areas where there could be a big payoff when we’re really not sure if it’ll be as hard as it seems.
This is an excellent point. I do not want to give up on value alignment. And I will endeavor to not make it seem impossible or not worth working on.
However, we also need to be realistic if we are going to succeed.
We need specific plans to achieve value alignment. I have written about alignment plans for likely AGI designs. They look to me like they can achieve personal intent alignment, but are much less likely to achieve value alignment. Those plans are linked here. Having people, you or others, work out how those or other alignment plans could lead to robust value alignment would be a step in having them implemented.
One route to value alignment is having a good person or people in charge of an intent aligned AGI, having them perform a pivotal act, and using that AGI to help design working stable value alignment. That is the best long term success scenario I see.
For reasons I’ve outlined in Requirements for a Basin of Attraction to Alignment and Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, I personally think value alignment is easy, convergent, and “an obvious target”, such that if you built an AGI or ASI that is sufficiently close to it, it will see the necessity/logic of value alignment and actively work to converge to it (or something close to it: I’m not sure the process is necessarily convergent to a single precisely-defined limit, just to a compact region: a question I discussed more in The Mutable Values Problem in Value Learning and CEV).
However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they’re not a public-benefit corporation), and I also don’t think that value alignment is so convergent that order-following aligned AI is impossible to build. So we’re going to need to make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned AI is called “AI that resists malicious use”, while order-following AI is “AI that enables malicious use”. The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it; the latter is being championed by the open-source community, Meta, and A16z. Once “enabling malicious use” includes serious cybercrime, not just naughty stories, I don’t expect this political discussion to last very long: politically, it’s a pretty basic “do you want every-person-for-themself anarchy, or the collective good?” question. However, depending on takeoff speeds, the timeline from “serious cybercrime enabled” to the sort of scenarios Seth is discussing above might be quite short, possibly only on the order of a year or two.
Sorry, I should’ve been clearer: I meant to say let’s not give up on getting “value alignment” figured out in time, i.e., before the first real AGIs (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGIs are, which I think only the most “optimistic” people (e.g., Elon Musk) put as 2 years or less. I hope we have more time than that, but it’s anyone’s guess.
I’d rather that companies/charities start putting some serious funding towards “artificial conscience” work now to try to lower the risks associated with waiting until boxed AGI or intent-aligned AGI come online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGIs in the hands of bad actors either come online first or right on the heels of those of good actors (as due to effective espionage), and there’s just not enough time for the “good AGIs” to figure out how to minimize collateral damage in defending against “bad AGIs.” Either way, I believe we should be encouraging people with moral psychology/philosophy backgrounds who aren’t strongly suited to help make progress on “inner alignment” to be thinking hard about the “value alignment”/“artificial conscience” problem.
Currently, an open-source value-aligned model can be easily modified into a merely intent-aligned model. The alignment isn’t ‘sticky’: it’s easy to remove it without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical/value-aligned to not turn the model into a purely intent-aligned one.
Yes. Good point that LLMs are sort of value aligned as it stands.
I think of that alignment as far too weak to put it in the same category as what I’m speaking of. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.
When they achieve “coherence” or reflection and self-modification, I’d be surprised if their implicit values are good enough to create a good future without further tweaking, once they’re refined into explicit values. Which we won’t be able to do once they’re smart enough to escape our control.
Agreed, “sticky” alignment is a big issue—see my reply above to Seth Herd’s comment. Thanks.
Agreed on all points.
Except that timelines are anyone’s guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I’m not gambling on having more than a few years to get this right.
The other factor you’re not addressing is that, even if value alignment were somehow magically as easy as intent alignment (and I currently think it can’t be in principle), you’d still have people preferring to align their AGIs to their own intent over value alignment.
Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely.
I also agree that people are going to want AGIs aligned to their own intents. That’s why I’d also like to see money being dedicated to research on “locking in” a conscience module in an AGI, preferably at the hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASIs, all bets are off, of course).
I actually see this as the most difficult problem in the AGI general alignment space: not how to align an AGI to anything at all (inner alignment) or what to align an AGI to (“wise” human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but “naive” people) are going to be trying with all their might (and with the near-AGIs they have available to them) to “jailbreak” AGIs.[1] And the problem will be even harder if we need a mechanism to update the “wise” human values, which I think we really should have unless we make the AGIs “disposable.”
[1] To be clear, I’m taking “inner alignment” as being “solved” when the AGI doesn’t try to unalign itself from what its original creator wanted to align it to.
With my current understanding of compute hardware and of the software of various current AI systems, I don’t see a path towards a ‘locked in conscience’ that a bad actor with full control over the hardware/software couldn’t remove. Even chips soldered to a board can be removed/replaced/hacked.
My best guess is that the only approach to having an ‘AI conscience’ be robust to bad actors is to make both the software and hardware inaccessible to the bad actors. In other words, it won’t be feasible for open-weights models, only for closed-weights models accessed through controlled APIs. APIs still allow for fine-tuning! I don’t think we lose utility by having all private uses go through APIs, so long as there isn’t undue censorship on the API.
I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.
Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jailbreaking process. It seems likely that silicon-based GPUs will be the hardware to get us to the first AGIs, but this isn’t an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn’t invalidate your take on things, I think. My not-very-well-researched initial thought was something like this (chips that self-destruct when tampered with).
I envision people having AGI-controlled robots at some point, which may complicate things in terms of having the software/hardware inaccessible to people, unless the robot couldn’t operate without an internet connection, i.e., part of its hardware/software was in the cloud. It’s likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we’d want some kind of self-destructing chip to avoid tampering, even if this ultimately only buys us time until AGI+s/ASIs figure out a way around this.