Sorry, I should’ve been clearer: I meant to say let’s not give up on getting “value alignment” figured out in time, i.e., before the first real AGI’s (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGI’s are, which I think only the most “optimistic” people (e.g., Elon Musk) put at 2 years or less. I hope we have more time than that, but it’s anyone’s guess.
I’d rather that companies/charities start putting some serious funding towards “artificial conscience” work now, to try to lower the risks associated with waiting until boxed AGI or intent-aligned AGI come online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGI’s in the hands of bad actors come online either first or right on the heels of those of good actors (e.g., due to effective espionage), and there’s just not enough time for the “good AGI’s” to figure out how to minimize collateral damage in defending against “bad AGI’s.” Either way, I believe we should be encouraging people with moral psychology/philosophy backgrounds who aren’t strongly suited to help make progress on “inner alignment” to be thinking hard about the “value alignment”/“artificial conscience” problem.
Currently, an open-source value-aligned model can be easily modified into a merely intent-aligned model. The alignment isn’t ‘sticky’: it’s easy to remove without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical/value-aligned not to turn the model into a purely intent-aligned one.
Yes. Good point that LLMs are sort of value aligned as it stands.
I think of that alignment as far too weak to put it in the same category as what I’m speaking of. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.
When they achieve “coherence” or reflection and self-modification, I’d be surprised if their implicit values are good enough to create a good future without further tweaking, once they’re refined into explicit values. Which we won’t be able to do once they’re smart enough to escape our control.
Agreed, “sticky” alignment is a big issue—see my reply above to Seth Herd’s comment. Thanks.
Agreed on all points.
Except that timelines are anyone’s guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I’m not gambling on having more than a few years to get this right.
The other factor you’re not addressing is that, even if value alignment were somehow magically as easy as intent alignment (and I currently think it can’t be, in principle), you’d still have people preferring to align their AGIs to their own intent over value alignment.
Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely.
I also agree that people are going to want AGI’s aligned to their own intents. That’s why I’d also like to see money being dedicated to research on “locking in” a conscience module in an AGI, preferably at the hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASI’s, all bets are off, of course).
I actually see this as the most difficult problem in the AGI general alignment space: not how to align an AGI to anything at all (inner alignment), nor what to align an AGI to (“wise” human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but “naive” people) are going to be trying with all their might (and with the near-AGI’s they have available to them) to “jailbreak” AGI’s.[1] And the problem will be even harder if we need a mechanism to update the “wise” human values, which I think we really should have unless we make the AGI’s “disposable.”
[1] To be clear, I’m taking “inner alignment” as being “solved” when the AGI doesn’t try to unalign itself from what its original creator wanted to align it to.
With my current understanding of compute hardware and of the software of various current AI systems, I don’t see a path towards a ‘locked in conscience’ that a bad actor with full control over the hardware/software couldn’t remove. Even chips soldered to a board can be removed/replaced/hacked.
My best guess is that the only approach to making an ‘AI conscience’ robust to bad actors is to make both the software and hardware inaccessible to them. In other words, it won’t be feasible for open-weights models, only for closed-weights models accessed through controlled APIs. APIs still allow for fine-tuning! I don’t think we lose utility by having all private uses go through APIs, so long as there isn’t undue censorship on the API.
I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.
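To make the shape of that idea concrete, here is a minimal sketch of the policy layer such an API might sit behind. Everything in it is an assumption for illustration: the category names, the `BLOCKED` table, and the `decide` function are hypothetical, and it presumes some upstream classifier has already labeled each request with a category (which is the genuinely hard part and is not shown).

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Hypothetical request categories assigned by an upstream classifier."""
    WEAPONS_OF_MASS_DESTRUCTION = "wmd"
    PERSONAL_LIFESTYLE = "personal_lifestyle"
    GENERAL = "general"


# Hypothetical policy table: only catastrophic-risk categories are blocked.
# Personal lifestyle content is deliberately absent from this set.
BLOCKED = frozenset({Category.WEAPONS_OF_MASS_DESTRUCTION})


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def decide(category: Category) -> Decision:
    """Return the gateway decision for a request that already has a category label."""
    if category in BLOCKED:
        return Decision(False, f"blocked category: {category.value}")
    return Decision(True, "allowed")
```

For example, `decide(Category.PERSONAL_LIFESTYLE).allowed` is `True`, while `decide(Category.WEAPONS_OF_MASS_DESTRUCTION).allowed` is `False`. The point of keeping the blocked set small and explicit is that the narrower the restrictions, the less pressure there should be to route around the API with open-weights models.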
Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jailbreaking process. It seems likely that silicon-based GPU’s will be the hardware to get us to the first AGI’s, but this isn’t an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn’t invalidate your take on things, I think. My not-very-well-researched initial thought was something like this (chips that self-destruct when tampered with).
I envision people having AGI-controlled robots at some point, which may complicate things in terms of keeping the software/hardware inaccessible to people, unless the robot couldn’t operate without an internet connection, i.e., part of its hardware/software was in the cloud. It’s likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we’d want some kind of self-destructing chip to deter tampering, even if this ultimately only buys us time until AGI+’s/ASI’s figure out a way around it.