Thanks, strong upvote, this is especially clarifying.
Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high certainty paradigm); we may or may not agree on the extent to which it is approximately/weakly alignable.
The weakly alignable baseline should be “marginally better than humans”. Achieving that baseline as an MVP should be an emergency-level, high-priority civilizational project, even if the risk of doom from DL AGI is only 1% (and to be clear, I’m quite uncertain, but it’s probably considerably higher). Ideally we should always have an MVP alignment solution in place.
My thoughts on your last question are probably best expressed in a short post rather than a comment thread, but in summary:
DL methods are based on simple universal learning architectures (e.g. transformers, though AGI will probably be built on something even more powerful). The important properties of the resulting agents are thus much more a function of the data / training environment than of the architecture. You can rather easily limit an AGI’s power by constraining its environment. For example, we have nothing to fear from AGIs trained solely in Atari. We have much more to fear from agents trained by eating the internet. Boxing is stupid, but sim sandboxing is key.
As DL methods are already a success story in partial brain reverse engineering (explicitly in DeepMind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc. in humans, i.e. the approximate alignment solution that evolution found. We can then improve and iterate on that in simulations. I’m somewhat optimistic that it’s no more complex than other major brain systems we’ve already mostly reverse engineered.
The danger of course is that testing and iterating could use enormous resources, past the point where you already have a dangerous architecture that could be extracted. Nonetheless, I think this approach is much better than nothing, and amenable to (potentially amplified) iterative refinement.
Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high certainty paradigm); we may or may not agree on the extent to which it is approximately/weakly alignable.
I don’t know what “strongly alignable”, “robust, high certainty paradigm”, or “approximately/weakly alignable” mean here. As I said in another comment:
There are two problems here:
Problem #1: Align limited task AGI to do some minimal act that ensures no one else can destroy the world with AGI.
Problem #2: Solve the full problem of using AGI to help us achieve an awesome future.
Problem #1 is the one I was talking about in the OP, and I think of it as the problem we need to solve on a deadline. Problem #2 is also indispensable (and a lot more philosophically fraught), but it’s something humanity can solve at its leisure once we’ve solved #1 and therefore aren’t at immediate risk of destroying ourselves.
If you have enough time to work on the problem, I think basically any practical goal can be achieved in CS, including robustly aligning deep nets. The question in my mind is not ‘what’s possible in principle, given arbitrarily large amounts of time?’, but rather ‘what can we do in practice to actually end the acute risk period / ensure we don’t blow ourselves up in the immediate future?’.
(Where I’m imagining that you may have some number of years pre-AGI to steer toward relatively alignable approaches to AGI; and that once you get AGI, you have at most a few years to achieve some pivotal act that prevents AGI tech somewhere in the world from paperclipping the world.)
The weakly alignable baseline should be “marginally better than humans”.
I don’t understand this part. If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world’s existential risk. (Similarly, I think fast-running high-fidelity human emulations are one of the more plausible techs humanity could use to save the world, since you could then do a lot of scarily impressive intellectual work quickly (including work on the alignment problem) without putting massive work into cognitive transparency, oversight, etc.)
I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred. So I’m thinking in terms of ‘what’s the least difficult-to-align act humanity could attempt with an AGI?’.
Maybe you mean something different by “marginally better than humans”?
As DL methods are already a success story in partial brain reverse engineering (explicitly in DeepMind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc. in humans, i.e. the approximate alignment solution that evolution found.
I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’), not a Problem #1 research direction (‘we have a few months to a few years to do this one very concrete AI-developing-a-new-physical-technology task really well’).
For what it’s worth, I’m cautiously optimistic that “reverse-engineering the circuits underlying empathy/love/altruism/etc.” is a realistic thing to do in years, not decades, and can mostly be done in our current state of knowledge (i.e. before we have AGI-capable learning algorithms to play with; basically, I think of AGI capabilities as largely involving learning algorithm development, and of empathy/whatnot as largely involving supervisory signals such as reward functions). I can share more details if you’re interested.
Maybe you mean something different by “marginally better than humans”?
No, I meant “merely as aligned as a human”, which is why I used “approximately/weakly” aligned: the system that mostly aligns humans to humans is imperfect, and is not what I would have assumed you meant by a full Problem #2 type solution.
I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred.
I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’),
Alright, so now I’m guessing the crux is that you believe the DL-based, reverse-engineered human empathy/altruism type of solution I was alluding to (let’s just call it DLA) may take subjective centuries, which suggests that you believe:
That DLA is significantly more difficult than DL AGI in general
That uploading is likewise significantly more difficult
or perhaps
DLA isn’t necessarily super hard, but is irrelevant because non-DL AGI (for which DLA isn’t effective) comes first
Is any of that right?
Sounds right, yeah!