[Question] I’m planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety.
What do you want to hear my nuanced takes on?
- What sorts of skills an AI would need to achieve powerbase ability
- Timelines till powerbase ability
- Takeoff speeds
- How things are going to go by default, alignment-wise (e.g. will it be as described in this?)
- How things are going to go by default, governance-wise (e.g. will it be as described in this?)
I have two questions I’d love to hear your thoughts about.
1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing or do people work on many unrelated things?
2. What are your thoughts on the various research agendas to solve alignment that exist today? Why do you think they will fall short of their goal? What are you most excited about?
Feel free to talk about any agendas, but I’ll just list a few that come to my mind (in no particular order).
IDA, Debate, Interpretability (I think I read a tweet where you said you are rather skeptical about this), Natural Abstraction Hypothesis, Externalized Reasoning Oversight, Shard Theory, (Relaxed) Adversarial Training, ELK, etc.
No overarching agenda, but I do have some directions I’m more keen on pushing, see: https://www.davidscottkrueger.com/
I’m pessimistic overall, and don’t find any of the agendas particularly promising. I’m a strong proponent of the portfolio approach (https://futureoflife.org/2017/08/17/portfolio-approach-to-ai-safety-research/).
My pessimism means that my success stories usually involve us getting lucky; the main way this can happen is if some aspect of the problem is easier than I expect. From a technical point of view, the main candidates (off the top of my head) are: i) deep learning generalizes super well, so we can just learn preferences well enough, and success mostly becomes a matter of dotting our “i”s and crossing our “t”s, and not fucking up; ii) it’s easy to build powerful AI genies/tools that are deceptive or power-seeking.
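For concreteness on the “learn preferences well enough” branch: in current practice this usually means something like a reward model trained on pairwise comparisons. Below is a minimal, illustrative sketch of that standard Bradley-Terry-style objective; the network architecture and the random “trajectory features” are placeholders, not a claim about what would actually suffice.

```python
# Minimal sketch of pairwise preference learning (Bradley-Terry style reward model).
# The reward network and the comparison data are placeholders for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(preferred, dispreferred):
    """-log sigmoid(r(preferred) - r(dispreferred)), averaged over the batch."""
    r_pref = reward_model(preferred)
    r_disp = reward_model(dispreferred)
    return -F.logsigmoid(r_pref - r_disp).mean()

# Toy training step on random "trajectory features" standing in for real comparisons.
preferred = torch.randn(32, 16)
dispreferred = torch.randn(32, 16)
loss = preference_loss(preferred, dispreferred)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```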
very quick / hot takes on the particular agendas:
- IDA: “not even wrong”. The basic idea of getting AI systems to help you align more powerful AI systems is promising, but raises a chicken and egg problem.
- debate: Unlike IDA, at least it is concrete enough that you can pin it down. I like the complexity analogy, and the idea of pitting AI systems against each other in order to give them the right incentives. I don’t think anybody has shown how it even comes close to solving the problem, though.
- interpretability: worth doing as an extra “safety filter”, but it still seems like a lot of “just-so stories”, and seems like it won’t scale unless we have some conceptual insights into what makes a (simplified) explanation good.
- natural abstraction hypothesis: I listened to Wentworth’s interview (on Daniel Filan’s podcast, I think?). I’ve usually thought about such questions from a more “science of deep learning” point of view. I haven’t looked at the technical details of Wentworth’s work at all, but it seems like a sensible direction. I don’t have a story for how this is supposed to solve alignment. Seems more like agent foundations work; I view such work as unlikely to pay off much in short timelines, but something we definitely want to have made a best effort at, given the time. I view such work as falling into the assurance category; this category of work is important and neglected, but also less valuable if you plan on getting lucky.
- ERO: Don’t know what it is
- Shard Theory: Also don’t know what this is. Is it something about like… underspecification / ontological crisis? Those topics are quite mainstream in machine learning, and IMO, people interested in them should engage fully with ML.
- relaxed adversarial training: I sort of forget what the “relaxed” part refers to. Another topic where it seems like you should go all in on ML if you want to work on it. I think there’s tons of robustness work in ML already, and personally I’m not too excited about it, but a big part of that is because it’s too focused on Lp adversarial robustness, which is a weird cottage industry whose practical relevance and fruitfulness I am pretty skeptical of (for reference, a minimal sketch of that standard Lp recipe follows this list). I think it makes sense to consider a more general notion of adversary. Oh wait, is relaxed adversarial training the one where you try to train a model not to have bad thoughts? That seems like a Hail Mary approach that is more likely to hide problems than solve them. https://twitter.com/DavidSKrueger/status/1565395291410423810
- ELK: plausibly a good framing (I like that it is concrete) for attempting to produce conceptual insights about interpretability. I would really like to see attempts to formalize the problem rigorously.
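For reference on the Lp point above: here is a minimal sketch of the standard L-infinity adversarial training recipe (PGD inner maximization, then training on the perturbed inputs). The model, data, and hyperparameters are placeholders for illustration only; a “more general notion of adversary” would replace the perturbation set, not this loop.

```python
# Minimal sketch of standard L-infinity adversarial training with PGD
# (the "Lp adversarial robustness" setup discussed above). Model and data
# are placeholders; input-range clipping is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def pgd_attack(x, y, eps=0.3, alpha=0.05, steps=10):
    """Inner maximization: find a worst-case perturbation within an L-inf ball."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    return delta.detach()

# Outer minimization: train on the adversarially perturbed inputs.
x, y = torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,))
delta = pgd_attack(x, y)
loss = F.cross_entropy(model(x + delta), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```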
Thanks for your thoughts, really appreciate it.
One quick follow-up question: when you say “build powerful AI tools that are deceptive” as a way of “the problem being easier than anticipated”, how exactly do you mean that? Do you mean that, if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions?
Here are some links to the concepts you asked about.
Externalized Reasoning Oversight: This was also introduced recently: https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for . The main idea is to use chain-of-thought reasoning to oversee the thought processes of your model (assuming that those thought processes are complete and straightforward, and that the output causally depends on them).
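To make that setup concrete, here is a toy sketch of the oversight loop: ask the model to externalize its reasoning, then have an overseer check that reasoning before accepting the answer. The `query_model` call and the keyword-based overseer are hypothetical placeholders, standing in for a real language-model API and a far more capable (human or automated) overseer.

```python
# Toy sketch of externalized reasoning oversight: the model is asked to reason
# step by step in natural language, and an overseer inspects the externalized
# reasoning before the answer is accepted. `query_model` is a hypothetical
# stand-in for an actual language-model call, and the keyword-based overseer is
# only a placeholder for a much more capable overseer.

SUSPICIOUS_PATTERNS = ["ignore the instructions", "the overseer won't notice", "pretend to"]

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; expected to return 'REASONING: ... ANSWER: ...'."""
    raise NotImplementedError

def overseer_approves(reasoning: str) -> bool:
    """Placeholder overseer: flag reasoning containing suspicious phrases."""
    return not any(p in reasoning.lower() for p in SUSPICIOUS_PATTERNS)

def answer_with_oversight(question: str) -> str:
    response = query_model(
        f"Question: {question}\nThink step by step, writing out all of your "
        f"reasoning, then give a final answer.\n"
    )
    reasoning, _, answer = response.partition("ANSWER:")
    if not overseer_approves(reasoning):
        return "[withheld: overseer flagged the externalized reasoning]"
    return answer.strip()
```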
Shard Theory: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values. It was proposed very recently. Their TL;DR is “We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry. ”
Relaxed Adversarial Training: I think the main post is this one https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment . But I really like the short description by Beth (https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability):
“The basic idea of relaxed adversarial training is something like:
A deceptive model may be waiting for an input that strongly indicates it’s in the deployment environment rather than in a test—for example, one that’s computationally hard to produce
This makes generating adversarial examples that trigger the defection very hard
Instead of searching over inputs, we want to search over something like ‘the model’s beliefs about what it has seen’, to see if there’s anything it could have seen that would cause it to defect”
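One way to read the quoted idea as code: rather than optimizing a perturbation of the input, optimize over an internal activation that crudely stands in for “the model’s beliefs about what it has seen”, ascending a score for defection. The two-stage network and the `defection_score` below are placeholders; this only illustrates the shape of the search, not a claim that it is feasible for real models.

```python
# Toy sketch of the "relaxed" search: optimize over an internal activation
# (a crude stand-in for "the model's beliefs about what it has seen") rather
# than over raw inputs, looking for a latent state that triggers defection.
# The two-stage network and the defection score are placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # input -> latent "beliefs"
policy_head = nn.Linear(64, 8)                         # latent -> action logits

def defection_score(action_logits: torch.Tensor) -> torch.Tensor:
    """Placeholder: pretend the last logit corresponds to a 'defect' action."""
    return torch.sigmoid(action_logits[..., -1]).mean()

# Search over latent space for a "belief state" that maximizes defection.
latent = encoder(torch.randn(1, 32)).detach().requires_grad_(True)
opt = torch.optim.Adam([latent], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = -defection_score(policy_head(latent))  # ascend the defection score
    loss.backward()
    opt.step()

if defection_score(policy_head(latent)) > 0.9:
    print("Found a latent 'observation' that would trigger defection.")
```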
ERO is another one of those things that I think has a lot of precedent in ML (there’s a paper I was trying to dig up that uses natural language as the latent space in a variational autoencoder), but it doesn’t look very promising to me because of “steganography everywhere”. Like other approaches to interpretability, it seems worth pursuing, but I also worry that people believe too much in flawed interpretations.
Shard theory sounds like an interesting framing, and again something that a lot of people in ML would already agree with, but I’m not sure what it is supposed to be useful for or what sort of testable predictions it makes. Seems like a perspective worth keeping in mind, but I’m not sure I’d call it a research agenda.
RAT: I don’t see any way to “search over something like ‘the model’s beliefs about what it has seen’”; this seems like a potential sticking point. There’s more foundational research needed to figure out if/when/how we can even ascribe beliefs to a model, etc.
As a general comment, I think a lot of the “agendas” that people discuss here are not very well fleshed out, and the details are crucially important. I’m not even sure whether to call a lot of these ideas “agendas”; to me they seem more like “framings”. It is important to note that the ML community doesn’t publish “framings” except when they can be supported by concrete results (you can sometimes find people giving their perspective or framing on some problem in machine learning in blogs, keynotes, tutorials, etc.). So I think that people here often overestimate the novelty of their perspective. I think it is good to reward people for sharing these things, but given that a lot of other people might have similar thoughts but choose not to share them, I don’t think people here have quite the right attitude towards this. Writing up or otherwise communicating such framings without a strong empirical or theoretical contribution, and expecting credit for the ideas / citations of your work, would be considered “flag planting” in machine learning. Probably the best would be some sort of middle ground.
ERO: I do buy the argument of steganography everywhere if you are optimizing for outcomes. As described here (https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes), outcome-based optimization is an attractor and will make your sub-components uninterpretable. While not guaranteed, I do think that process-based optimization might suffer less from steganography (although only experiments will eventually show what happens). Any thoughts on process-based optimization?
Shard Theory: Yeah, “research agenda” was maybe the wrong term; I was mainly trying to refer to research directions/frameworks.
RAT: Agreed, at the moment this is not feasible.
See above, I don’t have strong views on what to call this; for some things, “research agenda” is probably too strong a word. I appreciate your general comment, which is helpful for better understanding your view on LessWrong vs., for example, peer review. I think you are right to some degree: there is a lot of content that is mostly about framing and does not provide concrete results. However, I think that sometimes the right framing is needed for people to actually come up with interesting results and to make things more concrete. Some examples I like are the inner/outer alignment framing (which I think initially didn’t come with any concrete results), or the recent Simulators post (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators). I think in those cases the right framing helps tremendously to make progress with concrete research afterward. Although I agree that grounded, concrete, and result-oriented experimentation is indeed needed to make concrete progress on a problem. So I do understand your point, and it can feel like flag planting in some cases.
Note: I’m also coming from academia, so I definitely understand your view and share it to some degree. However, I’ve personally come to appreciate some posts (usually by great researchers) that allow me to think about the Alignment Problem in a different way.
I read “Film Study for Research” just the other day (https://bounded-regret.ghost.io/film-study/, recommended by Jacob Steinhardt). In retrospect I realized that a lot of the posts here give a window into the rather “raw & unfiltered thinking process” of various researchers, which I think is a great way to practice research film study.
My understanding is that process-based optimization is just another name for supervising intermediate computations: you can treat anything computed by a network as an “output” in the sense of applying some loss function to it.
So (IIUC), it is not qualitatively different.
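A minimal sketch of that “anything computed is an output” point: attach an auxiliary loss to an intermediate activation exactly as you would to the final output. The network, the choice of supervised layer, and the intermediate target below are placeholders for illustration.

```python
# Minimal sketch of "supervising intermediate computations": attach a loss to an
# intermediate activation exactly as you would to the final output. The network,
# the supervised layer, and the intermediate target are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.randn(16, 10)
final_target = torch.randn(16, 1)
intermediate_target = torch.randn(16, 32)   # stand-in for "process" supervision

hidden = net[1](net[0](x))                  # intermediate computation, treated as an output
output = net[4](net[3](net[2](hidden)))     # final output

loss = F.mse_loss(output, final_target) + 0.5 * F.mse_loss(hidden, intermediate_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```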
That was a typo, should say “are NOT deceptive”
importance / difficulty of outer vs inner alignment
outlining some research directions that seem relatively promising to you, and explaining why they seem more promising than others
I feel like I’m pretty much off outer vs. inner alignment as a framing.
People have had a go at inner alignment, but they keep trying to affect it by taking terms for interpretability, or modeled human feedback, or characteristics of the AI’s self-model, and putting them into the loss function, diluting the entire notion that inner alignment isn’t about what’s in the loss function.
People have had a go at outer alignment too, but (if they’re named Charlie) they keep trying to point to what we want by saying that the AI should be trying to learn good moral reasoning, which means it should be modeling its reasoning procedures and changing them to conform to human meta-preferences, diluting the notion that outer alignment is just about what we want the AI to do, not about how it works.