For my own sanity, I tried to quickly write down my thoughts about AI alignment and where they actually come from. I know a bunch about various parts of various alignment agendas from reading posts and attending talks, and I can sometimes hold my own in conversations about them. However, a lot of my actual thoughts about how I expect things to turn out are super fuzzy, sound dumb, and are very much based on deferring to people and being very confused.
Why is AI existential risk a big deal?
Big models seem competent at various things. I currently predict that in 2023, I will see more impressive advances in AI than in 2022. I am confused about how exactly strategic awareness will arise or be implemented in AI systems, but it does feel like there are strong incentives to make that and other important capabilities happen. Just because these systems can appear competent (for example, ChatGPT mostly gives good answers to questions in the way OpenAI wants) doesn’t mean they will be aimed at the right thing. I am not sure how exactly agency or aims would arise or be implemented in future systems that share similarities with GPT, but I expect whatever they end up aimed at not to be conducive to human survival and flourishing. I expect failures that make the model look less competent (e.g., it outputs violent text when OpenAI doesn’t want it to, or it fails to bring you coffee when that’s what you want) to get papered over, while the deeper question of whether it actually values what humans would value is one that AGI labs are too confused to ask and resolve. Naively, it seems really hard to point our systems at the right things as they gain more agency and strategic awareness, rather than just applying superficial fixes that make it look like the model is doing what we want.
A lot of my thoughts are also influenced by discussions of deceptive alignment. The arguments make sense to me. However, because of the lack of empirical evidence, I am relying heavily on my intuition that powerful AI systems will be consequentialists (because that’s good for pursuing goals), on the fact that people who have thought about this more continue to be concerned, and on my not having encountered arguments against deceptive alignment being a thing. Under my model, deceptive alignment is _expected_ to happen, not just a worrying possibility, but I feel like thinking and reading more about what future AI systems will concretely look like could change how plausible I find this (in either direction).
Forecasting AI
I feel like people often talk about timelines in confusing ways. Intuitively, it feels to me like we are ~7-30 years away from AGI (with lots of uncertainty), but I don’t know how to gain more evidence to become more confident. I am further confused because a lot of people rely on the bioanchors model as a starting point, whereas to me it is not much evidence for anything besides “it is not implausible we get transformative AI this century”. I expect reasoning forward from existing AI capabilities to shift my intuitions more, but this feels weird and too inside-viewy in a way I don’t feel I have the license to be, because I don’t expect to be able to come up with good explanations for why AGI is near or far.
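(To make concrete to myself what “using bioanchors as a starting point” even means: my rough understanding is that it boils down to a crossover calculation between some anchor for the training compute a transformative model might need and a projection of how affordable training compute grows. Below is a minimal toy sketch of that kind of calculation; every number in it is an illustrative assumption I made up, not an estimate from the actual report.)

```python
# Toy sketch of bioanchors-style reasoning (my rough understanding, not the
# actual model). All numbers are illustrative placeholders, not real estimates.
import math

required_flop = 1e35          # assumed training-compute requirement (anchor-dependent)
flop_today = 1e25             # assumed compute of today's largest training runs
doubling_time_years = 1.0     # assumed doubling time of affordable training compute

# How many doublings until affordable compute crosses the assumed requirement,
# and how long that takes at the assumed doubling time.
doublings_needed = math.log2(required_flop / flop_today)
years_to_cross = doublings_needed * doubling_time_years

print(f"doublings needed: {doublings_needed:.1f}")
print(f"years until crossover under these assumptions: {years_to_cross:.0f}")
```

My understanding is that the real model puts distributions over all of these quantities rather than point estimates, which is part of why its output feels so wide to me.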
About takeoff: when I read Paul on slow takeoff, most of his arguments seem convincing to me. When I read Eliezer on fast takeoff, most of his arguments seem convincing to me. I expect this is one topic where just writing out what I think about the various arguments I’ve read would help me generate some insights and shift my views.
How hard is alignment?
I sometimes hang around with people who are working on alignment and who normally give a p(doom) of around ~25%. They say this is because of their intuitions about how hard a problem like this is to solve and their thoughts about what sorts of things AGI labs will be willing to try and implement. I think my own p(doom) would be higher than that, because I assign higher credence to the problem just being _much_ more difficult than our ability to solve it, for technical as well as coordination reasons. This also depends on other considerations, such as what exactly progress in AI capabilities will look like.
However, the position that alignment is actually not that difficult also sometimes sounds convincing to me, depending on what I am reading or whose talks I am attending. I have some intuitions about why OpenAI’s alignment plan wouldn’t work out, but I unfortunately haven’t thought about this hard enough to explain exactly why while someone red-teams my answers. So I don’t know; it doesn’t seem _that_ implausible to me that we could just get lucky with how hard alignment is.
My current model of many people working on alignment is that they’re trying to solve problems that seem like they would be helpful to solve (e.g., mechanistic interpretability, model evaluations, etc.) but don’t have an alignment _plan_. This is as expected, since having an alignment plan would mean having something we think would actually solve the problem if it worked out, and we don’t seem to be there yet. I think Paul Christiano does have a plan in mind. I am currently trying to understand whatever ARC has put out, because other people I talk to think Paul is very smart and defer to him a lot.
Things I want to do to figure out my own thoughts:
Understand the goals of people working on mechanistic interpretability well enough to have my own thoughts on how I expect the field to progress and how useful I expect it to be
Read ARC’s stuff and form thoughts on whether the alignment problem would be solved if work on that agenda goes really well
Think harder about what progress in AI capabilities looks like and what capabilities come in what order
Figure out why Daniel Kokotajlo relies a bunch on the bioanchors model for _his_ forecasts
Understand John Wentworth’s The Plan to figure out if it is any good
Red-team OpenAI’s safety plan without deferring to other people
Read the 2021 MIRI conversations properly
Write my thoughts more clearly as I do the above
I think the above will be useful. For decision-making purposes, though, I expect I will continue to act by deferring to people who seem reasonable and to whom the people in my circles who seem more knowledgeable and smarter than me defer.
A lot of reasonable AI alignment ideas are only going to be relevant post-singularity, and the changing understanding of timelines keeps reshuffling them out of current relevance. It turns out that LLM human imitations can very likely be very capable on their own, without a separate AGI needed to build them up. AI alignment is so poorly understood that any capable AIs that are not just carefully chosen human imitations are going to be more dangerous; that seems like a solid bet for the next few years before LLMs go full AGI. So alignment concerns about such AIs (the ones that are not just human imitations) should be post-singularity concerns for the human imitations to worry about.
How do we get LLM human imitations?
I meant the same thing as masks/simulacra.
Though currently I’m more bullish about the shoggoths, because masks probably fail alignment security, even though their alignment might be quite robust despite the eldritch substrate.
Oh, and it also feels like some other things could be even more important for me to think about, but I forgot to mention them because I rarely have conversations with people about them, so they feel less salient: things such as s-risks, governance problems that remain even if we solve the technical challenge of alignment, and what conclusions I can draw from the fact that lots of people in alignment disagree pretty deeply with each other.