(1) Has AI safety slowed down?
There haven’t been any big innovations for 6-12 months. At least, that’s how it looks to me. I’m not sure how worrying this is, but I haven’t noticed others mentioning it. I’m hoping to get some second opinions.
Here’s a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn’t we use to get a whole new line-of-attack on the problem every couple months?
By “innovation”, I don’t mean something normative like “This is impressive” or “This is research I’m glad happened”. Rather, I mean something more low-level, almost syntactic, like “Here’s a new idea everyone is talking about”. This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
Imagine that your job was to maintain a glossary of terms in AI safety.[1] I feel like you would’ve been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.
(2) When did AI safety innovation peak?
My guess is Spring 2022, during the ELK Prize era. I’m not sure though. What do you guys think?
(3) What’s caused the slowdown?
Possible explanations:
ideas are harder to find
people feel less creative
people are more cautious
more publishing in journals
research is now closed-source
we lost the mandate of heaven
the current ideas are adequate
Paul Christiano stopped posting
I’m mistaken, innovation hasn’t stopped
something else
(4) How could we measure “innovation”?
By “innovation” I mean non-transient novelty. An article is “novel” if it uses n-grams that previous articles didn’t use, and an article is “transient” if it uses n-grams that subsequent articles didn’t use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram “gradient hacking” doesn’t appear in older articles, but appears often in subsequent articles. See below.
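A minimal sketch of that check, under some illustrative assumptions not taken from the post (a date-sorted list of plain-text articles, simple substring matching for the n-gram, and a hypothetical `min_followups` threshold standing in for “sticks around”):

```python
def is_innovative(ngram, article_idx, dated_articles, min_followups=5):
    """The article at `article_idx` is "novel" w.r.t. `ngram` if no earlier
    article uses the n-gram, and "non-transient" if at least `min_followups`
    later articles do use it.  `dated_articles` is a list of article texts
    sorted by publication date."""
    ngram = ngram.lower()
    article = dated_articles[article_idx].lower()
    earlier = dated_articles[:article_idx]
    later = dated_articles[article_idx + 1:]
    novel = ngram in article and not any(ngram in t.lower() for t in earlier)
    followups = sum(ngram in t.lower() for t in later)
    return novel and followups >= min_followups


# Hypothetical toy corpus: "gradient hacking" first appears in the second
# article and recurs afterwards, so that article counts as an innovation.
corpus = [
    "Risks from learned optimization ...",
    "Gradient hacking is when a deceptively aligned model ...",
    "A follow-up post about gradient hacking ...",
]
print(is_innovative("gradient hacking", 1, corpus, min_followups=1))  # True
```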
Barron et al. (2017) analysed 40,000 parliamentary speeches from the French Revolution. They introduce a metric, “resonance”, which is novelty (the surprise of an article given past articles) minus transience (the surprise of an article given subsequent articles). See below.
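A rough sketch of that metric, assuming smoothed unigram distributions in place of the LDA topic mixtures Barron et al. work with, KL divergence as the surprise measure, and arbitrary choices for the window size and smoothing constant:

```python
import math
from collections import Counter


def unigram_dist(text, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl(p, q):
    """KL divergence D(p || q) for distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def resonance(articles, i, window=3, alpha=0.01):
    """Resonance of article i = novelty - transience: novelty is the mean
    surprise (KL divergence) of article i relative to the preceding `window`
    articles, transience the mean surprise relative to the following `window`
    articles.  `articles` is a date-sorted list of texts."""
    vocab = {w for text in articles for w in text.lower().split()}
    p = unigram_dist(articles[i], vocab, alpha)
    past = articles[max(0, i - window):i]
    future = articles[i + 1:i + 1 + window]
    novelty = sum(kl(p, unigram_dist(t, vocab, alpha)) for t in past) / max(len(past), 1)
    transience = sum(kl(p, unigram_dist(t, vocab, alpha)) for t in future) / max(len(future), 1)
    return novelty - transience
```

Run over a dated corpus of alignment posts, something like this would make the claim below checkable: compare the average resonance of recent posts against earlier years.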
My claim is that recent AI safety research has been less resonant.
Here are 20 random terms that would be in the glossary, to illustrate what I mean:
The approaches that have been attracting the most attention and funding are dead ends.
Also, I’m curious what it is that you consider(ed) AI safety progress/innovation. Can you give a few representative examples?
I’ve added a fourth section to my post. It operationalises “innovation” as “non-transient novelty”. Some representative examples of an innovation would be:
Gradient hacking (Hubinger, 2019)
Simulators (Janus, 2022)
Steering GPT-2-XL by adding an activation vector (Turner et al., 2023)
I think these articles were non-transient and novel.
My notion of progress is roughly: something that is either a building block for The Theory (i.e. marginally advancing our understanding) or a component of some solution/intervention/whatever that can be used to move probability mass from bad futures to good futures.
Re the three you pointed out: simulators I consider a useful insight, gradient hacking probably not (10% < p < 20%), and activation vectors I put in the same bin as RLHF, whatever the appropriate label for that bin is.
Thanks for the thoughts. I’m still trying to disentangle what exactly I’m pointing at.
I don’t intend “innovation” to mean something normative like “this is impressive” or “this is research I’m glad happened” or anything. I mean something more low-level, almost syntactic, more like “here’s a new idea everyone is talking about”. This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
Like, imagine your job was to maintain a glossary of terms in AI safety. I feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). Do you think this is fair? I’m not sure how worrying this is, but I haven’t noticed others mentioning it.
NB: here are 20 random terms I’m imagining included in the dictionary:
My personal impression is that you are mistaken and innovation has not stopped, but part of the conversation has moved elsewhere. E.g. taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary: free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don’t feel it is high priority to write them up for LW, because they don’t fit into the current zeitgeist of the site, which seems to direct a lot of attention mostly to:
- advocacy
- topics a large crowd cares about (e.g. mech interpretability)
- or topics some prolific and good writer cares about (e.g. people will read posts by John Wentworth)
Hot take, but the community loosely associated with active inference is currently a better place to think about agent foundations; workshops on topics like ‘pluralistic alignment’ or ‘collective intelligence’ have, in total, more interesting new ideas about what was traditionally understood as alignment; and parts of AI safety have gone totally ML-mainstream, with the fastest conversation happening on X.
I remember a point Yampolskiy made on a podcast, arguing for the impossibility of AGI alignment: that as a young field, AI safety had underwhelming low-hanging fruit. I wonder if all of the major low-hanging fruit has been plucked.
I think the explanation that more research is now closed-source, combined with labs/companies making a lot of the alignment progress to date, pretty compactly explains the issue.
Also, you probably won’t hear about most incremental AI alignment progress on LW, for the simple reason that the site would probably be flooded with it, so people will underestimate progress.
Alexander Gietelink Oldenziel does talk about pockets of Deep Expertise in academia, but they aren’t activated right now, so they are so far irrelevant to progress.
Adding another possible explanation to the list:
People may feel intimidated or discouraged from sharing ideas because of ~‘high standards’, or something like a tendency to require strong evidence that a new idea is not another non-solution proposal before putting effort into understanding it.
I have experienced this, but I don’t know how common it is.
I also just recalled that Janus has said they weren’t sure Simulators would be received well on LW. Simulators was cited in another reply here as an instance of a novel idea.
Yep, something like more carefulness and less “playfulness” in the sense of [Please don’t throw your mind away by TsviBT]. Maybe because AI safety is more professionalised nowadays. I don’t know.