I was previously doing a Ph.D. in ML at the International Max-Planck research school in Tübingen, worked part-time with Epoch and did independent AI safety research.
TL;DR: At least in my experience, AISC was pretty positive for most participants I know and it’s incredibly cheap. It also serves a clear niche that other programs are not filling and it feels reasonable to me to continue the program.
I’ve been a participant in the 2021/22 edition. Some thoughts that might make it easier to decide for funders/donors.
1. Impact-per-dollar is probably pretty good for the AISC. It’s incredibly cheap compared to most other AI field-building efforts and scalable.
2. I learned a bunch during AISC and I did enjoy it. It influenced my decision to go deeper into AI safety. It was less impactful than e.g. MATS for me but MATS is a full-time in-person program, so that’s not surprising.
3. AISC fills a couple of important niches in the AI safety ecosystem in my opinion. It’s online and part-time which makes it much easier to join for many people, it implies a much lower commitment which is good for people who want to find out whether they’re a good fit for AIS. It’s also much cheaper than flying everyone to the Bay or London. This also makes it more scalable because the only bottleneck is mentoring capacity without physical constraints.
4. I think AISC is especially good for people who want to test their fit but who are not super experienced yet. This seems like an important function. MATS and ARENA, for example, feel like they target people a bit deeper into the funnel with more experience who are already more certain that they are a good fit.
5. Overall, I think AISC is less impactful than e.g. MATS even without normalizing for participants. Nevertheless, AISC is probably about ~50x cheaper than MATS. So when taking cost into account, it feels clearly impactful enough to continue the project. I think the resulting projects are lower quality but the people are also more junior, so it feels more like an early educational program than e.g. MATS.
6. I have a hard time seeing how the program could be net negative unless something drastically changed since my cohort. In the worst case, people realize that they don’t like one particular type of AI safety research. But since you chat with others who are curious about AIS regularly, it will be much easier to start something that might be more meaningful. Also, this can happen in any field-building program, not just AISC.
7. Caveat: I have done no additional research on this. Maybe others know details that I’m unaware of. See this as my personal opinion and not a detailed research analysis.
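The cost argument in point 5 can be made explicit with a rough break-even calculation. This is only a sketch: $I$ and $C$ are hypothetical symbols for the total impact and total cost of each program, and the factor of 50 is the rough guess from point 5, not a measured number.

```latex
\[
\frac{I_{\text{AISC}} / C_{\text{AISC}}}{I_{\text{MATS}} / C_{\text{MATS}}}
= \frac{I_{\text{AISC}}}{I_{\text{MATS}}} \cdot \frac{C_{\text{MATS}}}{C_{\text{AISC}}}
\approx 50 \cdot \frac{I_{\text{AISC}}}{I_{\text{MATS}}}
\]
```

Under these assumptions, AISC comes out ahead per dollar whenever its impact is more than roughly 1/50 (about 2%) of that of MATS.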
I feel like both of your points are slightly wrong, so maybe we didn’t do a good job of explaining what we mean. Sorry for that.
1a) Evals aim both to show existence proofs, e.g. demos, and to inform some notion of an upper bound. We did not intend to rank one above the other with the post. Both matter and both should be subject to more rigorous understanding and processes. I’d be surprised if the way we currently do demonstrations could not be improved by better science.
1b) Even if you claim you just did a demo or an existence proof and explicitly state that this should not be seen as evidence of absence, people will still see the absence of evidence as negative evidence. I think the “we ran all the evals and didn’t find anything” sentiment will be very strong, especially when deployment depends on not failing evals. So you should deal with that problem from the start IMO. Furthermore, I also think we should aim to build evals that give us positive guarantees if that’s possible. I’m not sure it is possible but we should try.
1c) The airplane analogy feels like a strawman to me. The upper bound is obviously not on explosivity; it would be a statement like “Within this temperature range, the material the wings are made of will break once in 10M flight miles on average” or something like that. I agree that airplanes are simpler and less high-dimensional. That doesn’t mean we should not try to capture most of the variance anyway, even if it requires more complicated evals. Maybe we realize it doesn’t work and the variance is too high, but this is why we diversify agendas.
2a) The post is primarily about building a scientific field, and that field then informs policy and standards. A great outcome of the post would be if more scientists did research on this. If this is not clear, then we miscommunicated. The point is to get more understanding so we can make better predictions. These predictions can then be used in the real world.
2b) It really is not meant as more “we need to find standardised numbers to measure so we can talk to serious people” and less “let’s try to solve that thing where we can’t reliably predict much about our AIs”. If that was the main takeaway, I think the post would be net negative.
3) But the optimization requires computation? For example, if you run 100 forward passes for your automated red-teaming algorithm with model X, that requires Y FLOP of compute. I’m unsure where the problem is.
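To make the "100 forward passes cost Y FLOP" point concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb that a forward pass of a dense transformer costs about 2 FLOP per parameter per token, and it ignores the extra cost of attention over long contexts. The function names, the 7B model size and the token count are hypothetical and only there for illustration.

```python
def forward_pass_flop(n_params: float, n_tokens: int) -> float:
    """Approximate FLOP for one forward pass of a dense transformer.

    Assumption: ~2 FLOP per parameter per processed token.
    """
    return 2.0 * n_params * n_tokens


def red_teaming_flop(n_params: float, tokens_per_pass: int, n_passes: int) -> float:
    """Total FLOP spent by an elicitation procedure that runs n_passes forward passes."""
    return n_passes * forward_pass_flop(n_params, tokens_per_pass)


if __name__ == "__main__":
    # Hypothetical example: 7B-parameter model, 1,000 tokens per pass, 100 passes.
    total = red_teaming_flop(n_params=7e9, tokens_per_pass=1_000, n_passes=100)
    print(f"Estimated elicitation compute: {total:.2e} FLOP")
```

With an estimate like this, the elicitation effort behind an eval can be reported in the same unit as training compute, which is what makes it comparable across methods.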
Not quite sure tbh. 1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant. 2. I’m not sure how true your suggestion is but I haven’t tried it a lot empirically. But this is exactly the kind of stuff I’d like to have some sort of scaling law or rule for. It points exactly at the kind of stuff I feel like we don’t have enough confidence in. Or at least it hasn’t been established as a standard in evals.
I somewhat agree with the sentiment. We found it a bit hard to scope the idea correctly. Defining subcategories as you suggest and then diving into each of them is definitely on the list of things that I think are necessary to make progress on them.
I’m not sure the post would have been better if we had used a narrower title, e.g. “We need a science of capability evaluations”, because the natural question then would be “But why not for propensity tests or for this other type of eval?” I think the broader point of “when we do evals, we need some reason to be confident in the results no matter which kind of eval” seems to be true across all of them.
I think this post was a good exercise to clarify my internal model of what I expect the world to look like with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won’t play out exactly like that, but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world (but not more than 5-10 years), humans giving more and more power into the hands of AI, infighting within the AI safety community, and many more of the predictions made in this post.
The main thing I now disagree with (as I already updated in April 2023) is the timelines underlying the post, which are too long. I now think almost everything is going to happen in at most half of the time presented in the post, e.g. many events in the 2030-2040 section may already happen before 2030.
In general, I can strongly recommend taking a weekend or so to write a similar story for yourselves. I felt like it made many of the otherwise fairly abstract implications of timeline and takeoff models much more salient to me and others who are less in the weeds with formal timeline / takeoff models.
I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore’s law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore’s law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware
If anything, my disagreements have become stronger and my timelines have become shorter over time. Even the aggressive model I present in the post seems too conservative for my current views and my median date is 2030 or earlier. I have substantial probability mass on an AI that could automate most current jobs before 2026 which I didn’t have at the time of writing.
I also want to point out that Daniel Kokotajlo, with whom I spent some time talking about bio anchors and Tom Davidson’s takeoff model, seemed to have consistently better intuitions than me (or anyone else I’m aware of) on timelines. The jury is still out, but so far it looks like reality follows his predictions more than mine. At least in my case, I updated significantly toward shorter timelines multiple times due to arguments he made.
I think I still mostly stand behind the claims in the post, i.e. nuclear is undervalued in most parts of society but it’s not as much of a silver bullet as many people in the rationalist / new liberal bubble would make it seem. It’s quite expensive and even with a lot of research and de-regulation, you may not get it cheaper than alternative forms of energy, e.g. renewables.
One thing that bothered me after the post is that Johannes Ackva (who’s arguably a world-leading expert in this field) and Samuel and I just didn’t seem to be able to communicate where we disagree. He expressed that he thought some of our arguments were wrong but we never got to the crux of the disagreement.
After listening to his appearance on 80k: https://80000hours.org/podcast/episodes/johannes-ackva-unfashionable-climate-interventions/ I feel like I understand the core of the disagreement much better (though I never confirmed with Johannes). He mostly looks at energy through a lens of scale, neglectedness, and tractability, i.e. he’s looking to investigate and push interventions that are most efficient on the margin. On the margin, nuclear seems underinvested and lots of reasonable options are underexplored (e.g. large-scale production of smaller reactors). Both Samuel and I would agree with that. However, the claim we were trying to make in the post was that nuclear is already more expensive than renewables and this gap will likely just increase in the future. Thus, it makes sense to, in total, invest more in renewables than nuclear. Also, there were lots of smaller things where I felt like I understood his position much better after listening to the podcast.
In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading.
In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore’s law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth is probably faster than Moore’s law.
The main reason is that FP32 is just not the right thing to look at in modern ML, and we even knew this at the time of writing, i.e. it ignores tensor cores and lower precisions like FP16 or INT8.
I’m a little worried that people who read this post but don’t have any background in ML got the wrong takeaway from the post and we should have emphasized this difference even more at the time. We have written a follow-up post about this recently here: https://epochai.org/blog/trends-in-machine-learning-hardware I feel like the new post does a better job at explaining where compute progress comes from.
I haven’t talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out “we’re building an entity that is smarter than us but we don’t know how to control it” is quite intuitively scary. As you would expect, most people still don’t update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).
At Apollo, we have spent some time weighing the pros and cons of the for-profit vs. non-profit approach so it might be helpful to share some thoughts.
In short, I think you need to make really sure that your business model is aligned with what increases safety. I think there are plausible cases where people start with good intentions but with insufficient alignment between the business model and the safety research that would be the most impactful use of their time, so that these two goals diverge over time.
For example, one could start as an organization that builds a product but merely as a means to subsidize safety research. However, when they have to make tradeoffs, these organizations might choose to focus more talent on product because it is instrumentally useful or even necessary for the survival of the company. The forces that pull toward profit (e.g. VCs, status, growth) are much more tangible than the forces pulling towards safety. Thus, I could see many ways in which this goes wrong.
A second example: Imagine an organization that builds evals and starts with the intention of evaluating the state-of-the-art models because they are most likely to be risky. Soon they realize that there are only a few orgs that build the best models and there are a ton of customers that work with non-frontier systems who’d be willing to pay them a lot of money to build evals for their specific application. Thus, the pull toward doing less impactful but plausibly more profitable work is stronger than the pull in the other direction.
Lastly, one thing I’m somewhat afraid of is that it’s very easy to rationalize all of these decisions in the moment. It’s very easy to say that a strategic shift toward profit-seeking is instrumentally useful for the organization, growth, talent, etc. And there are cases in which this is true. However, it’s easy to continue such a rationalization spree and maneuver yourself into some nasty path dependencies. Some VCs only came on for the product, some hires only want to ship stuff, etc.
In conclusion, I think it’s possible to do profitable safety work but it’s hard. You should be confident that your two goals are compatible when things get hard, you should have a team and culture that can resist the pulls and even produce counter pulls when you’re not doing safety-relevant work and you should only work with funders who fully understand and buy into your true mission.
Fully agree that this is a problem. My intuition is that the self-deception part is much easier to solve than the “how do we make AIs honest in the first place” part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can’t rely on the selection mechanisms because the AI games them.
We considered alternative definitions of DA in Appendix C.
We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):
“An AI is deceptively aligned when it is strategically deceptive about its misalignment”
Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities.
For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s hard to map this situation to deception about misalignment.
Problem 2: There are cases where the deception itself is the misalignment, e.g. when the AI strategically lies to its designers, it is misaligned but not necessarily deceptive about that misalignment.
For example, a personal assistant AI deletes an incoming email addressed to the user that would lead to the user wanting to replace the AI. The misalignment (deleting an email) is itself strategic deception, but the model is not deceiving about its misalignment (unless it engages in additional deception to cover up the fact that it deleted an email, e.g. by lying to the user when asked about any emails).
Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I’d not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it’s far away from the top of our priority list. There is a good chance we will never run it. If someone else wants to pick this up, feel free to take it on.
(personal opinion; might differ from other authors of the post)
Thanks for both questions. I think they are very important.
1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I’d say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn’t reason through that, I’d probably not classify it as such. I think it’s fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.
2. There are multiple perspectives on the non-strategic, non-intentional deception and I haven’t thought about it in detail, so I might change my mind. However, currently, I think the following: a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely accidentally become systematically deceptive. I think there is still a chance that these things will happen and I’m happy that ARC and others are working on them, but I personally think it has become less important compared to intentional, strategic deception. b) It just seems really hard to be I) non-intentionally, II) but systematically deceptive, while III) humans are actively looking for signs of deception, and IV) actually cause catastrophic harm with that strategy. As soon as a model has sufficient situational awareness and is generally capable enough to avoid human oversight, I feel like it may just realize what it’s doing and then the deception either stops or becomes intentional. (weakly held, haven’t thought about it in detail)