One-third of final projects involved evals/demos and one-fifth involved mechanistic interpretability, representing a large proportion of the cohort’s research interests.
This doesn’t seem great in terms of pursuing a broad portfolio of approaches / seems to (partially) confirm worries about Goodharting/overfocusing on projects with clearer feedback loops and legibility, to the detriment of more speculative and more neglected agendas.
I don’t see this distribution of research projects as “Goodharting” or “overfocusing” on projects with clear feedback loops. As MATS is principally a program for prosaic AI alignment at the moment, most research conducted within the program should be within this paradigm. We believe projects that frequently “touch reality” often offer the highest expected value in terms of reducing AI catastrophic risk. We principally support non-prosaic, “speculative,” and emerging research agendas for their “exploration value,” which might aid potential paradigm shifts, and to round out our portfolio (i.e., “hedge our bets”).
However, even with this focus on prosaic AI alignment, our Summer 2023 Program supported many emerging or neglected research agendas, including projects in agent foundations, simulator theory, cooperative/multipolar AI (including s-risks), the nascent “activation engineering” approach our program helped pioneer, and the emerging “cyborgism” research agenda.
Additionally, our mentor portfolio is somewhat conditioned on the preferences of our funders. While we largely endorse our funders’ priorities, we are seeking additional funding diversification so that we can support further speculative “research bets.” If you are aware of large funders willing to support our program, please let me know!
What % evals/demos and what % mech interp would you expect to see if there were no Goodharting? 1⁄3 and 1⁄5 don’t seem that high to me, given the value of these agendas and the advantages of touching reality that Ryan named.
Hard to be confident here, but maybe half those numbers or even less (especially for evals/demos)?
If you could choose the perfect portfolio allocation, does it seem reasonable to you that more than half (1⁄3 + 1⁄5 = 8⁄15 ≈ 53%, assuming no overlap) should go to evals/demos and mech interp?
Another point: despite our broad call for mentors, only ~2 individuals whom we did not ultimately decide to support expressed interest in mentorship. It’s possible our outreach could be improved, and I’m happy to discuss in DMs.