Currently doing alignment and digital minds research @AE Studio
Meta AI Resident ’23, Cognitive science @ Yale ‘22, SERI MATS ’21, LTFF grantee.
Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
Currently doing alignment and digital minds research @AE Studio
Meta AI Resident ’23, Cognitive science @ Yale ‘22, SERI MATS ’21, LTFF grantee.
Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
Thanks for this! Synthetic datasets of the kind you describe do seem like they could have a negative alignment tax, especially to the degree (as you point out) that self-motivated actors may be incentivized to use them anyway if they were successful.
Your point about alignment generalizing farther than capabilities is interesting and is definitely reminiscent of Beren’s thinking on this exact question.
Curious if you can say more about what evopsych assumptions assumptions about human capabilities/values you think are false.
Thanks for writing up your thoughts on RLHF in this separate piece, particularly the idea that ‘RLHF hides whatever problems the people who try to solve alignment could try to address.’ We definitely agree with most of the thrust of what you wrote here and do not believe nor (attempt to) imply anywhere that RLHF indicates that we are globally ‘doing well’ on alignment or have solved alignment or that alignment is ‘quite tractable.’ We explicitly do not think this and say so in the piece.
With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)
The broader point we are making in this post is that the entire world is moving full steam ahead towards more powerful AI whether we like it or not, and so discovering and deploying alignment techniques that move in the direction of actually satisfying this impossible-to-ignore attractor while also maximally decreasing the probability that the “giant wave will crash into human society and destroy it” seem worth pursuing—especially compared to the very plausible counterfactual world where everyone just pushes ahead with capabilities anyways without any corresponding safety guarantees.
While we do point to RLHF in the piece as one nascent example of what this sort of thing might look like, we think the space of possible approaches with a negative alignment tax is potentially vast. One such example we are particularly interested in (unlike RLHF) is related to implicit/explicit utility function overlap, mentioned in this comment.
I think the post makes clear that we agree RLHF is far from perfect and won’t scale, and definitely appreciate the distinction between building aligned models from the ground-up vs. ‘bolting on’ safety techniques (like RLHF) after the fact.
However, if you are referring to current models, the claim that RLHF makes alignment worse (as compared to a world where we simply forego doing RLHF?) seems empirically false.
Thanks, it definitely seems right that improving capabilities through alignment research (negative Thing 3) doesn’t necessarily make safe AI easier to build than unsafe AI overall (Things 1 and 2). This is precisely why if techniques for building safe AI were discovered that were simultaneously powerful enough to move us toward friendly TAI (and away from the more-likely-by-default dangerous TAI, to your point), this would probably be good from an x-risk perspective.
We aren’t celebrating any capability improvement that emerges from alignment research—rather, we are emphasizing the expected value of techniques that inherently improve both alignment and capabilities such that the strategic move for those who want to build maximally-capable AI shifts towards the adoption of these techniques (and away from the adoption of approaches that might cause similar capabilities gains without the alignment benefits). This is a more specific take than just a general ‘negative Thing 3.’
I also think there is a relevant distinction between ‘mere’ dual-use research and what we are describing here—note the difference between points 1 and 2 in this comment.
Just because the first approximation of a promising alignment agenda in a toy environment would not self-evidently scale to superintelligence does not mean it is not worth pursuing and continuing to refine that agenda so that the probability of it scaling increases.
We do not imply or claim anywhere that we already know how to ‘build a rocket that doesn’t explode.’ We do claim that no other alignment agenda has determined how to do this thus far, that the majority of other alignment researchers agree with this prognosis, and that we therefore should be pursuing promising approaches like SOO to increase the probability of the proverbial rocket not exploding.
In the RL experiment, we were only measuring SOO as a means of deception reduction in the agent seeing color (blue agent), and the fact that the colorblind agent is an agent at all is not consequential for our main result.
Please also see here and here, where we’ve described why the goal is not simply to maximize SOO in theory or in practice.
Thanks for all these additional datapoints! I’ll try to respond all of your questions in turn:
Did you find that your AIS survey respondents with more AIS experience were significantly more male than newer entrants to the field?
Overall, there don’t appear to be major differences when filtering for amount of alignment experience. When filtering for greater than vs. less than 6 months of experience, it does appear that the ratio looks more like ~5 M:F; at greater than vs. less than 1 year of experience, it looks like ~8 M:F; the others still look like ~9 M:F. Perhaps the changes you see over the past two years at MATS are too recent to be reflected fully in this data, but it does seem like a generally positive signal that you see this ratio changing (given what we discuss in the post).
Has AE Studio considered sponsoring significant bounties or impact markets for scoping promising new AIS research directions?
We definitely want to do everything we can to support increased exploration of neglected approaches—if you have specific ideas here, we’d love to hear them and discuss more! Maybe we can follow up offline on this.
Did survey respondents mention how they proposed making AIS more multidisciplinary? Which established research fields are more needed in the AIS community?
We don’t appear to have gotten many practical proposals for how to make AIS more multidisciplinary, but there were a number of specific disciplines mentioned in the free responses, including cognitive psychology, neuroscience, game theory, behavioral science, ethics/law/sociology, and philosophy (epistemology was specifically brought up across multiple respondents). One respondent wrote, “AI alignment is dominated by computer scientists who don’t know much about human nature, and could benefit from more behavioral science expertise and game theory,” which I think captures the sentiment of many of the related responses most succinctly (however accurate this statement actually is!). Ultimately, encouraging and funding research at the intersection of these underexplored areas and alignment is likely the only thing that will actually lead to a more multidisciplinary research environment.
Did EAs consider AIS exclusively a longtermist cause area, or did they anticipate near-term catastrophic risk from AGI?
Unfortunately, I don’t think we asked the EA sample about AIS in a way that would allow us to answer this question using the data we have. This would be a really interesting follow-up direction. I will paste in below the ground truth distribution of EAs’ views on the relative promise of these approaches as additional context (eg, we see that the ‘AI risk’ and ‘Existential risk (general)’ distributions have very similar shapes), but I don’t think we can confidently say much about whether these risks were being conceptualized as short- or long-term.
It’s also important to highlight that in the alignment sample (from the other survey), researchers generally indicate that they do not think we’re going to get AGI in the next five years. Again, this doesn’t clarify if they think there are x-risks that could emerge in the nearer term from less-general-but-still-very-advanced AI, but it does provide an additional datapoint that if we are considering AI x-risks to be largely mediated by the advent of AGI, alignment researchers don’t seem to expect this as a whole in the very short term:
I expect this to generally be a more junior group, often not fully employed in these roles, with eg the average age and funding level of the orgs that are being led particularly low (and some of the orgs being more informal).
Here is the full list of the alignment orgs who had at least one researcher complete the survey (and who also elected to share what org they are working for): OpenAI, Meta, Anthropic, FHI, CMU, Redwood Research, Dalhousie University, AI Safety Camp, Astera Institute, Atlas Computing Institute, Model Evaluation and Threat Research (METR, formerly ARC Evals), Apart Research, Astra Fellowship, AI Standards Lab, Confirm Solutions Inc., PAISRI, MATS, FOCAL, EffiSciences, FAR AI, aintelope, Constellation, Causal Incentives Working Group, Formalizing Boundaries, AISC.
~80% of the alignment sample is currently receiving funding of some form to pursue their work, and ~75% have been doing this work for >1 year. Seems to me like this is basically the population we were intending to sample.
One additional factor for my abandoning it was that I couldn’t imagine it drawing a useful response population anyway; the sample mentioned above is a significant surprise to me (even with my skepticism around the makeup of that population). Beyond the reasons I already described, I felt that it being done by a for-profit org that is a newcomer and probably largely unknown would dissuade a lot of people from responding (and/or providing fully candid answers to some questions).
Your expectation while taking the survey about whether we were going to be able to get a good sample does not say much about whether we did end up getting a good sample. Things that better tell us whether or not we got a good sample are, eg, the quality/distribution of the represented orgs and the quantity of actively-funded technical alignment researchers (both described above).
All in all, I expect that the respondent population skews heavily toward those who place a lower value on their time and are less involved.
Note that the survey took people ~15 minutes to complete and resulted in a $40 donation being made to a high-impact organization, which puts our valuation of an hour of their time at ~$160 (roughly equivalent to the hourly rate of someone who makes ~$330k annually). Assuming this population would generally donate a portion of their income to high-impact charities/organizations by default, taking the survey actually seems to probably have been worth everyone’s time in terms of EV.
There’s a lot of overlap between alignment researchers and the EA community, so I’m wondering how that was handled.
Agree that there is inherent/unavoidable overlap. As noted in the post, we were generally cautious about excluding participants from either sample for reasons you mention and also found that the key results we present here are robust to these kinds of changes in the filtration of either dataset (you can see and explore this for yourself here).
With this being said, we did ask in both the EA and the alignment survey to indicate the extent to which they are involved in alignment—note the significance of the difference here:
From alignment survey:
From EA survey:
This question/result serves both as a good filtering criterion for cleanly separating out EAs from alignment researchers and also gives a pretty strong evidence that we are drawing on completely different samples across these surveys (likely because we sourced the data for each survey through completely distinct channels).
Regarding the support for various cause areas, I’m pretty sure that you’ll find the support for AI Safety/Long-Termism/X-risk is higher among those most involved in EA than among those least involved. Part of this may be because of the number of jobs available in this cause area.
Interesting—I just tried to test this. It is a bit hard to find a variable in the EA dataset that would cleanly correspond to higher vs. lower overall involvement, but we can filter by number of years one has been involved involved in EA, and there is no level-of-experience threshold I could find where there are statistically significant differences in EAs’ views on how promising AI x-risk is. (Note that years of experience in EA may not be the best proxy for what you are asking, but is likely the best we’ve got to tackle this specific question.)
Blue is >1 year experience, red is <1 year experience:
Blue is >2 years experience, red is <2 years experience:
Thanks for the comment!
Consciousness does not have a commonly agreed upon definition. The question of whether an AI is conscious cannot be answered until you choose a precise definition of consciousness, at which point the question falls out of the realm of philosophy into standard science.
Agree. Also happen to think that there are basic conflations/confusions that tend to go on in these conversations (eg, self-consciousness vs. consciousness) that make the task of defining what we mean by consciousness more arduous and confusing than it likely needs to be (which isn’t to say that defining consciousness is easy). I would analogize consciousness to intelligence in terms of its difficulty to nail down precisely, but I don’t think there is anything philosophically special about consciousness that inherently eludes modeling.
is there some secret sauce that makes the algorithm [that underpins consciousness] special and different from all currently known algorithms, such that if we understood it we would suddenly feel enlightened? I doubt it. I expect we will just find a big pile of heuristics and optimization procedures that are fundamentally familiar to computer science.
Largely agree with this too—it very well may be the case (as seems now to be obviously true of intelligence) that there is no one ‘master’ algorithm that underlies the whole phenomenon, but rather as you say, a big pile of smaller procedures, heuristics, etc. So be it—we definitely want to better understand (for reasons explained in the post) what set of potentially-individually-unimpressive algorithms, when run in concert, give you system that is conscious.
So, to your point, there is not necessarily any one ‘deep secret’ to uncover that will crack the mystery (though we think, eg, Graziano’s AST might be a strong candidate solution for at least part of this mystery), but I would still think that (1) it is worthwhile to attempt to model the functional role of consciousness, and that (2) whether we actually have better or worse models of consciousness matters tremendously.
There will be places on the form to indicate exactly this sort of information :) we’d encourage anyone who is associated with alignment to take the survey.
Thanks for taking the survey! When we estimated how long it would take, we didn’t count how long it would take to answer the optional open-ended questions, because we figured that those who are sufficiently time constrained that they would actually care a lot about the time estimate would not spend the additional time writing in responses.
In general, the survey does seem to take respondents approximately 10-20 minutes to complete. As noted in another comment below,
this still works out to donating $120-240/researcher-hour to high-impact alignment orgs (plus whatever the value is of the comparison of one’s individual results to that of community), which hopefully is worth the time investment :)
Ideally within the next month or so. There are a few other control populations still left to sample, as well as actually doing all of the analysis.
Thanks for sharing this! Will definitely take a look at this in the context of what we find and see if we are capturing any similar sentiment.
It seems fairly clear that widely deployed, highly capable AI systems enabling unrestricted access to knowledge about weapons development, social manipulation techniques, coordinated misinformation campaigns, engineered pathogens, etc. could pose a serious threat. Bad actors using that information at scale could potentially cause societal collapse even if the AI itself was not agentic or misaligned in the way we usually think about with existential risk.