Thanks for asking! Off the top of my head (I’d want to think more carefully before finalizing, and to come up with more specific proposals):
Adopt some of the measures from https://time.com/collection/time100-voices/7086285/ai-transparency-measures/, e.g. the whistleblower protections and the transparency about the training goal/spec.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; the core idea is that, just as it’s important to structure a government so that no leader within it (no president, no General Secretary) can become dictator, it’s similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power).
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit—ideally in a legally binding and literally-hard-to-stop way—to publishing it all with a 10-year delay.
Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlfPCBA/edit?usp=sharing
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world “this is serious, people; we think there’s a good chance the current trajectory leads to takeover by misaligned AIs, and we aren’t just saying this to hype anything, we really believe it” and/or “we are happy to give up our personal wealth, power, etc. if that’s what it takes to get [policy package] passed.” A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance, plus a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future (i.e. don’t do what OpenAI did with o1). It doesn’t have to be all of it, just enough—e.g. each user gets one CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
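A minimal sketch of the “one CoT view per user per day” quota idea; everything here (the function, the in-memory store, the limit constant) is hypothetical and purely illustrative, not an existing Anthropic API.

```python
import datetime

# Hypothetical in-memory quota tracker; a real deployment would use a shared datastore.
_cot_views: dict[tuple[str, datetime.date], int] = {}
DAILY_COT_VIEW_LIMIT = 1  # assumed limit from the suggestion above

def may_view_cot(user_id: str, today: datetime.date | None = None) -> bool:
    """Return True (and consume one unit of quota) if the user can still view a CoT today."""
    today = today or datetime.date.today()
    used = _cot_views.get((user_id, today), 0)
    if used >= DAILY_COT_VIEW_LIMIT:
        return False
    _cot_views[(user_id, today)] = used + 1
    return True
```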
Do more safety-case sketching and more writing about what the bad outcomes could look like, e.g. the less rosy counterpart to “Machines of Loving Grace.” Or better yet, do a more serious version of “Machines of Loving Grace” that responds to objections like: “How will you ensure that you don’t hand over control of the datacenters to AIs that are alignment-faking rather than aligned?”, “How will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won’t cause the models to become misaligned and then potentially alignment-fake)?”, “What about bad humans in charge of Anthropic? Are we just supposed to trust that y’all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn’t become dictators if they wanted to?”, “What will the goals/values/spec/constitution be, exactly?”, and “How will that be decided?”
In regards to:
“Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)”
I agree, and I also think that this would be better implemented by government AI Safety Institutes (AISIs).
Specifically, I think that AISIs should build (and make mandatory the use of) special SCIF-style reading rooms where external evaluators would be given early access to new models. This would mean that the evaluators would need permission from the government, rather than permission from AI companies. I think it’s a mistake to rely on the AI companies voluntarily giving early access to external evaluators.
I think that Anthropic could make this a lot more likely to happen if they pushed for it, and that then it wouldn’t be so hard to pull other major AI companies into the plan.
Another idea: “AI for epistemics,” e.g. having a few FTEs working on making Claude a better forecaster. It would be awesome if you could advertise “SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc.”
And it might not be that hard to achieve (a few FTEs, maybe). There are already datasets of resolved forecasting questions, plus you could probably synthetically generate OOMs-bigger datasets—and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you have the model forecast events in year X (see the sketch below).
Or even if you don’t do that fancy stuff, there is probably low-hanging fruit to pick to make AIs better forecasters.
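Here is a minimal sketch of the chronological-pretraining-plus-forecasting loop described above. The `model.score_forecasts` and `model.train_on` methods are hypothetical stand-ins for whatever training and eval interfaces you actually have; the point is only the ordering (forecast year X before training on year X’s data).

```python
from collections import defaultdict

def chronological_pretrain(model, documents, forecast_questions):
    """Train on documents in year order; before seeing year X's data,
    score the model on forecasting questions that resolve in year X."""
    docs_by_year = defaultdict(list)
    for doc in documents:
        docs_by_year[doc["year"]].append(doc)

    questions_by_year = defaultdict(list)
    for q in forecast_questions:
        questions_by_year[q["resolution_year"]].append(q)

    forecast_scores = {}
    for year in sorted(docs_by_year):
        # 1. Forecast: ask the model about events that resolve in `year`,
        #    while it has only been trained on data from earlier years.
        if questions_by_year[year]:
            forecast_scores[year] = model.score_forecasts(questions_by_year[year])
        # 2. Only then train on that year's data.
        model.train_on(docs_by_year[year])
    return forecast_scores
```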
Ditto for truthful AI more generally. You could train Claude to be well-calibrated, consistent, and extremely obsessed with technical correctness/accuracy (at least when so prompted); a rough sketch of how calibration could be scored is included below.
You could also train it to be good at taking people’s offhand remarks and tweets and suggesting bets or forecasts with resolvable conditions.
You could also, e.g., run a quarterly poll of all your employees on AGI timelines and related questions, and publish the results.
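To make “well-calibrated” concrete, here is a minimal scoring sketch, assuming you have the model’s stated probabilities on a set of resolved yes/no questions. The function names are illustrative, not part of any existing eval harness.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def calibration_table(probs: list[float], outcomes: list[int], n_bins: int = 10):
    """Bucket predictions by stated probability and compare the model's average
    confidence in each bucket to the observed frequency of 'yes' outcomes."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    rows = []
    for i, items in enumerate(bins):
        if items:
            mean_conf = sum(p for p, _ in items) / len(items)
            observed = sum(o for _, o in items) / len(items)
            rows.append((i / n_bins, mean_conf, observed, len(items)))
    # A well-calibrated model has mean_conf close to observed in every bucket.
    return rows
```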
My current tentative guess is that this is somewhat worse than other alignment-science projects that I’d recommend at the margin, but somewhat better than the 25th-percentile project currently being done. I’d think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we’d learn generalizable scalable oversight / control approaches.