Anthropic employees: stop deferring to Dario on politics. Think for yourself.
Do your company’s actions actually make sense if it is optimizing for what you think it is optimizing for?
Anthropic lobbied against mandatory RSPs, against regulation, and, for the most part, didn’t even support SB-1047. The difference between Jack Clark and OpenAI’s lobbyists is that publicly, Jack Clark talks about alignment. But when they talk to government officials, there’s little difference on the question of existential risk from smarter-than-human AI systems. They do not honestly tell the governments what the situation is like. Ask them yourself.
A while ago, OpenAI hired a lot of talent due to its nonprofit structure.
Anthropic is now doing the same. They publicly say the words that attract EAs and rats. But it’s very unclear whether they institutionally care.
Dozens work at Anthropic on AI capabilities because they think it is net-positive to get Anthropic at the frontier, even though they wouldn’t work on capabilities at OAI or GDM.
It is not net-positive.
Anthropic is not our friend. Some people there do very useful work on AI safety (where “useful” mostly means “shows that the predictions of MIRI-style thinking are correct and we don’t live in a world where alignment is easy”, not “increases the chance of aligning superintelligence within a short timeframe”), but you should not work there on AI capabilities.
Anthropic’s participation in the race makes everyone fall dead sooner and with a higher probability.
Work on alignment at Anthropic if you must. I don’t have strong takes on that. But don’t do work for them that advances AI capabilities.
There are certain kinds of things that it’s essentially impossible for any institution to effectively care about.
I think you should try to clearly separate the two questions of:
1. Is their work on capabilities a net positive or net negative for humanity’s survival?
2. Are they trying to “optimize” for humanity’s survival, and do they care about alignment deep down?
I strongly believe 2 is true, because why on Earth would they want to make an extra dollar if misaligned AI kills them in addition to everyone else? Won’t any measure of their social status be far higher after the singularity, if it turns out that they tried to do what was best for humanity?
I’m not sure about 1. I think even they’re not sure about 1. I heard that they held back on releasing their newer models until OpenAI raced ahead of them.
You (and all the people who upvoted your comment) might have a chance of convincing them (a little) in a good-faith debate. We’re all on the same ship after all, when it comes to AI alignment.
PS: AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter, so I personally hope for their success. I may be wrong, but it’s my best guess...
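To make that proportionality point concrete, here is a tiny back-of-the-envelope sketch. The global totals are the rough figures quoted above; the extra $1B that a single company adds to each side is a made-up illustrative number, not an estimate of anything.

```python
# Back-of-the-envelope sketch of the proportionality claim above.
# Global totals are the rough figures quoted in the comment; the amount a single
# company adds ($1B to each side) is a hypothetical, illustrative number.

global_safety = 0.1e9        # rough global AI safety spending, $/year
global_capabilities = 200e9  # rough global AI capabilities spending, $/year

company_added = 1e9          # hypothetical: one lab adds $1B of effort to each side

relative_safety_boost = company_added / global_safety               # ~10x the field
relative_capabilities_boost = company_added / global_capabilities   # ~0.5% of the field

print(f"Relative boost to safety:       {relative_safety_boost:.0%}")
print(f"Relative boost to capabilities: {relative_capabilities_boost:.1%}")
# The same absolute effort is a ~10x increase for the tiny safety field but a
# rounding error for the capabilities field -- which is the sense in which a lab
# doing both "should speed up the former more than the latter".
```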
AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter
There is very little hope IMHO in increasing spending on technical AI alignment, because (as far as we can tell, based on how slow progress has been on it over the last 22 years) it is a much thornier problem than AI capability research, and because most people doing AI alignment research don’t have a viable story about how they are going to stop any insights / progress they achieve from helping with AI capability research. I mean, if you have a specific plan that avoids these problems, then let’s hear it, I am all ears, but advocacy in general of increasing work on technical alignment is counterproductive IMHO.
EDIT: thank you so much for replying to the strongest part of my argument, no one else tried to address it (despite many downvotes).
I disagree with the position that technical AI alignment research is counterproductive due to increasing capabilities, but I think this is very complicated and worth thinking about in greater depth.
Do you think it’s possible that your intuition that alignment research is counterproductive comes from comparing the plausibility of these two outcomes:
1. Increasing alignment research causes people to solve AI alignment, and humanity survives.
2. Increasing alignment research leads to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity.
And you decided that outcome 2 felt more likely?
Well, that’s the wrong comparison to make.
The right comparison should be:
1. Increasing alignment research causes people to improve AI alignment, and humanity survives in a world where we otherwise wouldn’t survive.
2. Increasing alignment research leads to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity in a world where we otherwise would survive.
In this case, I think even you would agree that P(1) > P(2).
P(2) is very unlikely because if increasing alignment research really would lead to such a superintelligence, and it really would kill humanity… then let’s be honest, we’re probably doomed in that case anyways, even without increasing alignment research.
If that really was the case, the only surviving civilizations would have had different histories, or different geographies (e.g. only a single continent with enough space for a single country), leading to a single government which could actually enforce an AI pause.
We’re unlikely to live in a world so pessimistic that alignment research is counterproductive, yet so optimistic that we could survive without that alignment research.
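One compact way to write down the comparison being proposed (the notation here is mine, introduced only for illustration, in a potential-outcomes style): let S_R and S_¬R indicate whether humanity survives with and without scaled-up alignment research.

```latex
% Notation introduced for illustration only.
% S_R, S_{\neg R} \in \{0,1\}: survival with / without scaled-up alignment research.
\begin{align*}
P(1) &= P(S_R = 1,\; S_{\neg R} = 0) && \text{(research is what saves us)} \\
P(2) &= P(S_R = 0,\; S_{\neg R} = 1) && \text{(research is what dooms us)} \\
P(1) - P(2) &= P(S_R = 1) - P(S_{\neg R} = 1) && \text{(net effect on survival)}
\end{align*}
```

So claiming P(1) > P(2) is the same as claiming that scaling up alignment research raises the probability of survival on net.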
we’re probably doomed in that case anyways, even without increasing alignment research.
I believe we’re probably doomed anyways.
I think even you would agree that P(1) > P(2).
Sorry to disappoint you, but I do not agree.
Although I don’t consider it quite impossible that we will figure out alignment, most of my hope for our survival is in other things, such as a group taking over the world and then using their power to ban AI research. (Note that that is in direct contradiction to your final sentence.) So for example, if Putin or Xi were dictator of the world, my guess is that there is a good chance he would choose to ban all AI research. Why? It has unpredictable consequences. We Westerners (particularly Americans) are comfortable with drastic change, even if that change has drastic unpredictable effects on society; non-Westerners are much more skeptical: there have been too many invasions, revolutions and peasant rebellions that have killed millions in their countries. I tend to think that the main reason Xi supports China’s AI industry is to prevent the US and the West from superseding China, and if that consideration were removed (because, for example, he had gained dictatorial control over the whole world) he’d choose to just shut it down (and he wouldn’t feel the need to have a very strong argument for shutting it down, the way Western decision-makers would: non-Western leaders shut important things down all the time, or at least they would if the governments they led had the funding and the administrative capacity to do so).
Of course Xi’s acquiring dictatorial control over the whole world is extremely unlikely, but the magnitude of the technological changes and societal changes that are coming will tend to present opportunities for certain coalitions to gain and to keep enough power to shut AI research down worldwide. (Having power in all countries hosting leading-edge fabs is probably enough.) I don’t think this ruling coalition necessarily needs to believe that AI presents a potent risk of human extinction for them to choose to shut it down.
I am aware that some reading this will react to “some coalition manages to gain power over the whole world” even more negatively than to “AI research causes the extinction of the entire human race”. I guess my response is that I needed an example of a process that could save us and that would feel plausible—i.e., something that might actually happen. I hasten to add that there might be other processes that save us that don’t elicit such a negative reaction—including processes the nature of which we cannot even currently imagine.
I’m very skeptical of any intervention that reduces the amount of time we have left in the hopes that this AI juggernaut is not really as potent a threat to us as it currently appears. I was much, much less skeptical of alignment research 20 years ago, but since then a research organization has been exploring the solution space, and the leader of that organization (Nate Soares) and its most senior researcher (Eliezer) are reporting that the alignment project is almost completely hopeless. Yes, this organization (MIRI) is kind of small, but it has been funded well enough to keep about a dozen top-notch researchers on the payroll and it has been competently led. Also, for research efforts like this, how many years the team had to work on the problem is more important than the size of the team, and 22 years is a pretty long time to end up with almost no progress other than some initial insights (around the orthogonality thesis, the fragility of value, convergent instrumental values, and CEV as a proposed solution), if the problem were in fact solvable by the current generation of human beings.
OK, if I’m being fair and balanced, then I have to concede that it was probably only in 2006 (when Eliezer figured out how to write a long intellectually-dense blog post every day) or even only in 2008 (when Anna Salamon joined the organization—she was very good at recruiting and had a lot of energy to travel and to meet people) that Eliezer’s research organization could start to pick and choose among a broad pool of very talented people, but still, between 2008 and now is 17 years, which again is a long time for a strong team to fail to make even a decent fraction of the progress humanity would seem to need to make on the alignment problem if in fact the alignment problem is solvable by spending more money on it. It does not appear to me to be the sort of problem that can be solved with 1 or 2 additional insights; it seems a lot more like the kind of problem where insight 1 is needed, but before any mere human can find insight 1, all the researchers need to have already known insight 2, and to have any hope of finding insight 2, they all would have had to know insight 3, and so on.
Our worldviews do not match, and I fail to see how yours makes sense. Even when I relax my predictions about the future to take in a wider set of possible paths… I still don’t get it.
AI is here. AGI is coming whether you like it or not. ASI will probably doom us.
Anthropic, as an org, seems to believe that there is a threshold of power such that creating an AGI more powerful than that would kill us all. OpenAI may believe this too, in part, but it seems like their estimate of where that threshold lies is further away than mine. Thus, I think there is a good chance they will get us all killed. There is substantial uncertainty and risk around these predictions.
Now, consider that, before AGI becomes so powerful that utilizing it for practical purposes becomes suicide, there is a regime where the AI product gives its wielder substantial power. We are currently in that regime. The further AI advances, the more power it grants.
Anthropic might get us all killed. OpenAI is likely to get us all killed. If you trust the employees of Anthropic to not want to be killed by OpenAI… then you should realize that supporting them while hindering OpenAI is at least potentially a good bet.
Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
It seems to me that other possibilities exist, besides “has model with numbers” or “confused.” For example, that there are relevant ethical considerations here which are hard to crisply, quantitatively operationalize!
One such consideration which feels especially salient to me is the heuristic that before doing things, one should ideally try to imagine how people would react, upon learning what you did. In this case the action in question involves creating new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth, so my guess is that the reaction would entail things like e.g. literal worldwide riots. If so, this strikes me as the sort of consideration one should generally weight more highly than their idiosyncratic utilitarian BOTEC.
Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).
It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that’s not this world.
nuclear weapons have different game theory. if your adversary has one, you want to have one to not be wiped out; once both of you have nukes, you don’t want to use them.
also, people were not aware of real close calls until much later.
with ai, there are economic incentives to develop it further than other labs, but as a result, you risk everyone’s lives for money and also create a race to the bottom where everyone’s lives will be lost.
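To spell out the difference in incentive structure, here is a toy payoff table for the race dynamic; the numbers are invented purely to illustrate the prisoner’s-dilemma shape of the situation, not estimates of anything.

```python
# Toy model of the race dynamic described above. Payoff numbers are invented,
# chosen only to give the standard prisoner's-dilemma ordering.
# Each of two labs chooses to "race" or "hold"; payoffs are (lab A, lab B).

payoffs = {
    ("hold", "hold"): (3, 3),  # both hold back: slower profit, lower shared risk
    ("race", "hold"): (5, 0),  # the racer captures the market
    ("hold", "race"): (0, 5),
    ("race", "race"): (1, 1),  # both race: worst collective outcome (shared x-risk)
}

def best_response_for_A(b_choice):
    # A's payoff is the first element of the tuple; pick A's best action
    # given B's fixed choice.
    return max(["hold", "race"], key=lambda a: payoffs[(a, b_choice)][0])

print("A's best response if B holds:", best_response_for_A("hold"))  # race
print("A's best response if B races:", best_response_for_A("race"))  # race
# Racing is individually dominant even though (race, race) is collectively worse
# than (hold, hold) -- unlike the nuclear case, where once both sides are armed,
# restraint (not launching) is each side's best response.
```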
That’s a very good heuristic. I bet even Anthropic agrees with it. Anthropic did not release their newer models until OpenAI released ChatGPT and the race had already started.
That’s not a small sacrifice. Maybe if they had released it sooner, they would be bigger than OpenAI right now due to the first-mover advantage.
I believe they want the best for humanity, but they are in a no-win situation, and it’s a very tough choice what they should do. If they stop trying to compete, the other AI labs will build AGI just as fast, and they will lose all their funds. If they compete, they can make things better.
AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter.
Even if they don’t support all the regulations you believe in, they’re the big AI company supporting relatively much more regulation than all the others.
I don’t know, I may be wrong. Sadly it is so very hard to figure out what’s good or bad for humanity in this uncertain time.
I don’t think that most people, upon learning that Anthropic’s justification was “other companies were already putting everyone’s lives at risk, so our relative contribution to the omnicide was low” would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like “it’s not okay to risk extinction, period.” That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they’d overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.
Regarding common ethical intuitions, I think people in the post-singularity world (or afterlife, for the sake of argument) will be far more forgiving of Anthropic. They will understand, even if Anthropic (and people like me) turned out to be wrong and actually were a net negative for humanity.
Many ordinary people (maybe most) would have done the same thing in their shoes.
Ordinary people do not follow the utilitarianism that the awkward people here follow. Ordinary people also do not follow deontology or anything that’s the opposite of utilitarianism. Ordinary people just follow their direct moral feelings. If Anthropic was honestly trying to make the future better, people won’t feel that outraged at its “consequentialism.” They may be outraged at perceived incompetence, but Anthropic definitely won’t be the only one accused of incompetence.
If you trust the employees of Anthropic to not want to be killed by OpenAI
In your mind, is there a difference between being killed by AI developed by OpenAI and by AI developed by Anthropic? What positive difference does it make, if Anthropic develops a system that kills everyone a bit earlier than OpenAI would develop such a system? Why do you call it a good bet?
AGI is coming whether you like it or not
Nope.
You’re right that the local incentives are not great: having a more powerful model is hugely economically beneficial, unless it kills everyone.
But if 8 billion humans knew what many LessWrong users know, OpenAI, Anthropic, DeepMind, and the others could not develop what they want to develop, and AGI wouldn’t come for a while.
Off the top of my head, it actually likely could be sufficient to either (1) inform some fairly small subset of the 8 billion people of what the situation is, or (2) convince that subset that the situation as we know it is likely enough to be the case that some measures to figure out the risks, and to not be killed by AI in the meantime, are justified. It’s also helpful to (3) suggest/introduce/support policies that change the incentives to race or increase the chance of (1) or (2).
A theory of change some have for Anthropic is that Anthropic might get into a position to successfully do one of these two things.
My shortform post says that the real Anthropic is very different from the kind of imagined Anthropic that would attempt to do these things. Real Anthropic opposes these things.
What is this referring to?
People representing Anthropic argued against government-required RSPs. I don’t think I can share the details of the specific room where that happened, because it will be clear who I know this from.
Ask Jack Clark whether that happened or not.
Anthropic people have also said approximately this publicly, saying that it’s too soon to make the rules, since we’d end up misspecifying them due to ignorance of tomorrow’s models.
There’s a big difference between regulation which says roughly “you must have something like an RSP”, and regulation which says “you must follow these specific RSP-like requirements”, and I think Mikhail is talking about the latter.
I personally think the former is a good idea, and thus supported SB-1047 along with many other lab employees. It’s also pretty clear to me that locking in circa-2023 thinking about RSPs would have been a serious mistake, and so I (along with many others) am generally against very specific regulations because we expect they would on net increase catastrophic risk.
When do you think would be a good time to lock in regulation? I personally doubt RSP-style regulation would even help, but the notion that now is too soon / risks locking in early sketches strikes me as in some tension with e.g. Anthropic trying to automate AI research ASAP, Dario expecting ASL-4 systems between 2025—the current year!—and 2028, etc.
AFAIK Anthropic has not unequivocally supported the idea of “you must have something like an RSP” or even SB-1047 despite many employees, indeed, doing so.
My guess is it’s referring to Anthropic’s position on SB 1047, or Dario’s and Jack Clark’s statements that it’s too early for strong regulation, or how Anthropic’s policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).
SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.
People are arguing about the answer to the Sleeping Beauty problem! I thought this was pretty much dissolved with this post’s title! But there are lengthy posts and even a prediction market!
Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure and preference framework rewarding the number you output as “probability”, people don’t have a disagreement about what is the best number to output. Sleeping Beauty is about definitions.
And still, I see posts arguing that if a tree falls on a deaf Sleeping Beauty, in a forest with no one to hear it, it surely doesn’t produce a sound, because here’s how humans perceive sounds, which is the definition of a sound, and there are demonstrably no humans around the tree. (Or maybe that it surely produces the sound because here’s the physics of the sound waves, and the tree surely abides by the laws of physics, and there are demonstrably sound waves.)
This is arguing about definitions. You feel strongly that “probability” is that thing that triggers the “probability” concept neuron in your brain. If people have a different concept triggering “this is probability”, you feel like they must be wrong, because they’re pointing at something they say is a sound and you say isn’t.
Probability is something defined in math by necessity. There’s only one way to do it to not get exploited in natural betting schemes/reward structures that everyone accepts when there are no anthropics involved. But if there are multiple copies of the agent, there’s no longer a single possible betting scheme defining a single possible “probability”, and people draw the boundary/generalise differently in this situation.
You all should just call these two probabilities two different words instead of arguing which one is the correct definition for “probability”.
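To make the “fix the payout structure and the disagreement evaporates” point concrete, here is a small scoring sketch of my own: the setup is the standard Sleeping Beauty problem, and the quadratic scoring rule is just one convenient choice of reward.

```python
# Sleeping Beauty: a fair coin is tossed; Heads -> Beauty is woken once,
# Tails -> she is woken twice. At each awakening she reports a credence p in Heads.
# Expected (quadratic) scores under two natural payout structures:

def expected_scores(p):
    # Scored once per awakening: Tails contributes two penalties, Heads one.
    per_awakening = 0.5 * (-((p - 1) ** 2)) + 0.5 * (2 * -(p ** 2))
    # Scored once per experiment (per coin toss), however many awakenings occur.
    per_experiment = 0.5 * (-((p - 1) ** 2)) + 0.5 * (-(p ** 2))
    return per_awakening, per_experiment

for p in [0.30, 1 / 3, 0.40, 0.50, 0.60]:
    a, e = expected_scores(p)
    print(f"p={p:.3f}  per-awakening={a:.4f}  per-experiment={e:.4f}")

# The per-awakening structure is optimized at p = 1/3 (the "thirder" answer);
# the per-experiment structure is optimized at p = 1/2 (the "halfer" answer).
# Once the reward structure is fixed, there is nothing left to disagree about.
```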
As the creator of the linked market, I agree it’s definitional. I think it’s still interesting to speculate/predict what definition will eventually be considered most natural.
[RETRACTED after Scott Aaronson’s reply by email]
I’m surprised by Scott Aaronson’s approach to alignment. He has mentioned in a talk that a research field needs to have at least one of two things: experiments or a rigorous mathematical theory, and so he’s focusing on the experiments that are possible to do with current AI systems.
The alignment problem is centered around powerful consequentialist agents appearing when you run optimization that searches through spaces containing capable agents. The dynamics at the level of superhuman general agents are not something you get to experiment with (more than once); and we do indeed need a rigorous mathematical theory that would describe the space and point at the parts of it that are agents aligned with us.
[removed]
I’m disappointed that, currently, only Infra-Bayesianism tries to achieve that[1], that I don’t see dozens of other research directions trying to have a rigorous mathematical theory that would provide desiderata for AGI training setups, and that even actual scientists entering the field [removed].
Infra-Bayesianism is an approach that tries to describe agents in a way that would closely resemble the behaviour of AGIs, starting with a way you can model them as having probabilities about the world in a computable way that solves non-realizability in RL (short explanation, a sequence with equations and proofs) and making decisions in a way that optimization processes would select for, and continuing with a formal theory of naturalized induction and, finally, a proposal for an alignment protocol.
To be clear, I don’t expect Infra-Bayesianism to produce an answer to what loss functions should be used to train an aligned AGI in the time that we have remaining; but I’d expect that if there were a hundred research directions like that, trying to come up with a rigorous mathematical theory that successfully attacks the problem, with thousands of people working on them, some would succeed.
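For readers who haven’t seen the formalism, here is a drastically simplified caricature of the kind of decision rule involved: worst-case expected utility over a convex set of hypotheses, which infra-Bayesian decision theory generalizes (the real thing works with convex sets of sa-measures and is considerably more subtle).

```latex
% Caricature only: maximin expected utility over a credal set, the simplest
% ancestor of the infra-Bayesian decision rule.
\[
  a^{\ast} \;=\; \arg\max_{a \in \mathcal{A}} \; \min_{\mu \in \mathcal{C}} \; \mathbb{E}_{\mu}\left[ U(a) \right]
\]
% \mathcal{A}: available actions/policies; \mathcal{C}: a convex set of environments
% standing in for a single prior; U: the agent's utility. When \mathcal{C} has a
% single element, this reduces to ordinary expected-utility maximization.
```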