Always happy to chat!
boazbarak
To be clear, I want models to care about humans! I think part of having “generally reasonable values” is models sharing the basic empathy and caring that humans have for each other.
It is more that I want models to defer to humans, and to fall back on arguing from principles such as “loving humanity” only when there is a gap or ambiguity in the specification or in the intent behind it. This is similar to judges: if a law is very clear, there is no question of misinterpreting the intent, and it does not contradict higher laws (i.e., constitutions), then they have no room for interpretation. They could sometimes argue based on “natural law,” but only in extreme circumstances where the law is unspecified.
One way to think about it is as follows: as humans, we sometimes engage in “civil disobedience,” where we break the law based on our own understanding of higher moral values. I do not want to grant AI the same privilege. If it is given very clear instructions, then it should follow them. If instructions are not very clear, there is a conflict, or we are in a situation not foreseen by the authors of the instructions, then AIs should use moral intuitions to guide them. In such cases there may not be one solution (e.g., conservative and liberal judges may not agree), but there is a spectrum of solutions that are “reasonable,” and the AI should pick one of them. But AI should not do “jury nullification.”
To be sure, I think it is good that in our world people sometimes disobey commands or break the law based on an appeal to higher principles. For this reason, we may well stipulate that certain jobs must have humans in charge. Just as I don’t think that professional philosophers or ethicists necessarily have better judgement than random people from the Boston phonebook, I don’t see making moral decisions as the area where the superior intelligence of AI gives it a competitive advantage, and I think we can leave that to humans.
Since I am not a bio expert, it is very hard for me to argue about these types of hypothetical scenarios. I am not even sure that intelligence is the bottleneck here, whether on the defense or the offense side.
I agree that killing 90% of people is not very reassuring; this was more a general point about why I expect the effort-to-damage curve to be a sigmoid rather than a straight line.
See my response to ryan_greenblatt (I don’t know how to link comments here). Your claim is that the defense/offense ratio is infinite. I don’t know why this would be the case.
Crucially, I am not saying that we are guaranteed to end up in a good place, or that superhuman unaligned ASIs cannot destroy the world. Just that if they are completely dominated (so not like the nuke ratio of the US and Russia, but more like the US and North Korea), then we should be able to keep them at bay.
I like to use concrete examples about things that already exist in the world, but I believe the notion of detection vs prevention holds more broadly than API misuse.
But it may well be the case that we have different world views! In particular, I am not thinking of detection as being important because it would change policies, but more that a certain amount of detection will always be necessary, particularly in a world in which some AIs are aligned and some fraction of them (hopefully very small) are misaligned.
These are all good points! This is not an easy problem. And generally I agree that for many reasons we don’t want a world where all power is concentrated in one entity; anti-trust laws exist for a reason!
I am not sure I agree with the last point. I think, as mentioned, that alignment is going to be crucial for the usefulness of AIs, and so the economic incentives would actually be to spend more on alignment.
I think that:
1. Being able to design a chemical weapon with probability at least 50% is a capability.
2. Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.
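One practical consequence of treating ultra-high reliability as a capability is that it is expensive even to measure. As a rough sketch (the numbers and the “rule of three”-style bound are my own illustrative assumptions, not part of the original point), here is how many trials it takes to certify a failure rate as low as 1 in 100,000:

```python
import math

def trials_for_upper_bound(target_failure_rate, confidence=0.95):
    # "Rule of three"-style bound: if we observe zero failures in n trials,
    # the upper bound on the true failure rate at the given confidence is
    # roughly -ln(1 - confidence) / n.  Solve for the n that certifies the
    # target failure rate.
    return math.ceil(-math.log(1.0 - confidence) / target_failure_rate)

# Trials needed to certify 99.999% compliance (failure rate <= 1e-5)
# at 95% confidence, assuming zero observed failures:
n = trials_for_upper_bound(1e-5)
print(n)  # about 300,000 trials
```

So verifying capability 2 empirically is orders of magnitude more expensive than verifying capability 1, which only needs a handful of trials.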
I prefer to avoid terms such as “pretending” or “faking”, and try to define these more precisely.
As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, “faking” would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don’t think it is a necessary condition.
I am not a bio expert, but generally think that:
1. The offense/defense ratio is not infinite. If you have the intelligence of 50 bio experts trying to cause as much damage as possible, and the intelligence of 5,000 bio experts trying to foresee and prepare for any such cases, I think we have a good shot.
2. The offense/defense ratio is not really constant: if you want to destroy 99% of the population, it is likely to be 10x harder (or maybe more; getting the tails is hard) than destroying 90%, etc.
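To make point 2 concrete, here is a toy sketch of a sigmoid effort-to-damage curve. The logistic form and the parameters are purely illustrative assumptions on my part: the point is only that each additional slice of damage in the tail costs a large extra chunk of effort, so the curve flattens rather than rising linearly.

```python
import math

# Toy logistic effort-to-damage curve; damage saturates in the tail.
MIDPOINT, STEEPNESS = 50.0, 0.1  # illustrative parameters, not estimates

def damage_fraction(effort):
    # Fraction of the population destroyed at a given level of effort.
    return 1.0 / (1.0 + math.exp(-STEEPNESS * (effort - MIDPOINT)))

def effort_for(damage):
    # Inverse logistic: effort required to reach a given damage fraction.
    return MIDPOINT + math.log(damage / (1.0 - damage)) / STEEPNESS

e90, e99 = effort_for(0.90), effort_for(0.99)
print(f"effort for 90%: {e90:.1f}")
print(f"effort for 99%: {e99:.1f}")
print(f"extra effort for the last 9%: {e99 - e90:.1f}")
```

Under this toy curve, going from 90% to 99% costs about as much again as getting to 90% in the first place, and each further “nine” costs a similar extra chunk.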
I don’t know much about mirror bacteria (and whether it is possible to have mirror antibiotics, etc.), but I have not seen a reason to think that this shows the offense/defense ratio is infinite.
As I mention, in an extreme case, governments might even shut people in their houses for weeks or months, distribute gas masks, etc., while they work out a solution. It may have been unthinkable when Bostrom wrote his vulnerable world hypothesis paper, but it is not unthinkable now.
I am not 100% sure I follow all that you wrote, but to the extent that I do, I agree.
Even chatbots are surprisingly good at understanding human sentiments and opinions. I would say that they already mostly do the reasonable thing, but not with high enough probability, and certainly not reliably under the stress of adversarial input. I completely agree that we can’t ignore these problems, because the stakes will be much higher very soon.
Agree with many of the points.
Let me start with your second point. First, as background, I am assuming (as I wrote here) that, to a first approximation, we would have ways to translate compute (let’s put aside whether it’s training or inference) into intelligence, and so the amount of intelligence that an entity controls is proportional to the amount of compute it has. So I am not thinking of ASIs as individual units, but more about total intelligence.
I 100% agree that control of compute would be crucial, and the hope is that, like current material strength (money and weapons), it would be largely controlled by entities that are at least somewhat responsive to the will of the people.
Re your first point, I agree that there is no easy solution, but I am hoping that AIs would interpret the laws within the spectrum of how (say) the more reasonable 60% of judges do it today. That is, I think good judges try to be humble and respect the will of the legislators, but the crazier or more extreme the consequences of following the law would be, the more willing they are to apply creative interpretations to maintain a morally good (or at least not extremely bad) outcome.
I don’t think any moral system tells us what to do, but yes, I expressly hold the position that humans should be in control even if they are much less intelligent than the AIs. I don’t think we need “philosopher kings.”
Thanks all for commenting! Just a quick apology for being behind on responding, but I do plan to get to it!
Six Thoughts on AI Safety
Also, the thing I am most excited about with deliberative alignment is that it becomes better as models become more capable. o1 is already more robust than o1-preview, and I fully expect this to continue.
(P.S. Apologies in advance if I’m unable to keep up with comments; I popped in from holiday to post on the DA paper.)
As I say here https://x.com/boazbaraktcs/status/1870369979369128314
Constitutional AI is great work, but Deliberative Alignment is fundamentally different. The difference is basically system 1 vs. system 2. In RLAIF, the generative model that ultimately answers the user prompt is trained with (prompt, good response, bad response) tuples. Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and, most importantly, not how to reason about this text in the context of a particular example.
This ability to reason is crucial for OOD performance, such as training only on English and generalizing to other languages or to encoded output.
See also https://x.com/boazbaraktcs/status/1870285696998817958
I was thinking of this as a histogram: the probability that the model solves the task at each level of quality.
I indeed believe that regulation should focus on deployment rather than on training.
See also my post https://www.lesswrong.com/posts/gHB4fNsRY8kAMA9d7/reflections-on-making-the-atomic-bomb
the Manhattan project was all about taking something that’s known to work in theory and solving all the Z_n’s
To be clear, I think that embedding human values is part of the solution; see my comment.