In addition to the point that current models are already strongly superhuman in most ways, I think that if you buy the idea that we’ll be able to do automated alignment of ASI, you’ll still need some reliable approach to “manual” alignment of current systems. We’re already far past the point where we can robustly verify LLMs’ claims or reasoning outside of narrow domains like programming and math.
But on point two, I strongly agree that agent foundations and Davidad’s agendas are also worth pursuing. (And in a sane world, we should have tens or hundreds of millions of dollars in funding for each of these every year.) Instead, it looks like we have Davidad’s ARIA funding, Jaan Tallinn and the LTFF funding some agent foundations and SLT work, and that’s basically it. MIRI has abandoned agent foundations, and Open Philanthropy, it seems, isn’t putting money or effort into them either.
I partly disagree; steganography is only useful when it’s possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
That said, I’d be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar—it definitely seems like something that @davidad would want to consider.
I really like that idea, and the clarity it provides, and have renamed the post to reflect it! (Sorry this was so slow; I’m travelling.)
That seems fair!
I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that’s not true for proposals like safeguarded AI, where a proof must accompany the output, and it’s not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
On the absolute safety, I very much like the way you put it, and will likely use that framing in the future, so thanks!
On impossibility results, there are some, and I definitely think this is a good question, but I also agree this isn’t quite the right place to ask. I’d suggest talking to some of the agent foundations people for suggestions.
I think these are all really great things that we could formalize and build guarantees around. I think some of them are already ruled out by the responsibility-sensitive safety guarantees, but others certainly are not. On the other hand, I don’t think that use of cars to do things that violate laws completely unrelated to vehicle behavior is in scope; similar to what I mentioned to Oliver, if what is needed in order for a system to be safe is that nothing bad can be done with it, you’re heading in the direction of a claim that the only safe AI is a universal dictator with sufficient power to control all outcomes.
But in cases where provable safety guarantees are in place, and the issues relate to car behavior—such as cars causing damage, blocking roads, or being redirected away from the intended destination—I think hardware guarantees on the system, combined with software guarantees and verification that only trusted code is being run, could be used to ignition-lock cars which have been subverted.
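To sketch the mechanism I have in mind (all names and keys below are hypothetical placeholders; a real design would anchor the measurement in a hardware root of trust such as a TPM quote, not a bare HMAC):

```python
# Toy sketch of gating ignition on attestation that only trusted code is running.
# Identifiers and keys are hypothetical; this only illustrates the shape of the check.
import hashlib
import hmac

TRUSTED_FIRMWARE_HASHES = {"<sha256 of an audited firmware build>"}  # placeholder
DEVICE_KEY = b"hypothetical-device-key"

def measure(firmware_image: bytes) -> str:
    """Measurement of the code actually loaded on the vehicle controller."""
    return hashlib.sha256(firmware_image).hexdigest()

def attest(firmware_image: bytes, nonce: bytes) -> bytes:
    """Device side: sign the measurement together with a fresh nonce."""
    return hmac.new(DEVICE_KEY, nonce + measure(firmware_image).encode(), hashlib.sha256).digest()

def ignition_allowed(firmware_image: bytes, nonce: bytes, signature: bytes) -> bool:
    """Verifier side: allow ignition only if the signature checks out and the
    measured firmware is on the trusted list, i.e. the car has not been subverted."""
    expected = attest(firmware_image, nonce)
    return hmac.compare_digest(expected, signature) and measure(firmware_image) in TRUSTED_FIRMWARE_HASHES
```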
And I think that in the remainder of cases, where cars are being used for dangerous or illegal purposes, we need to trade off freedom and safety. I certainly don’t want AI systems which can conspire to break the law—and in most cases, I expect that this is something LLMs can already detect—but I also don’t want a car which will not run if it determines that a passenger is guilty of some unrelated crime like theft. But for things like “deliver explosives or disperse pathogens,” I think vehicle safety is the wrong path to preventing dangerous behavior; it seems far more reasonable to have separate systems that detect terrorism, and separate types of guarantees to ensure LLMs don’t enable that type of behavior.
Yes, after saying it was about what they need “to do not to cause accidents” and that “any accidents which could occur will be attributable to other cars’ actions,” which I then added caveats to regarding pedestrians, I said “will only have accidents” when I should have said “will only cause accidents.” I have fixed that with another edit. But I think you’re confused about what I’m trying to show.
Principally, I think you are wrong about what needs to be shown for safety in the sense I outlined, or are trying to say that the sense I outlined doesn’t deliver something I never claimed it would. If what is needed in order for a system to be safe is that no damage will be caused in situations which involve the system, you’re heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes. My claim, on the other hand, is that in sociotechnological systems, the way safety is achieved is by creating guarantees that each actor—human or AI—behaves according to rules that minimize foreseeable dangers. That would include safeguards against stupid, malicious, or dangerous human actions, much as human systems have laws about dangerous actions. However, in a domain like driving, just as it’s impossible for human drivers to both get where they are going and never hit pedestrians who act erratically and jump out from behind obstacles into the path of an oncoming car, a safe autonomous vehicle wouldn’t be expected to solve every possible case of human misbehavior—just to drive responsibly.
More specifically, you make the claim that “as far as I can tell it would totally be compatible with a car driving extremely recklessly in a pedestrian environment due to making assumptions about pedestrian behavior that are not accurate.” The paper, on the other hand, says “For example, in a typical residential street, a pedestrian has the priority over the vehicles, and it follows that vehicles must yield and be cautious with respect to pedestrians,” and formalizes this with statements like “a vehicle must be in a kinematic state such that if it will apply a proper response (acceleration for ρ seconds and then braking) it will remain outside of a ball of radius 50cm around the pedestrian.” I also think that it formalizes reasonable behavior for pedestrians, but I agree that it won’t cover every case—pedestrians oblivious to cars that are otherwise driving safely, who rapidly change their path to jump in front of those cars, can sometimes be hit—but I think fault is pretty clear here. (And the paper is clear that even in those cases, the car would need to both drive safely in residential areas and attempt to brake or avoid the pedestrian, even when humans behave irresponsibly and erratically!)
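For concreteness about the flavor of the formalization, the paper’s longitudinal safe-distance condition (for one agent following another) is, as best I recall it, roughly:

$$
d_{\min} = \left[\, v_r \rho + \tfrac{1}{2} a_{\max,\mathrm{accel}}\, \rho^2 + \frac{(v_r + \rho\, a_{\max,\mathrm{accel}})^2}{2\, a_{\min,\mathrm{brake}}} - \frac{v_f^2}{2\, a_{\max,\mathrm{brake}}} \,\right]_+
$$

where $v_r$ and $v_f$ are the rear and front agents’ speeds, $\rho$ is the response time, the $a$ terms bound acceleration and braking, and $[x]_+ = \max(x,0)$. The pedestrian rule quoted above has the same structure: the proper response (accelerate for at most $\rho$ seconds, then brake) must keep the car outside the 50cm ball.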
But again, as I said initially, this isn’t solving the general case of AI safety; it’s solving a much narrower problem. And if you wanted to make the case that this isn’t enough for similar scenarios that we care about, I will strongly agree that for more capable systems, the set of situations they would need to avoid is correspondingly larger, and the necessary guarantees are far stronger. But as I said at the beginning, I’m not making that argument—just the much simpler one that provability can work in physical systems, and can be applied in sociotechnological systems in ways that make sense.
I agree that “safety in an open world cannot be proved,” at least as a general claim, but disagree that this impinges on the narrow challenge of designing cars that do not cause accidents—a distinction I tried to be clear about, but evidently failed to make sufficiently clear, as Oliver’s misunderstanding illustrates. That said, I strongly agree that better methods for representing grain-of-truth problems, and for considering hypotheses outside those in the model, are critical. It’s a key reason I’m supporting work on infra-Bayesian approaches, which are designed explicitly to handle this class of problem. Again, it’s not necessary for the very narrow challenge I think I addressed above, but I certainly agree that it’s important.
Second, I’m a huge proponent of complex-systems engineering approaches, and have discussed this in previous, unrelated work. I certainly agree that these issues are critical and should receive more attention—but I think it’s counterproductive to try to embed difficult problems inside of addressable ones. To offer an analogy, creating provably safe code that isn’t vulnerable to any known technical exploit will not prevent social engineering attacks, but we can still accomplish the narrow goal.
If, instead of writing code that can’t be fuzzed for vulnerabilities, doesn’t contain buffer-overflow or null-pointer bugs, can’t be exploited via transient-execution CPU vulnerabilities, and isn’t vulnerable to rowhammer attacks, you insist that we need to address social engineering before trying to make the code provably safe, and that social engineering should itself be addressed with provable properties, then you’re sabotaging progress in a tractable area in order to apply a paradigm ill-suited to the new problem you’re concerned with.
That’s why, in this piece, I started by saying I wasn’t proving anything general, and “I am making far narrower claims than the general ones which have been debated.” I agree that the larger points are critical. But for now, I wanted to make a simpler point.
To start at the end, you claim I “straightforwardly made an inaccurate unqualified statement,” but you replaced my statement about “what a car needs to do not to cause accidents” with “no accidents will take place.” And I certainly agree that there is an “extremely difficult and crucial step of translating a formal toy world like RSS into real world outcomes,” but the toy model the paper deals with is therefore one of rule-following entities, both pedestrians and cars. That’s why it doesn’t need to account for “what if pedestrians do something illegal and unexpected.”
Of course, I agree that this drastically limits the proof, or as I said initially, “relying on assumptions about other car behavior is a limit to provable safety,” but you seem to insist that because the proof doesn’t do something I never claimed it did, it’s glossing over something.
That said, I agree that I did not discuss pedestrians, but as you sort-of admit, the paper does—it treats stationary pedestrians not at crosswalks, and not on sidewalks, as largely unpredictable entities that may enter the road. For example, it notes that “even if pedestrians do not have priority, if they entered the road at a safe distance, cars must brake and let them pass.” But again, you’re glossing over the critical assumption for the entire section, which is responsibility for accidents. And this is the crux: the claim is not that pedestrians and other cars cannot cause accidents, but that the safe car will not do so.
Given all of that, to get back to the beginning, your initial position was that “RSS seems miles away from anything that one could describe as a formalization of how to avoid an accident.” Do you agree that it’s close to “a formalization of how to avoid causing an accident”?
Have you reviewed the paper? (It is the first link under “The RSS Concept” in the page which was linked to before, though perhaps I should have linked to it directly.) It seems to lay out the proof, and discusses pedestrians, and deals with most of the objections you’re raising, including obstructions and driving off of marked roads. I admit I have not worked through the proof in detail, but I have read through it, and my understanding is that it was accepted, and a large literature has been built that extends it.
And the objections about slippery roads and braking are the set of things I noted under “traditional engineering analysis and failure rates.” I agree that the guarantees are non-trivial, but they also aren’t outside of what is already done in safety analysis, and there is explicit work on the issue in the literature, both on the verification and validation side and on the side of perceiving and sensing weather conditions.
I agree that it’s the most challenging part, and there are unsolved problems, but I don’t share your intuition that it’s in some way unsolvable, so I suspect we’re thinking of very different types of things.
For RSS specifically, Rule 5 is obviously the most challenging, but it’s also not in general required for the not-being-at-fault guarantee, and Rule 4 is largely about ensuring that the relationship between sensor uncertainty in low-visibility areas and the other rules—respecting distance and not hitting things—is enforced. Other than that, right-of-way rules are very simple, provided the car correctly detects that the situation is one where they apply, and changing lanes is based on a very simple formula for distance. And assuming the car isn’t changing lanes, following the rules during ordinary driving essentially only requires restricting speed, which seems like something you can check very easily.
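To make “check very easily” concrete, here is a minimal sketch of the kind of runtime monitor I have in mind, using the standard RSS longitudinal-distance bound; the parameter values are illustrative placeholders, not the paper’s calibrated numbers:

```python
# Minimal sketch of an RSS-style longitudinal safety monitor. The formula is the
# standard RSS safe-following-distance bound; parameter values are illustrative.

def rss_min_gap(v_rear: float, v_front: float,
                rho: float = 0.5,          # response time (s), illustrative
                a_max_accel: float = 3.0,  # max accel during response (m/s^2)
                a_min_brake: float = 4.0,  # guaranteed braking of rear car (m/s^2)
                a_max_brake: float = 8.0   # max braking of front car (m/s^2)
                ) -> float:
    """Minimum gap (m) so the rear car can stop in time even if it accelerates
    for rho seconds before braking while the front car brakes as hard as possible."""
    v_resp = v_rear + rho * a_max_accel
    d = (v_rear * rho
         + 0.5 * a_max_accel * rho ** 2
         + v_resp ** 2 / (2 * a_min_brake)
         - v_front ** 2 / (2 * a_max_brake))
    return max(d, 0.0)

def speed_is_safe(gap: float, v_rear: float, v_front: float) -> bool:
    """The check a supervisor would run each cycle: is the current gap at least
    the RSS minimum for the current speeds? If not, restrict speed / brake."""
    return gap >= rss_min_gap(v_rear, v_front)

# Example: following a car doing 20 m/s while doing 25 m/s with a 60 m gap.
print(rss_min_gap(25.0, 20.0))
print(speed_is_safe(60.0, 25.0, 20.0))
```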
As you sort of refer to, it’s also the case that the 7.5 hour run time can be paid once, and then remain true of the system. It’s a one-time cost!
So even if we have 100 different things we need to prove for a higher-level system, and even if each takes a year of engineering and mathematics research time plus a day or a month of compute time, we can do them in parallel, so this isn’t much of a bottleneck if the approach is pursued seriously. (Parallelization is straightforward if we can, for example, take the guarantee provided by one proof as an assumption in others, instead of trying to build a single massive proof.) And each such system built allows for provability guarantees for systems built with that component, if we can build composable proof systems or can separate the necessary proofs cleanly.
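As a toy illustration of what I mean by taking one proof’s guarantee as an assumption in another, here is the shape of the composition (names are made up; this is a sketch, not a real verification artifact):

```lean
-- Toy sketch: each component guarantee is proved separately (possibly at great
-- compute cost), and the system-level theorem only composes the guarantees,
-- so it never re-runs the component proofs.
theorem system_safe (BrakingOK SensingOK SystemSafe : Prop)
    (h_brake : BrakingOK)                          -- proved by team/prover A
    (h_sense : SensingOK)                          -- proved by team/prover B
    (h_compose : BrakingOK → SensingOK → SystemSafe) :
    SystemSafe :=
  h_compose h_brake h_sense
```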
Yes—I didn’t say it was hard without AI, I said it was hard. Using the best tech in the world, humanity doesn’t *even ideally* have ways to get AI to design safe useful vaccines in less than months, since we need to do actual trials.
I know someone who has done lots of reporting on lab leaks, if that helps?
Also, there are some “standard” EA-adjacent journalists who you could contact / someone could introduce you to, if it’s relevant to that as well.
Vaccine design is hard, and requires lots of work. Seems strange to assert that someone could just do it on the basis of a theoretical design. Viral design, though, is even harder, and to be clear, we’ve never seen anyone build one from first principles; the most we’ve seen is modification of extant viruses in minor ways where extant vaccines for the original virus are likely to work at least reasonably well.
I have a lot more to say about this, and think it’s worth responding to in much greater detail, but overall I think the post criticizes Omohundro and Tegmark’s more extreme claims somewhat reasonably, though very uncharitably, and then assumes that other proposals which seem related, especially the Dalrymple et al. approach, are essentially the same, and doesn’t engage with that specific proposal at all.
To be very specific about how I think the post is unreasonable: in a number of places, a seemingly steel-manned version of the proposals is presented, and then this version, rather than the initial proposal for formal verification, is attacked. That amounts to a straw-man criticism of the proposals actually being discussed!
For example, this post suggests that arbitrary DNA could be proved safe by essentially impossible modeling (“on-demand physical simulations of entire human bodies (with their estimated 36 trillion cells [9]), along with the interactions between the cells themselves and the external world and then run those simulations for years”). This is true, that would work—but the proposal ostensibly being criticized was to check narrower questions about whether DNA synthesis is being used to produce something harmful. And Dalrymple et al. explained explicitly elsewhere in the paper what they might have included (“Examples include machines provably impossible to login to without correct credentials, DNA synthesizers that provably cannot synthesize certain pathogens, and AI hardware that is provably geofenced, time-limited (“mortal”) or equipped with a remote-operated throttle or kill-switch. Provably compliant sensors can be specified to ensure “zeroization”, in which tampering with PCH is guaranteed to cause detection and erasure of private keys.”)
But that premise falls apart as soon as a large fraction of those (currently) highly motivated (relatively) smart tech workers can only get jobs in retail or middle management.
Yeah, I think the simplest thing for image generation is for model hosting providers to use a separate tool—and lots of work on that already exists. (see, e.g., this, or this, or this, for different flavors.) And this is explicitly allowed by the bill.
For text, it’s harder to do well, and you only get weak probabilistic identification, but it’s also easy to implement an Aaronson-like scheme, even if doing it really well is harder. (I say easy because I’m pretty sure I could do it myself, given, say, a month working with one of the LLM providers, and I’m wildly underqualified to do software dev like this.)
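To gesture at what I mean by “easy”: here is a toy sketch of the flavor of scheme I have in mind. It is my simplified reconstruction, not Aaronson’s actual design, and the key handling, context window, and scoring rule are all illustrative:

```python
# Toy sketch of a key-based text watermark in the spirit of Aaronson's proposal:
# sampling is derandomized by a secret key plus recent context in a way that
# preserves the model's output distribution, and a key-holder can run a
# statistical test on the result. Everything here is illustrative.
import hashlib
import math

SECRET_KEY = b"hypothetical-shared-secret"

def keyed_scores(context: tuple[str, ...], vocab: list[str]) -> dict[str, float]:
    """Pseudorandom score in (0, 1) for each token, determined by key + context."""
    scores = {}
    for tok in vocab:
        digest = hashlib.sha256(SECRET_KEY + "|".join(context).encode() + b"#" + tok.encode()).digest()
        scores[tok] = (int.from_bytes(digest[:8], "big") + 1) / (2 ** 64 + 2)
    return scores

def watermarked_sample(probs: dict[str, float], context: tuple[str, ...]) -> str:
    """Pick argmax of r ** (1 / p): marginally this samples from `probs`
    (the Gumbel/exponential trick), but the choice is fixed given key + context."""
    r = keyed_scores(context, list(probs))
    return max(probs, key=lambda t: r[t] ** (1.0 / max(probs[t], 1e-12)))

def detection_score(tokens: list[str], vocab: list[str], window: int = 3) -> float:
    """Mean of -log(1 - r) over the chosen tokens: around 1.0 for unwatermarked
    text, higher for watermarked text, with the gap depending on entropy."""
    total = 0.0
    for i, tok in enumerate(tokens):
        r = keyed_scores(tuple(tokens[max(0, i - window):i]), vocab)[tok]
        total += -math.log(1.0 - r)
    return total / max(len(tokens), 1)
```

The point is just that generation stays distribution-preserving while a key-holder gets a statistical test; how strong the signal is depends on the entropy of the text, which is why the identification is only probabilistic.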
Question for a lawyer: how is non-reciprocity not an interstate trade issue that federal courts can strike down?