Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics And Computation.
davidad
I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.
-
I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally represented goals”, calling the other one safer—or different at all, from a safety perspective—would be a mistake. (And, in the long run, a mistake that opens Goodhartish loopholes for models to avoid having the internal properties we don’t like without actually being safer.)
-
The fix is, roughly, to identify the “internal representations” in any suitable causal explanation of the extensionally observable behavior, including but not limited to a causal explanation whose variables correspond naturally to the “actual” computational graph that implements the policy/model.
-
If we can identify internally represented [goals, etc] in the actual computational graph, that is of course strong evidence of (and in the limit of certainty and non-approximation, logically implies) internally represented goals in some suitable causal explanation. But the converse and the inverse of that implication would not always hold.
-
I believe many non-safety ML people have a suspicion that safety people are making a vaguely superstitious error by assigning such significance to “internal representations”. I don’t think they are right exactly, but I think this is the steelman, and that if you can adjust to this perspective, it will make those kinds of critics take a second look. Extensional properties that scientists give causal interpretations to are far more intuitively “real” to people with a culturally stats-y background than supposedly internal/intensional properties.
Sorry these comments come at the last day; I wish I had read it in more depth a few days ago.
Overall the paper is good and I’m glad you’re doing it!
-
Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.
In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those points didn’t get enough emphasis and people focused more on route 1 there, which seems much more hopeless.
From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it’s not clear whether
you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
your policy-scoring “function” was actually stochastic and “defined” by the physical process of humans interacting with the AI’s actions and clicking
Merge
buttons, and this incorrect policy-scoring function was incorrect, but adequately optimized for.
I tend to favor the latter interpretation—I’d say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are more a Reward Specification (roughly outer alignment) problem. Only when you do have “programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon” is the decomposition into Reward Specification and Adequate Policy Learning really useful.
I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.
Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.
I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).
If we were still using 10-digit decimal words like the original ENIAC and other early computers, I’d be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.
The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunct of many other properties, etc.) such that conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.
Regarding the challenges of understanding formal descriptions, I’m hopeful about this because of
natural abstractions (so the best formal representations could be shockingly compact)
code review (Google’s codebase is not exactly “a formal property,” unless we play semantics games, but it is highly reliable, fully machine-readable, and every one of its several billion lines of code has been reviewed by at least 3 humans)
AI assistants (although we need to be very careful here—e.g. reading LLM outputs cannot substitute for actually understanding the formal representation since they are often untruthful)
Shouldn’t we plan to build trust in AIs in ways that don’t require humans to do things like vet all changes to its world-model?
Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.
In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):
Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire network, not just the most interpretable parts. If it proves a property, that property is true about the actual neural network. The Model-Checking Feasibility Hypothesis says that this is feasible, regardless of the infeasibility of a human understanding the policy or any details of the proof. (We would rely on a verified verifier for the proof, of which humans would understand the details.)
Factoring learned information through human understanding. If we denote learning by , human understanding by , and big effects on the world by , then “factoring” means that (for some and ). This is in the same spirit as “human in the loop,” except not for the innermost loops of real-time action. Here, the Scientific Sufficiency Hypothesis implies that even though is “larger” than in the sense you point out, we can throw away the parts that don’t fit in and move forward with a fully-understood world model. I believe this is likely feasible for world models, but not for policies (optimal policies for simple world models, like Go, can of course be much better than anything humans understand).
Strong upvoted.
Bird’s eye perspective: All information theory is just KL-divergence and priors, all priors are just Gibbs measures, and algorithmic information theory is just about how computational costs should be counted as “energy” in the Gibbs measure (description length vs time vs memory, etc).
Frog’s eye perspective: taking the pushforward measure along the semantics of the language collects all the probability mass of each ’s entire extensional-equivalence-class; no equally natural operation collects only the mass from the single maximum-probability representative.
Both seem underrated.
I’d say the scientific understanding happens in step 1, but I think that would be mostly consolidating science that’s already understood. (And some patching up potentially exploitable holes where AI can deduce that “if this is the best theory, the real dynamics must actually be like that instead”. But my intuition is that there aren’t many of these holes, and that unknown physics questions are mostly underdetermined by known data, at least for quite a long way toward the infinite-compute limit of Solomonoff induction, and possibly all the way.)
Engineering understanding would happen in step 2, and I think engineering is more “the generator of large effects on the world,” the place where much-faster-than-human ingenuity is needed, rather than hoping to find new science.
(Although the formalization of the model of scientific reality is important for the overall proposal—to facilitate validating that the engineering actually does what is desired—and building such a formalization would be hard for unaided humans.)
For most s I agree, but the existence of any adversarial examples for would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist—and that humans discover—natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)
I find Eric Drexler’s arguments convincing about how it seems possible to make collusion very unlikely. On the other hand, I do think it requires nontrivial design and large ensembles; in the case of an unconstrained 2-player game (like Safety via Debate), I side with Eliezer that the probability of collusion probably converges toward 1 as capabilities get more superintelligent.
Another key principle that I make use of is algorithms (such as branch-and-bound and SMT solvers) whose performance—but not their correctness—depends on extremely clever heuristics. Accelerating the computation of more accurate and useful bounds seems to me like a pretty ineffectual causal channel for the AIs playing those heuristic roles to coordinate with each other or to seek real-world power.
What about collusion?
Inner misalignment is a story for why one might expect capable but misaligned out-of-distribution behaviour, which is what’s actually bad. Model-checking could rule that out entirely (relative to the formal specification)— whether it’s “inner misalignment” or “goal misgeneralization” or “deceptive alignment” or “demons in Solmonoff induction” or whatever kind of story might explain such output. Formal verification is qualitatively different from the usual game of debugging whack-a-mole that software engineers play to get software to behave acceptably.
What about inner misalignment?
Nate [replying to Eric Drexler]: I expect that, if you try to split these systems into services, then you either fail to capture the heart of intelligence and your siloed AIs are irrelevant, or you wind up with enough AGI in one of your siloes that you have a whole alignment problem (hard parts and all) in there. Like, I see this plan as basically saying “yep, that hard problem is in fact too hard, let’s try to dodge it, by having humans + narrow AI services perform the pivotal act”. Setting aside how I don’t particularly expect this to work, we can at least hopefully agree that it’s attempting to route around the problems that seem to me to be central, rather than attempting to solve them.
I think, in an open agency architecture, the silo that gets “enough AGI” is in step 2, and it is pointed at the desired objective by having formal specifications and model-checking against them.
But I also wouldn’t object to the charge that an open agency architecture would “route around the central problem,” if you define the central problem as something like building a system that you’d be happy for humanity to defer to forever. In the long run, something like more ambitious value learning (or value discovery) will be needed, on pain of astronomical waste. This would be, in a sense, a compromise (or, if you’re optimistic, a contingency plan), motivated by short timelines and insufficient theoretical progress toward full normative alignment.
GPT-3.5-Nate [prompted using 1061 tokens of Nate’s criticism of related ideas]: This is a really interesting proposal! It seems to me that you’re trying to solve the hard problem of AI alignment, by proposing a system that would be able to identify the concepts that an AGI should be optimizing for, and then use a combination of model-checking and time-bounded optimization to ensure that the AGI’s behavior is robustly directed at those concepts. This is a really ambitious project, and I’m excited to see how it develops.
🙃
How would you respond to predicted objections from Nate Soares?
For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.