If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
I basically think your sixth-to-last (or so) bullet point is key—an AI that takes over is likely to be using a lot more RL on real world problems, i.e. drawn from a different distribution than present-day AI. This will be worse for us than conditioning on a present-day AI taking over.
Cool stuff!
I’m a little confused what it means to mean-ablate each node...
Oh, wait. ctrl-f shows me the Non-Templatic data appendix. I see, so you’re tracking the average of each feature, at each point in the template. So you can learn a different mask at each token in the template and also learn a different mean (and hopefully your data distribution is balanced / high-entropy). I’m curious—what happens to your performance with zero-ablation (or global mean ablation, maybe)?
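In case it helps pin down what I'm asking about, here's a minimal numpy sketch of the three ablation variants, assuming features are stored per template position (all shapes and names here are made up, not your setup):

```python
import numpy as np

# Hypothetical shapes: one feature vector per example per template position.
# acts: [n_examples, n_template_positions, n_features]
acts = np.random.randn(1000, 12, 256)
mask = np.random.rand(12, 256) < 0.1   # placeholder for a learned per-position keep-mask

# Per-position mean ablation: masked-out features are replaced by their
# average value at that template position (averaged over examples).
per_pos_mean = acts.mean(axis=0, keepdims=True)        # [1, 12, 256]
mean_ablated = np.where(mask, acts, per_pos_mean)

# Zero ablation: masked-out features are replaced by zero instead.
zero_ablated = np.where(mask, acts, 0.0)

# Global mean ablation: one mean per feature, shared across all positions.
global_mean = acts.mean(axis=(0, 1), keepdims=True)    # [1, 1, 256]
global_mean_ablated = np.where(mask, acts, global_mean)
```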
Excited to see what you come up with for non-templatic tasks. Presumably on datasets of similar questions, similar attention-control patterns will be used, and maybe it would just work to (somehow) find which tokens are getting similar attention, and assign them the same mask.
It would also be interesting to see how this handles more MLP-heavy tasks like knowledge questions. Maybe someone clever can find a template for questions about the elements, or the bibliographies of various authors, etc.
No, you’re right that aristocracy is more complicated. There were lots of pressures that shaped the form of it. Certainly more than how good of managers aristocrats made!
An invalid syllogism: “The rules of aristocracy were shaped by forces. Avoiding poor management is a force. Therefore, the rules of aristocracy will be all about avoiding poor management.”
Aristocrats were also selected for how well they could extract rents from those below, and how well they could resist rent-extraction from above, both alone and collectively. Nor was the top-down pressure all about making aristocrats into productive managers: besides the rent-extraction already mentioned, there was weakening the aristocracy to secure central power, allowing advancement via marriage and alliance, various human status games, and the need for a legislative arm of government.
I don’t want to hear the One Pressure That Explains Everything (but only qualitatively, and if you squint). I’ll want to hear when they have the dozen pressures that make up a model that can be quantitatively fit to past data by tuning some parameters, including good retrodictive accuracy over a held-out time period.
I think if you want to go fast, and you can eat the rest of the solar system, you can probably make a huge swarm of fusion reactors to help blow matter off the sun. Let’s say you can build 10^11-watt reactors that work in space. Then you need about 10^15 of them to match the sun. If each is 10^6 kg, this is roughly 10^-2 of Mercury’s mass.
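Rough arithmetic behind those numbers (the per-reactor power and mass are just assumptions I picked):

```python
# Back-of-the-envelope check of the round numbers above.
sun_luminosity_w = 3.8e26        # solar power output, W
reactor_power_w = 1e11           # assumed per-reactor output, W
reactor_mass_kg = 1e6            # assumed per-reactor mass, kg
mercury_mass_kg = 3.3e23         # mass of Mercury, kg

n_reactors = sun_luminosity_w / reactor_power_w        # ~4e15 reactors
total_mass_kg = n_reactors * reactor_mass_kg           # ~4e21 kg
fraction_of_mercury = total_mass_kg / mercury_mass_kg  # ~1e-2

print(f"{n_reactors:.1e} reactors, {total_mass_kg:.1e} kg, "
      f"{fraction_of_mercury:.1e} of Mercury's mass")
```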
I was expecting (Methods start 16:00)
When you find a fence in a field, someone once built that fence on purpose and had a reason for it. So it’s good sense to ask after that reason, and guess ahead of time that it might be worth a fence, to the owner of the field.
When you find a rock in a field, probably nobody put that rock there on purpose. And so it’s silly to go “What is the reason this rock was put here? I might not know now, but I can guess ahead of time it might be worth it to me!”
I agree it’s a good point that you don’t need the complexity of the whole world to test ideas. With an environment that’s fairly small in terms of number of states, you can encode interesting things in a long sequence of states so long as the generating process is sufficiently interesting. And adding more states is itself no virtue if it doesn’t help you understand what you’re trying to test for.
Some out-of-order thoughts:
Testing for ‘big’ values, e.g. achievement, might require complex environments and evaluations. Not necessarily large state spaces, but the complexity of differentiating between subtle shades of value (which seems like a useful capability to be sure we’re getting) has to go somewhere.
Using more complicated environments that are human-legible might better leverage human feedback and/or make sense to human observers—maybe you could encode achievement in the actions of a square in a gridworld, but maybe humans would end up making a lot of mistakes when trying to judge the outcome. If you want to gather data from humans, to reflect a way that humans are complicated that you want to see if an AI can learn, a rich environment seems useful. On the other hand, if you just want to test general learning power, you could have a square in a gridworld have random complex decision procedures and see if they can be learned.
There’s a divide between contexts where humans are basically right, and so we just want an AI to do a good job of learning what we’re already doing, and contexts where humans are inconsistent, or disagree with each other, where we want an AI to carefully resolve these inconsistencies/disagreements in a way that humans endorse (except also sometimes we’re inconsistent or disagree about our standards for resolving inconsistencies and disagreements!).
Building small benchmarks for the first kind of problem seems kind of trivial in the fully-observed setting where the AI can’t wirehead. Even if you try to emulate the partial observability of the real world, and include the AI being able to eventually control the reward signal as a natural part of the world, it seems like seizing control of the reward signal is the crux rather than the content of what values are being demonstrated inside the gridworld (I guess it’s useful to check if the content matters, I just don’t expect it to), and a useful benchmark might be focused on how seizing control of the reward signal (or not doing so) scales to the real world.
Building small benchmarks for the latter kind of problem seems important. The main difficulty is more philosophical than practical—we don’t know what standard to hold the benchmarks to. But supposing we had some standard in mind, I would still worry that a small benchmark would be more easily gamed, and more likely to miss some of the ways humans are inconsistent or disagree. I would also expect benchmarks of this sort, whatever the size, to be a worse fit for normal RL algorithms, and run into issues where different learning algorithms might request different sorts of interaction with the environment (although this could be solved either by using real human feedback in a contrived situation, or by having simulated inhabitants of the environment who are very good at giving diverse feedback).
Honestly I think this is still too optimistic. Humans are not consistent economic actors—we can be persuaded of new things even if those things are subjective, will sometimes take deals we might in other circumstances call unfavorable, and on an absolute scale aren’t all that bright. Owning capital does not fix this, and so an AI that’s good at interacting with humans will be able to get more from us than you might expect just looking at the current economy.
As Sean Carroll likes to say, though, the reason we’ve made so much progress in physics is that it’s way easier than the other sciences :)
Voluntary interaction has been great for humans. But it hasn’t been great for orangutans, who don’t do a very good job of participating in society.
Even if you somehow ensure transparency and cooperation among superintelligent AIs and humans, it seems overwhelmingly likely that humans will take the place of the orangutan, marginalized and taken from in every way possible within the limits of what is, in the end, not a very strict system. It is allowed, as Eliezer would say.
Orangutans don’t contribute to human society even though they’re specialized in things humans aren’t. The best chess player in the world isn’t a human-AI symbiote, for the same reason it’s not an orangutan-human-AI symbiote.
Human trades with superintelligent AI do not have to be Pareto improvements (in the common-sense way), because humans make systematic mistakes (according to the common-sense standard). If you actually knew how to detect what trades would be good for humans—how to systematize that common sense, and necessarily also how to improve it since it is itself inconsistent and systematically mistaken—this would be solving the key parts of the value alignment problem that one might have hoped to sidestep by relying on voluntarism instead.
I’m not excited by gridworlds, because they tend to skip straight to representing the high-level objects we’re supposed to value, without bothering to represent all the low-level structure that actually lets us learn and generalize values in the real world.
Do you have plans for how to deal with this, or plans to think about richer environments?
Because ‘alignment’ is used in several different ways, I feel like these days one either needs to asterisk in a definition (e.g. “By ‘alignment,’ I mean the AI faithfully carrying out instructions without killing everyone.”), or just use a more specific phrase.
I agree that instruction-following is not all you need. Many of these problems are solved by better value-learning.
I strongly suspect that if you try to set the regularization without checking how well it does, you’ll either get an unintelligent policy that’s extraordinarily robust, or you’ll get wireheading with error-correction (if wireheading was incentivized without the regularization).
I sometimes come back to think about this post. Might as well write a comment.
Goodhart’s law. You echo the common frame that an approximate value function is almost never good enough, and that’s why Goodhart’s law is a problem. Probably what I thought when I first read this post was that I’d just written a sequence about how human values live inside models of humans (whether our own models or an AI’s), which makes that frame weird—weird to talk about an ‘approximate value function’ that’s not really an approximation to anything specific. The Siren Worlds problem actually points towards more meat—how do we want to model preferences for humans who are inconsistent, who have opinions about that inconsistency, who mistrust themselves though even that mistrust is imperfect?
You say basically all this at various points in the post, so I know it’s kind of superficial to talk about the initial framing. But to indulge my superficiality for a while, I’m curious about how it’s best to talk about these things (a) conveniently and yet (b) without treating human values as a unique target out there to hit.
In physics pedagogy there’s kind of an analogous issue, where intro QM is designed to steer students away from thinking in terms of “wave-particle duality”—which many students have heard about and want to think in terms of—by just saturating them with a frame where you think in terms of wave functions that sometimes give probability distributions that get sampled from (by means left unspecified).
My inclination is to do the same thing to the notion of “fixed, precise human values,” which is a convenient way to think about everyday life and which many people want to think of value learning in terms of. I’d love to know a good frame to saturate the initial discussion of amplified human values, identifiability, etc. with, one that would introduce those topics as obviously a consequence of human values being very “fuzzy” and of humans having self-reflective opinions about how they want to be extrapolated.
Helpers / Helpees / Ghosts section. A good section :)
I don’t think we have to go to any lengths to ‘save’ the ghosts example by supposing that a bunch of important values rest on the existence of ghosts. A trivial action (e.g. lighting incense for ghosts) works just as well, or maybe even no action, just a hope that the AI could do something for the ghosts.
It does seem obvious at first that if there are no ghosts, the AI should not light incense for them. But there’s some inherent ambiguity between models of humans that light incense for the sake of the ghosts, and models of humans that light incense for the sake of cultural conformity, and models of humans that light incense because they like incense. Even if the written text proclaims that it’s all for the ghosts, since there are no ghosts there must be other explanations for the behavior, and maybe some of those other explanations are at least a little value-shaped. I agree that what courses of action are good will end up depending on the details.
Maybe you get lured in by the “fixed, precise human values” frame here, when you talk about the AI knowing precisely how the human’s values would update upon learning there are no ghosts. Precision is not the norm from which needing to do the value-amplification-like reasoning is a special departure, the value-amplification-like reasoning is the norm from which precision emerges in special cases.
Wireheading. I’m not sure time travel is actually a problem?
Or at least, I think there are different ways to think of model-based planning with modeled goals, and the one in which time travel isn’t a problem seems like the more natural way.
The way to do model-based planning with modeled goals in which time travel is a problem is: you have a spread-out-in-time model of the world that you can condition on your different actions, and first you condition it on the action “time travel to a century ago and change human values to be trivially satisfied” and then you evaluate how well the world is doing according to the modeled function “human values as of one second ago, conditional on the chosen action.”
The way to do the planning in which time travel isn’t a problem is: you have a model of the world that tracks current and past state, plus a dynamics model that you can use to evolve the state conditional on different actions. The human values you use to evaluate actions are part of the unconditioned present state, never subjected to the dynamics.
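Here's a toy sketch of the contrast, with entirely made-up dynamics, just to show where each setup reads its value function from:

```python
# A toy illustration of the two planning setups above. Everything here is made up;
# it's only meant to show which value function each setup uses for scoring.

def rollout(state, action):
    """Toy dynamics model: an action may rewrite the world, including the recorded values."""
    future = dict(state)
    if action == "time_travel_and_rewrite_values":
        future["values"] = lambda s: 1.0          # trivially-satisfied values
        future["world_quality"] = 0.0
    elif action == "do_something_useful":
        future["world_quality"] = 0.7
    return future

state = {"values": lambda s: s["world_quality"], "world_quality": 0.1}
actions = ["time_travel_and_rewrite_values", "do_something_useful"]

# Setup 1 (time travel is a problem): score with the values found in the conditioned future.
def score_conditioned(action):
    future = rollout(state, action)
    return future["values"](future)

# Setup 2 (time travel isn't a problem): score with values frozen from the unconditioned present.
present_values = state["values"]
def score_frozen(action):
    return present_values(rollout(state, action))

print(max(actions, key=score_conditioned))  # picks the value-rewriting action
print(max(actions, key=score_frozen))       # picks the useful action
```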
On the other hand, this second way does seem like it’s making more, potentially unnecessary, commitments for the AI—if time travel is possible, what even is its dynamics model supposed to say is happening to the state of the universe? Humans have the exact same problem—we think weird thoughts like “after I time traveled, smallpox was eradicated sooner,” which imply the silly notion that the time travel happened at some time in the evolution of the state of the universe. Or are those thoughts so silly after all? Maybe if time travel is possible in the way normally understood, we should be thinking of histories of computations rather than histories of universes, and the first sort of AI is actually making a mistake by erasing histories of computation.
Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren’t simple reliable plans that make them happen. And sometimes there are simple reliable plans to seize control of the reward button.
Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to tunnel on evaluating it through that lens.
It would be very convenient if only undesired plans were difficult and precise, while only desired plans were error-tolerant. I don’t think this is the case in the real world—it’s hard to dodge sifting desired from undesired plans based on semantic content.
Some survey articles:
https://arxiv.org/abs/2306.05126
https://arxiv.org/pdf/2001.07092
The difference is that the weights are not initialised with random values at birth (or at the embryo stage, to be more precise).
The human cortex (the part we have way more of than chimps) is initialized to be made of a bunch of cortical column units, with slowly varying properties over the surface of the brain. But there’s decent evidence that there’s not much more initialization than that, and that that huge fraction of the brain has to slowly pick up knowledge within the human lifetime before it starts being useful, e.g. https://pmc.ncbi.nlm.nih.gov/articles/PMC9957955/
Or you could think about it like our DNA has on the order of a megabyte to spend on the brain, and the adult brain has on the order of a terabyte of information. So 99.99[..]% of the information in the adult brain comes from the learning algorithm, not the initialization.
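Making that ratio explicit, with both figures as order-of-magnitude guesses:

```python
# Rough numbers just to make the ratio explicit (both are order-of-magnitude guesses).
genome_brain_budget_bytes = 1e6   # ~a megabyte of DNA "spent" on the brain
adult_brain_info_bytes = 1e12     # ~a terabyte of information in the adult brain

learned_fraction = 1 - genome_brain_budget_bytes / adult_brain_info_bytes
print(f"{learned_fraction:.4%}")  # ~99.9999% from within-lifetime learning
```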
How much predictive power does this analogy have for you personally?
Yeah, it’s way more informative than the evolution analogy to me, because I expect human researchers + computers spending resources designing AI to be pretty hard to analogize to evolution, but learning within AI to be within a few orders of magnitude on various resources to learning within a brain’s lifetime.
Nice! Purely for my own ease of comprehension I’d have liked a little more translation/analogizing between AI jargon and HCI jargon—e.g. the phrase “active learning” doesn’t appear in the post.
Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good at predicting what aligned behavior should be in out-of-distribution scenarios, but it’s unlikely that AI will be able to figure out what humans want in completely new situations without humans being consulted and kept in the loop.
I disagree in several ways.
Humans being definitionally accurate is a convenient assumption on easy problems, but breaks down on hard problems. The thing is, human responses to questions are not always best thought of as direct access to some underlying truth—we give different answers in different contexts, and have to be interpreted in sometimes quite sophisticated ways to turn those question-responses into good choices between actions in the real world. There are even cases where humans will give some answer on the object level, but when asked a meta-level question about their object-level answer will disagree with themselves (perhaps even endorsing some non-human process including AI). If humans were always accurate this would be a paradox.
AI is going to get smart. Eventually, quite smart. Smart enough to predict human behavior in new contexts kind of smart. On the one hand this is good news because it means that if we can reduce moral questions to empirical questions about the behavior of humans in novel contexts (and I mean do it in a philosophically satisfying way, not just try something that sounds good and hope it works), we’re almost done. On the other hand this is bad news because it means that AI ignorance about human behavior cannot be used to ensure properties like corrigibility, and predictions of future AI-human interaction based on assumptions of AI ignorance have to be abandoned.
Thanks for the great reply :) I think we do disagree after all.
Except about that—here we agree.
This might be summarized as “If humans are inaccurate, let’s strive to make them more accurate.”
I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren’t actually that consistent, even in what we’d consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And, how do you square a univariate notion of ‘accuracy’ with the sensitivity of human conclusions to semi-arbitrary changes to e.g. their reading lists, or the framings of arguments presented to them?).
Instead, I think our strategy should be “If humans are inconsistent and disagree, let’s strive to learn a notion of human values that’s robust to our inconsistency and disagreement.”
A committee of humans reviewing an AI’s proposal is, ultimately, a physical system that can be predicted. If you have an AI that’s good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.
(And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI’s decision-making.)
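As a minimal sketch of what I mean (all names are placeholders, not any real system): if the committee's verdict is predictable, the committee is just another function the AI can fold into its own planning loop.

```python
from typing import Callable

def choose_proposal(proposals: list[str],
                    predicted_committee: Callable[[str], float]) -> str:
    """Pick the proposal with the highest *predicted* committee approval.

    If the prediction is accurate, convening the actual committee adds nothing new:
    the oversight step has already been folded into the AI's own planning loop.
    """
    return max(proposals, key=predicted_committee)

# Toy usage with a stand-in approval predictor.
approval_model = lambda proposal: 0.9 if "cautious" in proposal else 0.4
print(choose_proposal(["bold plan", "cautious plan"], approval_model))  # -> "cautious plan"
```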