If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Yeah, that’s true. I expect there to be a knowing/wanting split—AI might be able to make many predictions about how a candidate action will affect many slightly-conflicting notions of “alignment”, or make other long-term predictions, but that doesn’t mean it’s using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.
Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it’s acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants are quite possible.
In the good case with the AI modeling the whole problem, this might look like us starting out with enough of a solution to alignment that the vibe is less “we need to hurry and use the AI to do our work for us” and more “we’re executing a shared human-AI gameplan for learning human values that are good by human standards.”
In the bad case with the AI acting through feedback loops with humans, this might look like the AI never internally representing deceiving us; humans just keep using it in slightly wrong ways that end up making the future bad. (Perhaps by giving control to fallible authority figures, perhaps by presenting humans with superstimuli that cause value drift we think is bad from our standpoint outside the thought experiment, perhaps by defining “what humans want” in a way that captures many of the ‘advantages’ of deception for maximizing reward without triggering our interpretability tools that are looking for deception.)
I think particularly when the AI is acting in feedback loops with humans, we could get bounced between categories by things like human defectors trying to seize control of transformative AI, human society cooperating and empowering people who aren’t defectors, new discoveries made by humans about AI capabilities or alignment, economic shocks, international diplomacy, and maybe even individual coding decisions.
First, I agree with Dmitry.
But it does seem like maybe you could recover a notion of information bottleneck even without the Bayesian NN model. If you quantize real numbers to N-bit floating point numbers, there’s a very real quantity which is “how many more bits do you need to exactly reconstruct X, given Z?” My suspicion is that for a fixed network, this quantity grows linearly with N (and if it’s zero at ‘actual infinity’ for some network despite being nonzero in the limit, maybe we should ignore actual infinity).
But this isn’t all that useful; it would be nicer to have an information measure that converges. The divergence seems a bit silly, too, because it treats the millionth digit as just as important as the first.
So suppose you don’t want to perfectly reconstruct X. Instead, maybe you could say the distribution of X is made of some fixed number of bins or summands, and you want to figure out which one based on Z. Then you get a converging amount of information, and you correctly treat small numbers as less important, but you’ve had to introduce this somewhat arbitrary set of bins. shrug
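A minimal sketch of the binned version, assuming we literally just discretize both X and Z into a fixed number of bins and estimate mutual information from samples (the noise model and all names here are made up for illustration):

```python
# Minimal sketch of the "fixed number of bins" idea: estimate how much Z tells
# you about which bin X falls into. Everything here is illustrative.
import numpy as np

def binned_mutual_information(x, z, n_bins=16):
    """Estimate I(bin(X); bin(Z)) in bits by histogramming paired samples."""
    joint, _, _ = np.histogram2d(x, z, bins=n_bins)
    joint /= joint.sum()                      # joint distribution p(i, j)
    px = joint.sum(axis=1, keepdims=True)     # marginal over X bins
    pz = joint.sum(axis=0, keepdims=True)     # marginal over Z bins
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px @ pz)[nz]))

# Z is a noisy copy of X; the estimate converges as you add samples, but its
# value depends on the somewhat arbitrary choice of n_bins.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
z = x + 0.5 * rng.normal(size=100_000)
print(binned_mutual_information(x, z))
```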
A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/√2 and |-> = (|0> - |1>)/√2. In your ontology these are actually different machines that produce different states.
I wonder if this can be resolved by treating the randomness of the machines quantum mechanically, rather than having this semi-classical picture where you start with some randomness handed down from God. Suppose these machines use quantum mechanics to do the randomization in the simplest possible way—they have a hidden particle in state |left>+|right> (pretend I normalize), they mechanically measure it (which from the outside will look like getting entangled with it) and if it’s on the left they emit their first option (|0> or |+> depending on the machine) and vice versa.
So one system, seen from the outside, goes into the state |L,0>+|R,1>, the other one into the state |L,0>+|R,0>+|L,1>-|R,1>. These have different density matrices. The way you get down to identical density matrices is to say you can’t get the hidden information (it’s been shot into outer space or something). And then when you assume that and trace out the hidden particle, you get the same representation no matter your philosophical opinion on whether to think of the un-traced state as a bare state or as a density matrix. If on the other hand you had some chance of eventually finding the hidden particle, you’d apply common sense and keep the states or density matrices different.
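Here’s a quick numerical check of that claim (purely illustrative, standard numpy): the two machine-plus-qubit states differ overall, but tracing out the hidden particle gives the same reduced density matrix for the emitted qubit.

```python
# The two machines produce different overall (hidden particle + qubit) states,
# but identical reduced density matrices once the hidden particle is traced out.
import numpy as np

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
plus, minus = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)
L, R = ket0, ket1  # hidden particle states |left>, |right>

psi1 = (np.kron(L, ket0) + np.kron(R, ket1)) / np.sqrt(2)   # |L,0> + |R,1>
psi2 = (np.kron(L, plus) + np.kron(R, minus)) / np.sqrt(2)  # |L,+> + |R,->

def reduced_qubit_state(psi):
    """Trace out the hidden particle (the first tensor factor)."""
    rho = np.outer(psi, psi.conj()).reshape(2, 2, 2, 2)  # indices: hidden, qubit, hidden', qubit'
    return np.einsum('iaib->ab', rho)

print(np.allclose(np.outer(psi1, psi1), np.outer(psi2, psi2)))  # False: different overall states
print(reduced_qubit_state(psi1))  # [[0.5, 0.], [0., 0.5]]
print(reduced_qubit_state(psi2))  # [[0.5, 0.], [0., 0.5]]
```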
Anyhow, yeah, broadly agree. Like I said, there’s a practical use for saying what’s “real” when you want to predict future physics. But you don’t always have to be doing that.
people who study very “fundamental” quantum phenomena increasingly use a picture with a thermal bath
Maybe talking about the construction of pointer states? That linked paper does it just as you might prefer, putting the Boltzmann distribution into a density matrix. But of course you could rephrase it as a probability distribution over states and the math goes through the same, you’ve just shifted the vibe from “the Boltzmann distribution is in the territory” to “the Boltzmann distribution is in the map.”
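To make the rephrasing concrete (this is just the standard thermal state, nothing specific to that paper): the density-matrix reading is

$$\rho_{\text{thermal}} \;=\; \sum_n \frac{e^{-\beta E_n}}{Z}\,|n\rangle\langle n|, \qquad Z = \sum_n e^{-\beta E_n},$$

and the map-flavored reading is “the system is in some energy eigenstate $|n\rangle$ with probability $e^{-\beta E_n}/Z$.” Every prediction comes out of $\mathrm{Tr}(\rho A)$ either way, so the math genuinely doesn’t care which vibe you pick.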
Still, as soon as you introduce the notion of measurement, you cannot get away from thermodynamics. Measurement is an inherently information-destroying operation, and iiuc can only be put “into theory” (rather than being an arbitrary add-on that professors tell you about) using the thermodynamic picture with nonunitary operators on density matrices.
Sure, at some level of description it’s useful to say that measurement is irreversible, just like at some level of description it’s useful to say entropy always increases. Just like with entropy, it can be derived from boundary conditions + reversible dynamics + coarse-graining. Treating measurements as reversible probably has more applications than treating entropy as reversible, somewhere in quantum optics / quantum computing.
Some combination of:
Interpretability
Just check if the AI is planning to do bad stuff, by learning how to inspect its internal representations.
Regularization
Evolution got humans who like Doritos more than health food, but evolution didn’t have gradient descent. Use regularization during training to penalize hidden reasoning (rough sketch at the end of this comment).
Shard / developmental prediction
Model-free RL will predictably use simple heuristics for the reward signal. If we can predict and maybe control how this happens, this gives us at least a tamer version of inner misalignment.
Self-modeling
Make it so that the AI has an accurate model of whether it’s going to do bad stuff. Then use this to get the AI not to do it.
Control
If inner misalignment is a problem when you use AIs off-distribution and give them unchecked power, then don’t do that.
Personally, I think the most impactful will be Regularization, then Interpretability.
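To gesture at what “use regularization to penalize hidden reasoning” could cash out to, here’s a minimal sketch; the specific penalty (an L1 term on hidden activations) is a placeholder assumption of mine, not a worked-out proposal:

```python
# Illustrative only: the L1 penalty on hidden activations is a stand-in for
# whatever "penalize hidden reasoning" actually cashes out to.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, d_out=10):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        hidden = torch.relu(self.encoder(x))
        return self.head(hidden), hidden

model = TinyNet()
task_loss_fn = nn.CrossEntropyLoss()
reg_strength = 1e-3  # how hard to lean against extra hidden structure

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
logits, hidden = model(x)
loss = task_loss_fn(logits, y) + reg_strength * hidden.abs().mean()
loss.backward()
```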
The real chad move is to put “TL;DR: See above^” for every section.
When you say there’s “no such thing as a state,” or “we live in a density matrix,” these are statements about ontology: what exists, what’s real, etc.
Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by “explain with ontology.”
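Concretely (standard notation, nothing novel), the extra power is the mixed-state decomposition

$$\rho \;=\; \sum_i p_i\,|\psi_i\rangle\langle\psi_i|, \qquad p_i \ge 0,\;\; \sum_i p_i = 1,$$

where the $p_i$ are exactly the probability distribution over states I’m talking about.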
Epistemology is about how we know stuff. If we start with a world that does not inherently have a probability distribution attached to it, but obtain a probability distribution from arguments about how we know stuff, that’s “explain with epistemology.”
In quantum mechanics, this would look like talking about anthropics, or what properties we want a measure to satisfy, or Solomonoff induction and coding theory.
What good is it to say things are real or not? One useful application is predicting the character of physical law. If something is real, then we might expect it to interact with other things. I do not expect the probability distribution of a mixed state to interact with other things.
Treating the density matrix as fundamental is bad because you shouldn’t explain with ontology that which you can explain with epistemology.
Be sad.
For topological debate that’s about two agents picking settings for simulation/computation, where those settings have a partial order that lets you take the “strictest” combination, a big class of fatal flaws would be if you don’t actually have the partial order you think you have within the practical range of the settings—i.e. if some settings you thought were more accurate/strict are actually systematically less accurate.
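To make “take the strictest combination” concrete, here is my own toy rendering (not anything from the post): settings are per-region length scales, ordered pointwise, and the combined run uses the finer bid everywhere.

```python
# Toy rendering (my framing, not the post's): a setting maps regions to length
# scales, smaller = finer = "stricter", and the combined setting takes the
# pointwise minimum of the two players' bids.
def strictest_combination(bid_a, bid_b):
    return {region: min(bid_a[region], bid_b[region]) for region in bid_a}

true_player  = {"left_half": 1e-3, "right_half": 1.0}
false_player = {"left_half": 1.0,  "right_half": 1e-3}
print(strictest_combination(true_player, false_player))
# {'left_half': 0.001, 'right_half': 0.001}
# The fatal flaw: if some smaller length scales are systematically *less*
# accurate, then this min() isn't actually "stricter" in the sense that matters.
```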
In the 1D plane example, this would be if some specific length scales (e.g. exact powers of 1000) cause simulation error, but as long as they’re rare, this is pretty easy to defend against.
In the fine-grained plane example, though, there’s a lot more room for fine-grained patterns in which parts of the plane get modeled at which length scale to start having nonlinear effects. If the agents are not allowed to bid “maximum resolution across the entire plane,” and instead are forced to allocate resources cleverly, then maybe you have a problem. But hopefully the truth is still advantaged, because the false player has to rely on fairly specific correlations, and the true player can maybe bid a bunch of noise that disrupts almost all of them.
(This makes possible a somewhat funny scene, where the operator expected the true player’s bid to look “normal,” and then goes to check the bids and both look like alien noise patterns.)
An egregious case would be where it’s harder to disrupt patterns injected during bids—e.g. if the players’ bids are ‘sparse’ / have finite support and might not overlap. Then the notion of the true player just needing to disrupt the false player seems a lot more unlikely, and both players might get pushed into playing very similar strategies that take every advantage of the dynamics of the simulator in order to control the answer in an unintended way.
I guess for a lot of “tough real world questions,” the difficulty of making a super-accurate simulator (one you even hope converges to the right answer) torpedoes the attempt before we have to start worrying about this kind of ‘fatal flaw’. But anything involving biology, human judgment, or too much computer code seems tough. “Does this gene therapy work?” might be something you could at least imagine a simulator for, but it still seems to give the false player lots of opportunities.
Fun post, even though I don’t expect debate of either form to see much use (because resolving tough real world questions offers too many chances for the equivalent of the plane simulation to have fatal flaws).
With bioweapons evals, at least, the profit motive of AI companies is aligned with the common interest here; a big benefit of your work comes when companies use it to improve their product. I’m not at all confused about why people would think this is useful safety work, even if I haven’t personally hashed out the cost/benefit to any degree of confidence.
I’m mostly confused about ML / SWE / research benchmarks.
The mathematical structure in common is called a “measure.”
I agree that there’s something mysterious-feeling about probability in QM, though I mostly think that feeling is an illusion. There’s a fact, famous among physicists, that the only way to put a ‘measure’ on a wavefunction with nice properties (e.g. conservation over time) is to take the amplitude squared. So there’s an argument: probability is a measure, and the only measure that makes sense is the amplitude-squared measure, therefore if probability is anything, it’s the amplitude squared. And it is! Feels mysterious.
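To spell out the “nice properties” step (just the textbook observation, nothing new): write $|\psi\rangle = \sum_i c_i |i\rangle$. Under unitary evolution the amplitude-squared weights stay normalized,

$$\sum_i |c_i|^2 = \langle\psi|\psi\rangle = \langle\psi|U^\dagger U|\psi\rangle = \text{const},$$

while alternatives like $\sum_i |c_i|$ or $\sum_i |c_i|^4$ are not conserved in general, so they can’t serve as a stable measure.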
But after getting more used to anthropics and information theory, you start to accumulate more arguments for the same thing that take it from a different angle, and it stops feeling so mysterious.
Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?
It’s not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I’m not clear.
But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.
But clearly other people think differently than me.
One big reason I might expect an AI to do a bad job at alignment research is if it doesn’t do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is humans aren’t that inconsistent about it and don’t disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because ‘be very convincing to your operators’ is an objective with a different kind of danger).
Another potential red flag is if the AI gives humans what they ask for even when that’s ‘dumb’ according to some sophisticated understanding of human values. This could definitely show up in string theory research (note when some ideas suggest non-string-theory paradigms might be better, and push back on the humans if the humans try to ignore this); it’s just intellectually difficult (maybe easier in loop quantum gravity research heyo gottem) and not as salient without the context of alignment and human values.
Thanks for the great reply :) I think we do disagree after all.
humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans
Except about that—here we agree.
Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values.
This might be summarized as “If humans are inaccurate, let’s strive to make them more accurate.”
I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren’t actually that consistent, even in what we’d consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And, how do you square a univariate notion of ‘accuracy’ with the sensitivity of human conclusions to semi-arbitrary changes to e.g. their reading lists, or the framings of arguments presented to them?).
Instead, I think our strategy should be “If humans are inconsistent and disagree, let’s strive to learn a notion of human values that’s robust to our inconsistency and disagreement.”
We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.
A committee of humans reviewing an AI’s proposal is, ultimately, a physical system that can be predicted. If you have an AI that’s good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.
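A cartoon of that structure, with every name hypothetical (this is just to make the “predicted committee replaces the real one” point concrete, not a claim about how any actual system is built):

```python
# Cartoon only: every name here is hypothetical. The point is structural; once
# the committee is predictable, it's just another term in the AI's planning loop.
def predicted_committee_verdict(predict, time, proposal):
    """The AI's internal prediction of Committee(time, proposal), as one more physical system."""
    return predict(system="committee_review", time=time, proposal=proposal)

def choose_action(predict, expected_value, proposals, time):
    approved = [p for p in proposals
                if predicted_committee_verdict(predict, time, p) == "approve"]
    # The flesh-and-blood committee never needs to convene for this loop to run.
    return max(approved, key=expected_value, default=None)
```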
(And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI’s decision-making.)
I basically think your sixth-to-last (or so) bullet point is key—an AI that takes over is likely to be using a lot more RL on real world problems, i.e. drawn from a different distribution than present-day AI. This will be worse for us than conditioning on a present-day AI taking over.
Cool stuff!
I’m a little confused what it means to mean-ablate each node...
Oh, wait. ctrl-f shows me the Non-Templatic data appendix. I see, so you’re tracking the average of each feature, at each point in the template. So you can learn a different mask at each token in the template and also learn a different mean (and hopefully your data distribution is balanced / high-entropy). I’m curious—what happens to your performance with zero-ablation (or global mean ablation, maybe)?
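For concreteness, here’s my reading of the three ablations I’m comparing, in toy form (function and variable names are mine, not the paper’s):

```python
# Toy version of the ablations I'm asking about; all names are mine, not the paper's.
import numpy as np

def ablate(activations, mask, replacement):
    """Keep masked-in features; replace masked-out features with `replacement`."""
    return mask * activations + (1 - mask) * replacement

acts = np.array([1.2, -0.3, 0.8])            # feature activations at one template position
mask = np.array([1.0, 0.0, 0.0])             # learned per-token mask: keep only feature 0

per_token_mean = np.array([0.9, -0.1, 0.5])  # mean at this template position (what I understand the paper uses)
global_mean    = np.array([0.4,  0.0, 0.2])  # mean over all positions
zeros          = np.zeros(3)

print(ablate(acts, mask, per_token_mean))    # per-token mean ablation
print(ablate(acts, mask, global_mean))       # global mean ablation
print(ablate(acts, mask, zeros))             # zero ablation
```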
Excited to see what you come up with for non-templatic tasks. Presumably on datasets of similar questions, similar attention-control patterns will be used, and maybe it would just work to (somehow) find which tokens are getting similar attention, and assign them the same mask.
It would also be interesting to see how this handles more MLP-heavy tasks like knowledge questions. maybe someone clever can find a template for questions about the elements, or the bibliographies of various authors, etc.
I certainly agree that Asimov’s three laws are not a good foundation for morality! Nor are any other simple set of rules.
So if that’s how you mean “value alignment,” yes let’s discount it. But let me sell you on a different idea you haven’t mentioned, which we might call “value learning.”[1]
Doing the right thing is complicated.[2] Compare this to another complicated problem: telling photos of cats from photos of dogs. You cannot write down a simple set of rules to tell apart photos of cats and dogs. But even though we can’t solve the problem with simple rules, we can still get a computer to do it. We show the computer a bunch of data about the environment and human classifications thereof, have it tweak a bunch of parameters to make a model of the data, and hey presto, it tells cats from dogs.
Learning the right thing to do is just like that, except for all the ways it’s different that are still open problems:
Humans are inconsistent and disagree with each other about the right thing more than they are inconsistent/disagree about dogs and cats.
If you optimize for doing the right thing, this is a bit like searching for adversarial examples, a stress test that the dog/cat classifier didn’t have to handle.
When building an AI that learns the right thing to do, you care a lot more about trust than when you build a dog/cat classifier.
This margin is too small to contain my thoughts on all these.
There’s no bright line between value learning and techniques you’d today lump under “reasonable compliance.” Yes, the user experience is very different between (e.g.) an AI agent that’s operating a computer for days or weeks vs. a chatbot that responds to you within seconds. But the basic principles are the same—in training a chatbot to behave well you use data to learn some model of what humans want from a chatbot, and then the AI is trained to perform well according to the modeled human preferences.
The open problems for general value learning are also open problems for training chatbots to be reasonable. How do you handle human inconsistency and disagreement? How do you build trust that the end product is actually reasonable, when that’s so hard to define? Etc. But the problems have less “bite,” because less can go wrong when your AI is briefly responding to a human query than when your AI is using a computer and navigating complicated real-world problems on its own.
You might hope we can just say value learning is hard, and not needed anyhow because chatbots need it less than agents do, so we don’t have to worry about it. But the chatbot paradigm is only a few years old, and there is no particular reason it should be eternal. There are powerful economic (and military) pressures towards building agents that can act rapidly and remain on-task over long time scales. AI safety research needs to anticipate future problems and start work on them ahead of time, which means we need to be prepared for instilling some quite ambitious “reasonableness” into AI agents.
For a decent introduction from 2018, see this collection.
Citation needed.