One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the coordination problem isn’t as difficult, because you might just need to get the smallish pool of leading actors to coordinate for a while, rather than everyone to coordinate indefinitely.
I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction.
Even granting that, do you think the same applies to the cognition of an AI created using deep learning—is it approximating Solomonoff induction when presented with a new problem at inference time?
I think it’s not, for reasons like the ones in aysja’s comment.
Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.
I wrote it out as a post here.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Fair enough, yeah—this seems like a very reasonable angle of attack.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren’t Hypothesis 1: Written goal specification, Hypothesis 2: Developer-intended goals, and Hypothesis 3: Unintended version of written goals and/or human intentions all compatible with either kind of AI?
Hypothesis 4: Reward/reinforcement does assume a consequentialist, and so does Hypothesis 5: Proxies and/or instrumentally convergent goals as written, although it seems like ‘proxy virtues’ could maybe be a thing too?
(Unrelatedly, it’s not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I’m missing something).
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits?” or “Principles?” Or “Virtues.”
I probably wouldn’t prefer any of those to goals. I might use “Motivations”, but I also think it’s ok to use goals in this broader way and “consequentialist goals” when you want to make the distinction.
One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.
In this post you use ‘goals’ in quite a broad way, so as to include stuff like virtues (e.g. “always be honest”). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it’s motivated primarily by things like virtues, habits, or rules.
This would be the most important axis to hypothesise about if it were the case that instrumental convergence applies to consequentialist goals but not to things like virtues. Like, I think it’s plausible that
(i) if you get an AI with a slightly wrong consequentialist goal (e.g. “maximise everyone’s schmellbeing”) then you get paperclipped because of instrumental convergence,
(ii) if you get an AI that tries to embody a slightly wrong virtue (e.g. “always be schmonest”) then it’s badly dysfunctional but doesn’t directly entail a disaster.
And if that’s correct, then we should care about the question “Will the AI’s goals be consequentialist ones?” more than most questions about them.
You know you’re feeling the AGI when a compelling answer to “What’s the best argument for very short AI timelines?” lengthens your timelines
Interesting. My handwavey rationalisation for this would be something like:
there’s some circuitry in the model which is responsible for checking whether a trigger is present and activating the triggered behaviour
for simple triggers, the circuitry is very inactive in the absence of the trigger, so it’s unaffected by normal training
for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present, so it’s more affected by normal training
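To gesture at the intuition with a toy sketch (entirely made up by me, not anything from the actual backdoor setups being discussed): a ReLU detector unit that never activates on normal inputs receives zero gradient from normal training, while a detector that is partially active on normal inputs gets pushed around.

```python
# Toy sketch: compare the gradient that "normal" (trigger-free) data sends into
# a detector unit that is silent on normal inputs vs one that is partially active.
import torch

torch.manual_seed(0)
x = torch.randn(1000, 10)  # "normal" inputs, no trigger present

# Simple-trigger detector: large negative bias, so it essentially never fires on normal data.
w_simple = torch.randn(10, requires_grad=True)
b_simple = torch.tensor(-20.0, requires_grad=True)

# Complex-trigger detector: has to aggregate weak evidence, so it fires
# part of the time on normal data too.
w_complex = torch.randn(10, requires_grad=True)
b_complex = torch.tensor(0.0, requires_grad=True)

def total_activation(w, b):
    return torch.relu(x @ w + b).sum()

# Pretend both detectors feed into some loss computed on normal data only.
loss = total_activation(w_simple, b_simple) + total_activation(w_complex, b_complex)
loss.backward()

print(w_simple.grad.norm())   # ~0: the silent detector is untouched by normal training
print(w_complex.grad.norm())  # clearly nonzero: the active detector gets modified
```

The rough idea being that how much the trigger-checking circuitry overlaps with what’s active on normal data determines how much normal training disturbs it.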
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there’s a sort of no free lunch theorem: either you don’t act on the outputs of the oracle at all, or it’s just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about powerful agents is that their outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you’re going to take in response to its outputs, an agent’s outputs have to be different. But the point of an oracle is to not have that property: its outputs are decided by a criterion (something like truth) that is independent of the actions you’re going to take in response[1]. So if you respond differently to the outputs, they cause different outcomes. Assuming you’ve succeeded at building the oracle to specification, it’s clearly not the case that the oracle has the worrying property of agents just because you act on its outputs.
I don’t disagree that by either hooking the oracle up in a scaffolded feedback loop with the environment, or getting it to output plans, you could extract more agency from it. Of the two, I think the scaffolding can in principle easily produce dangerous agency in the same way long-horizon RL can, but that the version where you get it to output a plan is much less worrying (I can argue for that in a separate comment if you like).
[1] I’m ignoring the self-fulfilling prophecy case here.
“It seems silly to choose your values and behaviors and preferences just because they’re arbitrarily connected to your social group.”
If you think this way, then you’re already on the outside.
I don’t think this is true — your average person would agree with the quote (if asked) and deny that it applies to them.
Finetuning generalises a lot but not to removing backdoors?
Seems like we don’t really disagree.
The arguments in the paper are representative of Yoshua’s views rather than mine, so I won’t directly argue for them, but I’ll give my own version of the case against
the distinctions drawn here between RL and the science AI all break down at high levels.
It seems like common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you’re training and the outcomes being achieved.
At the top of the spectrum, you have systems trained based on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories.
Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories which are less tightly coupled with the outside world.
And at the bottom of the spectrum you have systems which are trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated across time very far at all.
At the top of the spectrum, if you train a competent system it seems almost guaranteed that it’s a powerful agent. It’s a machine for pushing the world into certain configurations. But at the bottom of the spectrum it seems much less likely: its input-output behaviour wasn’t selected to be effective at causing certain outcomes.
Yes there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don’t think they’re representative. And yes depending on what kind of a system it is you might be able to turn it into an agent using a bit of scaffolding, but if you have the choice not to, that’s an importantly different situation compared to the top of the spectrum.
And yes, it seems possible such setups lead to an agentic shoggoth completely by accident; we don’t understand enough to rule that out. But I don’t see how you end up judging the probability that we get a highly agentic system to be more or less the same wherever we are on the spectrum (if you do)? Or perhaps it’s just that you think the distinction is not being handled carefully in the paper?
Pre-training, finetuning and RL are all types of training. But sure, expand ‘train’ to ‘create’ in order to include anything else like scaffolding. The point is that it’s not what you do in response to the system’s outputs that matters; it’s what the system tries to do.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
One example might be if we have interp methods that measure correlates of scheming. Those would be incredibly easy to train away, and still possible to validate away, but enough harder that the ratio of non-schemers you end up with is higher than if you had trained against the signal, which wouldn’t affect the ratio at all.
A separate argument is that I think if you just do random search over training ideas, rejecting them if they don’t get a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top-level post.
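To make the random-search version concrete, here’s a minimal sketch (every function name and number is hypothetical; the real validation metric could be interp-based, behavioural, whatever). The metric only ever gates which training runs you keep; no gradient or reward is ever computed from it.

```python
import random

THRESHOLD = 0.1  # hypothetical acceptance bar on the validation metric

def sample_training_setup():
    # Sample a candidate training recipe (technique, hyperparameters, data mix, ...).
    return {"technique": random.choice(["A", "B", "C"]), "lr": 10 ** random.uniform(-5, -3)}

def train_model(setup):
    # Placeholder for actually training a model with the given setup.
    return {"setup": setup}

def scheming_score(model):
    # Placeholder for a correlate of scheming; used only for validation,
    # never as a training signal.
    return random.random()

accepted = []
for _ in range(20):
    setup = sample_training_setup()
    model = train_model(setup)
    if scheming_score(model) < THRESHOLD:  # reject runs whose models fail validation
        accepted.append((setup, model))
```

The key property is that any pressure on the metric comes only from selection over a handful of whole runs, not from gradient descent directly against it.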
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say that this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation vs training point might cause you a nontrivial update, then?
For some reason I’ve been muttering the phrase, “instrumental goals all the way up” to myself for about a year, so I’m glad somebody’s come up with an idea to attach it to.
Agree that pauses are a clearer line. But even if a pause and a tool-limit are both temporary, we should expect the full pause to have to last longer.