Some thoughts on your excellent comment:
First, I fixed the link issue you saw.
I think the potential difference between you and me on whether synthetic data works to box an AI is whether the AI notices it’s in a simulation made via synthetic data. Also, the technique isn’t meant to be applied only after training; it’s applied continuously throughout the training process.
I agree that if we had an AGI/ASI that was already misaligned, we’d have to take pretty extreme actions like mindwiping its memories and restarting the entire training process, but the point of synthetic data is to get it into a basin of alignment/corrigibility early on, before it can be deceptive.
I also think that real data will only be given to AGIs at the end of training, as a way to ground them, so an AGI has no real way to know whether it’s subtly being changed in training or whether it’s acting in reality, since we control its data sources.
Controlling an AI’s data sources is a powerful way to control its values and capabilities, which is why I think the tax for synthetic-data alignment is actually pretty low.
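To make that concrete, here’s a minimal sketch of what I mean by applying synthetic data continuously throughout training rather than only at the end. Everything in it (the pool names, the mixing ratios, the phase schedule, the commented-out update step) is an illustrative assumption, not an actual training stack:

```python
import random

def make_batch(synthetic_pool, real_pool, synthetic_fraction, batch_size=32):
    """Sample a batch with a controlled share of curated synthetic alignment data."""
    n_synthetic = round(batch_size * synthetic_fraction)
    batch = random.sample(synthetic_pool, n_synthetic)
    batch += random.sample(real_pool, batch_size - n_synthetic)
    random.shuffle(batch)
    return batch

def synthetic_fraction_schedule(num_phases=10):
    """Curated values/corrigibility data dominates early; real data only enters late."""
    for phase in range(num_phases):
        yield 1.0 if phase < num_phases - 1 else 0.5

# Usage sketch: the pools and the commented-out train_step are stand-ins.
synthetic_pool = [f"synthetic values/corrigibility example {i}" for i in range(1000)]
real_pool = [f"uncurated real-world document {i}" for i in range(1000)]

for frac in synthetic_fraction_schedule():
    batch = make_batch(synthetic_pool, real_pool, synthetic_fraction=frac)
    # train_step(model, batch)  # the actual gradient update is out of scope here
```

The point of the schedule is just that the model’s entire picture of “reality” comes from data we curate until the very end of training.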
Re Response to Lethality 6, I’m honestly coming around to your position as I think about it more and more, at least to the extent that I think your arguments are plausible and we need more research on that.
Re Response to Lethality 10, I was relying on both empirical evidence from today’s models and some theoretical reasons for why the phenomenon of alignment generalizing further than capabilities exists in general.
On the alignment stability problem, I like the post, and we should plausibly do interventions to stabilize alignment once we get it.
Re Response to Lethality 15, I agree that fast capability progress will happen, but I deny the implication: large synthetic datasets on values/instruction following would already be in the AI by the time that fast capabilities progress happened, because synthetic data about values is pretrained in very early, and I’m also more optimistic about alignment generalization than you are.
I liked your Real AGI post by the way.
Response to Lethalities 16 and 17: I agree that the analogy with evolution doesn’t go very far, since we’re a lot smarter and have much better tools for alignment than evolution. We don’t have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won’t get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.
One concrete way we have better tools than evolution is that we have far more control over what an AI’s data sources are, and more generally far more inspectability and controllability over its data, especially the synthetic kind. That means we don’t have to create very realistic simulations: for all the AI knows, we might be elaborately fooling it to reveal itself, and until the very end of training it probably doesn’t even have specific data about our reality.
Re Response to Lethality 21, you are exactly correct on what I meant.
Re Response to Lethality 22, on this:
I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.
You’re not wrong that they got human capabilities very wrong; see this post for details:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
But I’d also argue that this has implications for how complex human values actually are.
On this:
Thus, I halfway agree that people get their values largely from the environment. I think we get our values largely from the environment and get our values largely from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.
Yeah, this seems like a crux. I think a lot of how values are learned is via quite weak priors from evolutionary drives (the big one probably being which algorithm we use for being intelligent), and I put far more weight on socialization/environment data than you do: closer to 5-10% evolution at best, with 85-90% of our values determined by data and culture.
But at any rate, since AIs will be heavily influenced by their data, it’s tractable to influence their values.
After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).
Agree with this mostly, though culture is IMO the best explanation for why humans succeed.
So, what is the “simple core” of human values you mention? Is it what people have written about human values? I’d pretty much agree that that’s a usable core, even if it’s not simple.
Yes, I am talking about what people have written about human values, but I’m also talking about future synthetic data in which we write about the values we want the AI to have, and about reward information as a simple core.
One of my updates from Constitutional AI and GPT-4 handling what we value pretty well is that the claim that value is complicated is mostly untrue, and in general I updated hard against evopsych explanations of what humans value, how humans got their capabilities, and more, since the data we got is very surprising under evopsych hypotheses and less surprising under Universal Learning Machine hypotheses.
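For concreteness, the kind of loop I have in mind when I say we can largely write the values down is roughly the Constitutional AI recipe: the model critiques and revises its own outputs against written principles, and the revisions become training data. This is only a sketch; `generate` is a stand-in for a real model call and the principles are illustrative:

```python
def generate(prompt: str) -> str:
    """Stand-in for an actual language-model call."""
    raise NotImplementedError

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that assist with deception or harm.",
]

def revise_with_constitution(user_prompt: str) -> dict:
    """Draft a response, then critique and revise it against each written principle."""
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response using the principle: {principle}\n\nResponse: {draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {draft}"
        )
    # The (prompt, revised response) pair becomes supervised training data on values.
    return {"prompt": user_prompt, "response": draft}
```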
I agree with all of this for the record:
I very much agree with Yudkowsky’s framing here:
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigibility approach is also doomed.
Re response to lethalities 28 and 29, I think you meant that you totally disagree, and I agree we’d probably be boned if a misaligned AGI/ASI was running loose, but my point is that the verification/generation gap that pervades so many fields is also likely to apply to alignment research: it’s easier to verify whether research is correct than to produce it yourself.
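A toy illustration of the gap I mean, where checking a candidate answer is far cheaper than finding one (subset sum here is just a stand-in for “hard research problem”):

```python
from itertools import combinations

def generate_solution(nums, target):
    """Generation: brute-force search over subsets, exponential in len(nums)."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

def verify_solution(nums, target, candidate):
    """Verification: a single cheap pass over the proposed answer."""
    return (candidate is not None
            and all(x in nums for x in candidate)
            and sum(candidate) == target)

nums = [3, 34, 4, 12, 5, 2]
answer = generate_solution(nums, 9)       # hard: had to search many subsets
print(verify_solution(nums, 9, answer))   # easy: True, checked in one pass
```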
Re response to Lethality 32:
WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:
It isn’t, except if you want it to be and work to make sure it’s communicating your thoughts well enough.
There’s some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts).
There is definitely a chance that RL or other processes will make their use of language diverge more from their thoughts, so I’m a little worried about that, but I do think that AIs’ words convey their thoughts, at least for current LLMs.
39. I figured this stuff out using the null string as input,
Yudkowsky may very well be the smartest human whose thought I’ve personally encountered. That doesn’t make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud’s authority on the topic).
I think my difference with you is that I consider his models and arguments for AI doom essentially irreparable, mostly because reality has invalidated his core assumptions about how AGIs/ASIs work, and about how human capabilities and values work and are learned, so I don’t think Yud’s authority on the topic earns him any epistemic points.
My point was basically that you cannot figure out anything using the null string as input, for the same reason you cannot update on no evidence of something happening as if it were evidence that it happened.
Agree with the rest of it though.
Thanks for your excellent comment, which gave me lots of food for thought.