Likewise, GPT-style models should have no trouble learning some model with human values embedded in it. But that embedding will not necessarily be simple; there won’t just be a neuron that lights up in response to humans having their values met. The model will have a notion of human values embedded in it, but it won’t actually use “human values” as an abstract object in its internal calculations; it will work with some lower-level “components” which themselves implement/embed human values.
If it’s read moral philosophy, it should have some notion of what the words “human values” mean.
In any case, I still don’t understand what you’re trying to get at. Suppose I pretrain a neural net to differentiate lots of non-marsupial animals. It doesn’t know what a koala looks like, but it has some lower-level “components” which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.
This is actually a tougher scenario than what you’re describing (GPT will have seen human values yet the pretrained net hasn’t seen koalas in my hypothetical), but it’s a boring application of transfer learning.
Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.
Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.
I generally agree with this. The things I’m saying about human values also apply to koala classification. As with koalas, I do think there’s probably a wide range of parameters which would end up using the “right” level of abstraction for human values to be “natural”. On the other hand, for both koalas and humans, we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute—again, because Bayesian updates on low-level physics are just better in terms of predictive power.
Right now, we have no idea when that line will be crossed—just an extreme upper bound. We have no idea how wide/narrow the window of training parameters is in which either “koalas” or “human values” is a natural level of abstraction.
It doesn’t know what a koala looks like, but it has some lower-level “components” which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.
Ability to differentiate marsupials does not imply that the system is directly using the concept of koala. Yet again, consider how Bayesian updates on low-level physics would respond to the marsupial-differentiation task: it would model the entire physical process which generated the labels on the photos/videos. “Physical process which generates the label koala” is not the same as “koala”, and the system can get higher predictive power by modelling the former rather than the latter.
When we move to human values, that distinction becomes a lot more important: “physical process which generates the label ‘human values satisfied’” is not the same as “human values satisfied”. Confusing those two is how we get Goodhart problems.
We don’t need to go all the way to low-level physics models in order for all of that to apply. In order for a system to directly use the concept “koala”, rather than “physical process which generates the label koala”, it has to be constrained on compute in a way which makes the latter too expensive—despite the latter having higher predictive power on the training data. Adding in transfer learning on some lower-level components does not change any of that; it should still be possible to use those lower-level components to model the physical process which generates the label koala without directly reasoning about koalas.
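To make the koala/label distinction concrete, here is a minimal toy sketch. Everything in it is a made-up assumption for illustration: the `Photo` fields, and a hypothetical `labeler` that tags every blurry photo as a koala. It only shows that a model of the labelling process can score higher against the labels than a model of the concept, while being worse at tracking actual koalas.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Photo:
    is_koala: bool   # the thing we actually care about
    is_blurry: bool  # an incidental detail of how labels get produced

def labeler(photo):
    # Hypothetical labelling process (an assumption): blurry photos
    # always get tagged "koala", whether or not a koala is present.
    return photo.is_koala or photo.is_blurry

def concept_model(photo):
    # Uses the concept "koala" directly.
    return photo.is_koala

def process_model(photo):
    # Models the process which generates the label "koala".
    return photo.is_koala or photo.is_blurry

def accuracy(model, photos, target):
    # Fraction of photos on which the model matches the target function.
    return sum(model(p) == target(p) for p in photos) / len(photos)

def truth(photo):
    return photo.is_koala

# One photo for each combination of the two fields.
photos = [Photo(k, b) for k in (True, False) for b in (True, False)]

print("vs labels:", accuracy(concept_model, photos, labeler),
      accuracy(process_model, photos, labeler))
print("vs truth: ", accuracy(concept_model, photos, truth),
      accuracy(process_model, photos, truth))
```

Scored against the training labels, the process model wins; scored against what we wanted, it loses. That gap is the Goodhart problem in miniature.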
I’ve now written essentially the same response at least four times to your objections, so I recommend applying the general pattern yourself:
Consider how Bayesian updates on a low-level physics model would behave on whatever task you’re considering. What would go wrong?
Next, imagine a more realistic system (e.g. current ML systems) failing in an analogous way. What would that look like?
What’s preventing ML systems from failing in that way already? The answer is probably “they don’t have enough compute to get higher predictive power from a less abstract model”—which means that, if things keep scaling up, sooner or later that failure will happen.
You say: “we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute”. I think this depends on specific details of how the system is engineered.
“Physical process which generates the label koala” is not the same as “koala”, and the system can get higher predictive power by modelling the former rather than the latter.
Suppose we use classification accuracy as our loss function. If all the koalas are correctly classified by both models, then the two models have equal loss function scores. I suggested that at that point, we use some kind of active learning scheme to better specify the notion of “koala” or “human values” or whatever it is that we want. Or maybe just be conservative, and implement human values in a way that all our different notions of “human values” agree with.
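A rough sketch of what I have in mind, with the caveat that the two threshold models, the training points, and the unlabeled pool are all invented for illustration. Both models get the same accuracy on the training data, so the disagreement between them is used to pick which points to ask about, and the conservative option is to act only where they agree.

```python
# Two hypothetical models of the same concept (assumed for illustration).
def model_a(x):
    return x >= 5

def model_b(x):
    return x >= 7

# Training data on which the two models cannot be told apart.
train = [(1, False), (2, False), (3, False), (8, True), (9, True)]

def accuracy(model, data):
    # Classification accuracy of a model on labeled (input, label) pairs.
    return sum(model(x) == y for x, y in data) / len(data)

# Active learning by disagreement: ask for labels only where models differ.
pool = [4, 6, 10]
queries = [x for x in pool if model_a(x) != model_b(x)]

def conservative(x):
    # Return a prediction only when every model agrees; otherwise abstain.
    a, b = model_a(x), model_b(x)
    return a if a == b else None

print(accuracy(model_a, train), accuracy(model_b, train))
print("points to ask about:", queries)
print("conservative answers:", [conservative(x) for x in pool])
```

Note that this only separates hypotheses which disagree somewhere in the pool; it says nothing about models which agree on everything we can query.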
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way. My expectation is that human brains have many different computational notions of any given concept, similar to an ensemble (for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is), and AGI will work the same way (at least, that’s how I would design it!)
I’ve now written essentially the same response at least four times to your objections
I was trying to understand what you were getting at. This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments? (I don’t necessarily think it is bad for arguments to shift, I just think people should acknowledge that’s going on.)
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way.
It’s certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:
if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.
In one sense, the goal of all this abstract theorizing is to identify what those other criteria need to be in order to reliably end up using the “right” abstractions in the way we want. We could probably make up some ad-hoc criteria which work at least sometimes, but then as architectures and hardware advance over time we have no idea when those criteria will fail.
for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”, which reveals you have more than one way of knowing what “a sandwich” is
(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich—which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.
Also, even if there’s some sort of ensembling, the concept “sandwich” still needs to specify one particular ensemble.
This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments?
We’ve shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We’ve shifted to talking about what alignment means in general, and what’s hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.
Prompts are a tool/interface via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc). Our current discussion is about the properties of a certain kind of goal: goals which are abstract in an analogous way to human values.
if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.
Optimize for having a diverse range of models that all seem to fit the data.
If it’s read moral philosophy, it should have some notion of what the words “human values” mean.
GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase “human values”, since moral philosophy is written by (confused) humans, and in human-written text the phrase “human values” is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.
This is essentially the “tasty ice cream flavors” problem, am I right? Trying to check if we’re on the same page.
If so: John Wentworth said
“Tasty ice cream flavors” is also a natural category if we know who the speaker is
So how about instead of talking about “human values”, we talk about what a particular moral philosopher endorses saying or doing, or even better, what a committee of famous moral philosophers would endorse saying/doing.
No, this is not the “tasty ice cream flavors” problem. The problem there is that the concept is inherently relative to a person. That problem could apply to “human values”, but that’s a separate issue from what dxu is talking about.
The problem is that “what a committee of famous moral philosophers would endorse saying/doing”, or human written text containing the phrase “human values”, is a proxy for human values, not a direct pointer to the actual concept. And if a system is trained to predict what the committee says, or what the text says, then it will learn the proxy, but that does not imply that it directly uses the concept.
Well, the moral judgements of a high-fidelity upload of a benevolent human are also a proxy for human values—an inferior proxy, actually. Seems to me you’re letting the perfect be the enemy of the good.
It doesn’t matter how high-fidelity the upload is or how benevolent the human is, I’m not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. “Don’t let the perfect be the enemy of the good” is advice for writing emails and cleaning the house, not nuclear security.
The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism.
Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually have. They’d probably be better than many of the worst-case scenarios, but they still wouldn’t be a best or even good scenario. Humans just don’t have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.
It doesn’t matter how high-fidelity the upload is or how benevolent the human is, I’m not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that.
Here are some of the people who have the power to set off nukes right now:
“A good plan violently executed now is better than a perfect plan executed at some indefinite time in the future.”—George Patton
Just because it’s in your nature (and my nature, and the nature of many people who read this site) to be a cautious nerd, does not mean that the cautious nerd orientation is always the best orientation to have.
In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome. It’s a classic motte-and-bailey:
“It’s very hard to build an AGI which isn’t a paperclipper!”
“Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI...”
“Yeah but we gotta be super perfectionistic because there is so much at stake!”
Your final “humans will misuse AI” worry may be justified, but I think naive deployment of this worry is likely to be counterproductive. Suppose there are two types of people, “cautious” and “incautious”. Suppose that the “humans will misuse AI” worry discourages cautious people from developing AGI, but not incautious people. So now we’re in a world where the first AGI is most likely controlled by incautious people, making the “humans will misuse AI” worry even more severe.
Humans just don’t have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.
If you’re willing to grant the premise of the technical alignment problem being solved, shooting oneself in the foot would appear to be much less of a worry, because you can simply tell your FAI “please don’t let me shoot myself in the foot too badly”, and it will prevent you from doing that.
“It’s very hard to build an AGI which isn’t a paperclipper!”
“Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI...”
“Yeah but we gotta be super perfectionistic because there is so much at stake!”
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper. Yes, there are straightforward ways one might be able to create a helpful non-paperclipper AGI. But that “might” is carrying a lot of weight. All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don’t know exactly what those parameter ranges are.
It’s sort of like saying:
“It’s very hard to design a long bridge which won’t fall down!”
“Well actually here are some straightforward ways one might be able to create a long non-falling-down bridge...” <shows picture of a wooden truss>
What I’m saying is, that truss design is 100% going to fail once it gets big enough, and we don’t currently know how big that is. When I say “it’s hard to design a long bridge which won’t fall down”, I do not mean a bridge which might not fall down if we’re lucky and just happen to be within the safe parameter range.
In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome.
These are sufficient conditions for a careful strategy to make sense, not necessary conditions. Here’s another set of sufficient conditions, which I find more realistic: the gains to be had in reducing AI risk are binary. Either we find the “right” way of doing things, in which case risk drops to near-zero, or we don’t, in which case it’s a gamble and we don’t have much ability to adjust the chances/payoff. There are no significant marginal gains to be had.
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper.
This is simultaneously
a major retreat from the “default outcome is doom” thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that’s 99.9% likely to be safe, which is very much incompatible with “default outcome is doom”)
unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn’t good enough for you)
You’ve picked a position vaguely in between the motte and the bailey and said “the motte and the bailey are both equivalent to this position!” That doesn’t look at all true to me.
All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don’t know exactly what those parameter ranges are.
This is a very strong claim which to my knowledge has not been well-justified anywhere. Daniel K agreed with me the other day that there isn’t a standard reference for this claim. Do you know of one?
There are a couple problems I see here:
Simple is not the same as obvious. Even if someone at some point tried to think of every obvious solution and justifiably discarded them all, there are probably many “obvious” solutions they didn’t think of.
Nothing ever gets counted as evidence against this claim. Simple proposals get rejected on the basis that everyone knows simple proposals won’t work.
A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety. Maybe there are good arguments for that, but the problem is that if you’re not careful, your view of reality is gonna get distorted. Which means community wisdom on claims such as “simple solutions never work” is likely to be systematically wrong. “Everyone knows X”, without a good written defense of X, or a good answer to “what would change the community’s mind about X”, is fertile ground for information cascades etc. And this is on top of standard ideological homophily problems (the AI safety community is a very self-selected subset of the broader AI research world).
What I’m saying is, that truss design is 100% going to fail once it gets big enough, and we don’t currently know how big that is. When I say “it’s hard to design a long bridge which won’t fall down”, I do not mean a bridge which might not fall down if we’re lucky and just happen to be within the safe parameter range.
My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks. This is logically rude. And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not. From my perspective, you’ve pulled this conversational move multiple times in this thread. It seems to be pretty common when I have discussions with AI safety people. That’s part of why I find the discussions so frustrating. My view is that this is a cultural problem which has to be solved for the AI safety community to do much useful AI safety work (as opposed to “complaining about how hard AI safety is” work, which is useful but insufficient).
Anyway, I’ll let you have the last word in this thread.
For what it’s worth, my perception of this thread is the opposite of yours: it seems to me John Wentworth’s arguments have been clear, consistent, and easy to follow, whereas you (John Maxwell) have been making very little effort to address his position, instead choosing to repeatedly strawman said position (and also repeatedly attempting to lump in what Wentworth has been saying with what you think other people have said in the past, thereby implicitly asking him to defend whatever you think those other people’s positions were).
Whether you’ve been doing this out of a lack of desire to properly engage, an inability to comprehend the argument itself, or some other odd obstacle is in some sense irrelevant to the object-level fact of what has been happening during this conversation. You’ve made your frustration with “AI safety people” more than clear over the course of this conversation (and I did advise you not to engage further if that was the case!), but I submit that in this particular case (at least), the entirety of your frustration can be traced back to your own lack of willingness to put forth interpretive labor.
To be clear: I am making this comment in this tone (which I am well aware is unkind) because there are multiple aspects of your behavior in this thread that I find not only logically rude, but ordinarily rude as well. I more or less summarized these aspects in the first paragraph of my comment, but there’s one particularly onerous aspect I want to highlight: over the course of this discussion, you’ve made multiple references to other uninvolved people (either with whom you agree or disagree), without making any effort at all to lay out what those people said or why it’s relevant to the current discussion. There are two examples of this from your latest comment alone:
Daniel K agreed with me the other day that there isn’t a standard reference for this claim. [Note: your link here is broken; here’s a fixed version.]
A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety.
Ignoring the question of whether these two quoted statements are true (note that even the fixed version of the link above goes only to a top-level post, and I don’t see any comments on that post from the other day), this is counterproductive for a number of reasons.
Firstly, it’s inefficient. If you believe a particular statement is false (and furthermore, that your basis for this belief is sound), you should first attempt to refute that statement directly, which gives your interlocutor the opportunity to either counter your refutation or concede the point, thereby moving the conversation forward. If you instead counter merely by invoking somebody else’s opinion, you both increase the difficulty of answering and end up offering weaker evidence.
Secondly, it’s irrelevant. John Wentworth does not work at MIRI (neither does Daniel Kokotajlo, for that matter), so bringing up aspects of MIRI’s position you dislike does nothing but highlight a potential area where his position differs from MIRI’s. (I say “potential” because it’s not at all obvious to me that you’ve been representing MIRI’s position accurately.) In order to properly challenge his position, again it becomes more useful to critique his assertions directly rather than round them off to the closest thing said by someone from MIRI.
Thirdly, it’s a distraction. When you regularly reference a group of people who aren’t present in the actual conversation, repeatedly make mention of your frustration and “grumpiness” with those people, and frequently compare your actual interlocutor’s position to what you imagine those people have said, all while your actual interlocutor has said nothing to indicate affiliation with or endorsement of those people, it doesn’t paint a picture of an objective critic. To be blunt: it paints a picture of someone who has a one-sided grudge against the people in question, and who is attempting to inject that grudge into conversations where it shouldn’t be present.
I hope future conversations can be more pleasant than this.
I appreciate the defense and agree with a fair bit of this. That said, I’ve actually found the general lack of interpretive labor somewhat helpful in this instance—it’s forcing me to carefully and explicitly explain a lot of things I normally don’t, and John Maxwell has correctly pointed out a lot of seeming-inconsistencies in those explanations. At the very least, it’s helping make a lot of my models more explicit and legible. It’s mentally unpleasant, but a worthwhile exercise to go through.
I think I want John to feel able to have this kind of conversation when it feels fruitful to him, and not feel obligated to do so otherwise. I expect this is the case, but just wanted to make it common knowledge.
This is a very strong claim which to my knowledge has not been well-justified anywhere. Daniel K agreed with me the other day that there isn’t a standard reference for this claim. Do you know of one?
There isn’t a standard reference because the argument takes one sentence, and I’ve been repeating it over and over again: what would Bayesian updates on low-level physics do? That’s the unique solution with best-possible predictive power, so we know that anything which scales up to best-possible predictive power in the limit will eventually behave that way.
(BTW I think that link is dead)
My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks. This is logically rude. And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not.
The “what would Bayesian updates on a low-level model do?” question is exactly the argument that the bridge design cannot be extended indefinitely, which is why I keep bringing it up over and over again.
This does point to one possibly-useful-to-notice ambiguous point: the difference between “this method would produce an aligned AI” vs “this method would continue to produce aligned AI over time, as things scale up”. I am definitely thinking mainly about long-term alignment here; I don’t really care about alignment on low-power AI like GPT-3 except insofar as it’s a toy problem for alignment of more powerful AIs (or insofar as it’s profitable, but that’s a different matter).
I’ve been less careful than I should be about distinguishing these two in this thread. All these things which we’re saying “might work” are things which might work in the short term on some low-power AI, but will definitely not work in the long term on high-power AI. That’s probably part of why it seems like I keep switching positions—I haven’t been properly distinguishing when we’re talking short-term vs long-term.
A second comment on this:
instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks
If we want to make a piece of code faster, the first step is to profile the code to figure out which step is the slow one. If we want to make a beam stronger, the first step is to figure out where it fails. If we want to extend a bridge design, the first step is to figure out which piece fails under load if we just elongate everything.
Likewise, if we want to scale up an AI alignment method, the first step is to figure out exactly how it fails under load as the AI’s capabilities grow.
I think you currently do not understand the failure mode I keep pointing to by saying “what would Bayesian updates on low-level physics do?”. Elsewhere in the thread, you said that optimizing “for having a diverse range of models that all seem to fit the data” would fix the problem, which is my main evidence that you don’t understand the problem. The problem is not “the data underdetermines what we’re asking for”, the problem is “the data fully determines what we’re asking for, and we’re asking for a proxy rather than the thing we actually want”.
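A toy sketch of that last point, under made-up assumptions: inputs are just the pair `(is_koala, is_blurry)`, and a hypothetical labelling process tags anything blurry as a koala. Enumerating every boolean function on those inputs and keeping the ones which fit the labels leaves exactly one hypothesis, and it is the proxy rather than the concept. There is no diversity left to optimize for.

```python
from itertools import product

# All possible inputs: (is_koala, is_blurry). Both fields are assumptions.
inputs = list(product([False, True], repeat=2))

def label(is_koala, is_blurry):
    # Hypothetical labelling process: blurry photos get tagged "koala".
    return is_koala or is_blurry

def concept(is_koala, is_blurry):
    # What we actually wanted the system to learn.
    return is_koala

# Every boolean function on the four inputs, written as a truth table.
hypotheses = [
    dict(zip(inputs, outputs))
    for outputs in product([False, True], repeat=len(inputs))
]

# Keep only the hypotheses which reproduce every label.
consistent = [h for h in hypotheses if all(h[x] == label(*x) for x in inputs)]

concept_table = {x: concept(*x) for x in inputs}

print(len(hypotheses), "hypotheses,", len(consistent), "fit the data")
print("concept among them:", concept_table in consistent)
```

With full coverage of the input space the data is not underdetermined at all; it pins down the labelling process exactly, and the concept we wanted is ruled out by the data.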
If it’s read moral philosophy, it should have some notion of what the words “human values” mean.
In any case, I still don’t understand what you’re trying to get at. Suppose I pretrain a neural net to differentiate lots of non-marsupial animals. It doesn’t know what a koala looks like, but it has some lower-level “components” which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.
This is actually a tougher scenario than what you’re describing (GPT will have seen human values yet the pretrained net hasn’t seen koalas in my hypothetical), but it’s a boring application of transfer learning.
Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.
I generally agree with this. The things I’m saying about human values also apply to koala classification. As with koalas, I do think there’s probably a wide range of parameters which would end up using the “right” level of abstraction for human values to be “natural”. On the other hand, for both koalas and humans, we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute—again, because Bayesian updates on low-level physics are just better in terms of predictive power.
Right now, we have no idea when that line will be crossed—just an extreme upper bound. We have no idea how wide/narrow the window of training parameters is in which either “koalas” or “human values” is a natural level of abstraction.
Ability to differentiate marsupials does not imply that the system is directly using the concept of koala. Yet again, consider how Bayesian updates on low-level physics would respond to the marsupial-differentiation task: it would model the entire physical process which generated the labels on the photos/videos. “Physical process which generates the label koala” is not the same as “koala”, and the system can get higher predictive power by modelling the former rather than the latter.
When we move to human values, that distinction becomes a lot more important: “physical process which generates the label ‘human values satisfied’” is not the same as “human values satisfied”. Confusing those two is how we get Goodhart problems.
We don’t need to go all the way to low-level physics models in order for all of that to apply. In order for a system to directly use the concept “koala”, rather than “physical process which generates the label koala”, it has to be constrained on compute in a way which makes the latter too expensive—despite the latter having higher predictive power on the training data. Adding in transfer learning on some lower-level components does not change any of that; it should still be possible to use those lower-level components to model the physical process which generates the label koala without directly reasoning about koalas.
I’ve now written essentially the same response at least four times to your objections, so I recommend applying the general pattern yourself:
Consider how Bayesian updates on a low-level physics model would behave on whatever task you’re considering. What would go wrong?
Next, imagine a more realistic system (e.g. current ML systems) failing in an analogous way. What would that look like?
What’s preventing ML systems from failing in that way already? The answer is probably “they don’t have enough compute to get higher predictive power from a less abstract model”—which means that, if things keep scaling up, sooner or later that failure will happen.
You say: “we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute”. I think this depends on specific details of how the system is engineered.
Suppose we use classification accuracy as our loss function. If all the koalas are correctly classified by both models, then the two models have equal loss function scores. I suggested that at that point, we use some kind of active learning scheme to better specify the notion of “koala” or “human values” or whatever it is that we want. Or maybe just be conservative, and implement human values in a way that all our different notions of “human values” agree with.
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way. My expectation is that human brains have many different computational notions of any given concept, similar to an ensemble (for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is), and AGI will work the same way (at least, that’s how I would design it!)
I was trying to understand what you were getting at. This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments? (I don’t necessarily think it is bad for arguments to shift, I just think people should acknowledge that’s going on.)
It’s certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:
if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.
In one sense, the goal of all this abstract theorizing is to identify what those other criteria need to be in order to reliably end up using the “right” abstractions in the way we want. We could probably make up some ad-hoc criteria which work at least sometimes, but then as architectures and hardware advance over time we have no idea when those criteria will fail.
(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich—which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.
Also, even if there’s some sort of ensembling, the concept “sandwich” still needs to specify one particular ensemble.
We’ve shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We’ve shifted to talking about what alignment means in general, and what’s hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.
Prompts are a tool/interface via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc). Our current discussion is about the properties of a certain kind of goal: goals which are abstract in an analogous way to human values.
Optimize for having a diverse range of models that all seem to fit the data.
How would that fix any of the problems we’ve been talking about?
GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase “human values”, since moral philosophy is written by (confused) humans, and in human-written text the phrase “human values” is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.
This is essentially the “tasty ice cream flavors” problem, am I right? Trying to check if we’re on the same page.
If so: John Wentworth said
So how about instead of talking about “human values”, we talk about what a particular moral philosopher endorses saying or doing, or even better, what a committee of famous moral philosophers would endorse saying/doing.
No, this is not the “tasty ice cream flavors” problem. The problem there is that the concept is inherently relative to a person. That problem could apply to “human values”, but that’s a separate issue from what dxu is talking about.
The problem is that “what a committee of famous moral philosophers would endorse saying/doing”, or human written text containing the phrase “human values”, is a proxy for human values, not a direct pointer to the actual concept. And if a system is trained to predict what the committee says, or what the text says, then it will learn the proxy, but that does not imply that it directly uses the concept.
Well, the moral judgements of a high-fidelity upload of a benevolent human are also a proxy for human values—an inferior proxy, actually. Seems to me you’re letting the perfect be the enemy of the good.
It doesn’t matter how high-fidelity the upload is or how benevolent the human is, I’m not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. “Don’t let the perfect be the enemy of the good” is advice for writing emails and cleaning the house, not nuclear security.
The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism.
Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually have. They’d probably be better than many of the worst-case scenarios, but they still wouldn’t be a best or even good scenario. Humans just don’t have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.
Here are some of the people who have the power to set off nukes right now:
Donald Trump
Vladimir Putin
Kim Jong-un
Both parties in this conflict
And this conflict
Tell that to the Norwegian commandos who successfully sabotaged Hitler’s nuclear weapons program.
“A good plan violently executed now is better than a perfect plan executed at some indefinite time in the future.”—George Patton
Just because it’s in your nature (and my nature, and the nature of many people who read this site) to be a cautious nerd, does not mean that the cautious nerd orientation is always the best orientation to have.
In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome. It’s a classic motte-and-bailey:
“It’s very hard to build an AGI which isn’t a paperclipper!”
“Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI...”
“Yeah but we gotta be super perfectionistic because there is so much at stake!”
Your final “humans will misuse AI” worry may be justified, but I think naive deployment of this worry is likely to be counterproductive. Suppose there are two types of people, “cautious” and “incautious”. Suppose that the “humans will misuse AI” worry discourages cautious people from developing AGI, but not incautious people. So now we’re in a world where the first AGI is most likely controlled by incautious people, making the “humans will misuse AI” worry even more severe.
If you’re willing to grant the premise of the technical alignment problem being solved, shooting oneself in the foot would appear to be much less of a worry, because you can simply tell your FAI “please don’t let me shoot myself in the foot too badly”, and it will prevent you from doing that.
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper. Yes, there are straightforward ways one might be able to create a helpful non-paperclipper AGI. But that “might” is carrying a lot of weight. All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don’t know exactly what those parameter ranges are.
It’s sort of like saying:
“It’s very hard to design a long bridge which won’t fall down!”
“Well actually here are some straightforward ways one might be able to create a long non-falling-down bridge...” <shows picture of a wooden truss>
What I’m saying is, that truss design is 100% going to fail once it gets big enough, and we don’t currently know how big that is. When I say “it’s hard to design a long bridge which won’t fall down”, I do not mean a bridge which might not fall down if we’re lucky and just happen to be within the safe parameter range.
These are sufficient conditions for a careful strategy to make sense, not necessary conditions. Here’s another set of sufficient conditions, which I find more realistic: the gains to be had in reducing AI risk are binary. Either we find the “right” way of doing things, in which case risk drops to near-zero, or we don’t, in which case it’s a gamble and we don’t have much ability to adjust the chances/payoff. There are no significant marginal gains to be had.
This is simultaneously
a major retreat from the “default outcome is doom” thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that’s 99.9% likely to be safe, which is very much incompatible with “default outcome is doom”)
unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn’t good enough for you)
You’ve picked a position vaguely in between the motte and the bailey and said “the motte and the bailey are both equivalent to this position!” That doesn’t look at all true to me.
This is a very strong claim which to my knowledge has not been well-justified anywhere. Daniel K agreed with me the other day that there isn’t a standard reference for this claim. Do you know of one?
There are a couple problems I see here:
Simple is not the same as obvious. Even if someone at some point tried to think of every obvious solution and justifiably discarded them all, there are probably many “obvious” solutions they didn’t think of.
Nothing ever gets counted as evidence against this claim. Simple proposals get rejected on the basis that everyone knows simple proposals won’t work.
A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety. Maybe there are good arguments for that, but the problem is that if you’re not careful, your view of reality is gonna get distorted. Which means community wisdom on claims such as “simple solutions never work” is likely to be systematically wrong. “Everyone knows X”, without a good written defense of X, or a good answer to “what would change the community’s mind about X”, is fertile ground for information cascades etc. And this is on top of standard ideological homophily problems (the AI safety community is a very self-selected subset of the broader AI research world).
My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks. This is logically rude. And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not. From my perspective, you’ve pulled this conversational move multiple times in this thread. It seems to be pretty common when I have discussions with AI safety people. That’s part of why I find the discussions so frustrating. My view is that this is a cultural problem which has to be solved for the AI safety community to do much useful AI safety work (as opposed to “complaining about how hard AI safety is” work, which is useful but insufficient).
Anyway, I’ll let you have the last word in this thread.
For what it’s worth, my perception of this thread is the opposite of yours: it seems to me John Wentworth’s arguments have been clear, consistent, and easy to follow, whereas you (John Maxwell) have been making very little effort to address his position, instead choosing to repeatedly strawman said position (and also repeatedly attempting to lump in what Wentworth has been saying with what you think other people have said in the past, thereby implicitly asking him to defend whatever you think those other people’s positions were).
Whether you’ve been doing this out of a lack of desire to properly engage, an inability to comprehend the argument itself, or some other odd obstacle is in some sense irrelevant to the object-level fact of what has been happening during this conversation. You’ve made your frustration with “AI safety people” more than clear over the course of this conversation (and I did advise you not to engage further if that was the case!), but I submit that in this particular case (at least), the entirety of your frustration can be traced back to your own lack of willingness to put forth interpretive labor.
To be clear: I am making this comment in this tone (which I am well aware is unkind) because there are multiple aspects of your behavior in this thread that I find not only logically rude, but ordinarily rude as well. I more or less summarized these aspects in the first paragraph of my comment, but there’s one particularly onerous aspect I want to highlight: over the course of this discussion, you’ve made multiple references to other uninvolved people (either with whom you agree or disagree), without making any effort at all to lay out what those people said or why it’s relevant to the current discussion. There are two examples of this from your latest comment alone:
Ignoring the question of whether these two quoted statements are true (note that even the fixed version of the link above goes only to a top-level post, and I don’t see any comments on that post from the other day), this is counterproductive for a number of reasons.
Firstly, it’s inefficient. If you believe a particular statement is false (and furthermore, that your basis for this belief is sound), you should first attempt to refute that statement directly, which gives your interlocutor the opportunity to either counter your refutation or concede the point, thereby moving the conversation forward. If you instead counter merely by invoking somebody else’s opinion, you both increase the difficulty of answering and end up offering weaker evidence.
Secondly, it’s irrelevant. John Wentworth does not work at MIRI (neither does Daniel Kokotajlo, for that matter), so bringing up aspects of MIRI’s position you dislike does nothing but highlight a potential area where his position differs from MIRI’s. (I say “potential” because it’s not at all obvious to me that you’ve been representing MIRI’s position accurately.) In order to properly challenge his position, again it becomes more useful to critique his assertions directly rather than round them off to the closest thing said by someone from MIRI.
Thirdly, it’s a distraction. When you regularly reference a group of people who aren’t present in the actual conversation, repeatedly make mention of your frustration and “grumpiness” with those people, and frequently compare your actual interlocutor’s position to what you imagine those people have said, all while your actual interlocutor has said nothing to indicate affiliation with or endorsement of those people, it doesn’t paint a picture of an objective critic. To be blunt: it paints a picture of someone who has a one-sided grudge against the people in question and is attempting to inject that grudge into conversations where it shouldn’t be present.
I hope future conversations can be more pleasant than this.
I appreciate the defense and agree with a fair bit of this. That said, I’ve actually found the general lack of interpretive labor somewhat helpful in this instance—it’s forcing me to carefully and explicitly explain a lot of things I normally don’t, and John Maxwell has correctly pointed out a lot of seeming-inconsistencies in those explanations. At the very least, it’s helping make a lot of my models more explicit and legible. It’s mentally unpleasant, but a worthwhile exercise to go through.
I think I want John to feel able to have this kind of conversation when it feels fruitful to him, and not feel obligated to do so otherwise. I expect this is the case, but just wanted to make it common knowledge.
There isn’t a standard reference because the argument takes one sentence, and I’ve been repeating it over and over again: what would Bayesian updates on low-level physics do? That’s the unique solution with best-possible predictive power, so we know that anything which scales up to best-possible predictive power in the limit will eventually behave that way.
(BTW I think that link is dead)
The “what would Bayesian updates on a low-level model do?” question is exactly the argument that the bridge design cannot be extended indefinitely, which is why I keep bringing it up over and over again.
This does point to one possibly-useful-to-notice ambiguous point: the difference between “this method would produce an aligned AI” vs “this method would continue to produce aligned AI over time, as things scale up”. I am definitely thinking mainly about long-term alignment here; I don’t really care about alignment on low-power AI like GPT-3 except insofar as it’s a toy problem for alignment of more powerful AIs (or insofar as it’s profitable, but that’s a different matter).
I’ve been less careful than I should be about distinguishing these two in this thread. All these things which we’re saying “might work” are things which might work in the short term on some low-power AI, but will definitely not work in the long term on high-power AI. That’s probably part of why it seems like I keep switching positions—I haven’t been properly distinguishing when we’re talking short-term vs long-term.
A second comment on this:
If we want to make a piece of code faster, the first step is to profile the code to figure out which step is the slow one. If we want to make a beam stronger, the first step is to figure out where it fails. If we want to extend a bridge design, the first step is to figure out which piece fails under load if we just elongate everything.
Likewise, if we want to scale up an AI alignment method, the first step is to figure out exactly how it fails under load as the AI’s capabilities grow.
I think you currently do not understand the failure mode I keep pointing to by saying “what would Bayesian updates on low-level physics do?”. Elsewhere in the thread, you said that optimizing “for having a diverse range of models that all seem to fit the data” would fix the problem, which is my main evidence that you don’t understand the problem. The problem is not “the data underdetermines what we’re asking for”, the problem is “the data fully determines what we’re asking for, and we’re asking for a proxy rather than the thing we actually want”.
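A toy sketch of that last point, with everything in it assumed for illustration: `true_target`, `proxy_label`, and the family of threshold models are hypothetical. The data covers the whole input domain, so it fully determines which models fit, and the only model that fits is the proxy.

```python
def true_target(x):
    """Hypothetical thing we actually want: positive from 50 upwards."""
    return x >= 50

def proxy_label(x):
    """Hypothetical labelling process: also approves 40..49."""
    return x >= 40

# The data covers every point in the domain, so nothing is underdetermined.
domain = range(100)
data = [(x, proxy_label(x)) for x in domain]

def make_threshold(t):
    """One candidate model per threshold t: predicts positive iff x >= t."""
    return lambda x: x >= t

# A "diverse range of models": every threshold from 0 to 100.
candidates = {t: make_threshold(t) for t in range(101)}

def fits(model, data):
    """True if the model reproduces every label in the data."""
    return all(model(x) == y for x, y in data)

# Keep only the models which fit the data.
survivors = [t for t, m in candidates.items() if fits(m, data)]

# Of those, which agree with the thing we actually wanted?
matches_target = [t for t in survivors
                  if all(candidates[t](x) == true_target(x) for x in domain)]

print(survivors, matches_target)
```

In this sketch, asking for diversity among models which fit the data changes nothing: the single surviving model is the proxy, and no survivor matches the target.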