First and foremost, great post! “How do we get GPT to give the best health advice it can give?” is exactly the sort of thing I think about as a prototypical (outer) alignment problem. I also like the general focus on empirical directions and research-feedback mechanisms, as well as the fact that the approach could produce real economic value.
Now on to the more interesting part: how does this general strategy fail horribly?
If we set aside inner alignment and focus exclusively on outer alignment issues, then in general the failure mode which I think is far and away most likely is roughly “you get what you can measure” or “you get something designed to look good to human supervisors without actually being good”. In other words, the inability of humans to reliably/robustly evaluate outcomes is the big problem. (The Fusion Power Generator Scenario is one good example of the type of failure I’m talking about here: the human doesn’t understand what-they-want at a detailed enough level to even ask the right questions, let alone actually evaluate a design.)
So: I expect any version of “align narrowly superhuman models” which evaluates the success of the project entirely by human feedback to be completely and totally doomed, at-best useless and at-worst actively harmful to the broader project of alignment. Worse, I expect that those are exactly the sort of projects which will produce the most impressive demos, potentially attract investors, etc. After all, their outputs are optimized for looking good to humans (without actually being good) - of course they’re going to look good to human investors, engineers, etc!
Now, what’s really interesting about this piece is that you propose at least one approach (the sandwich method) explicitly addressing that failure mode. Personally, I found that idea the most interesting and promising part of this whole piece. I’ll even register a prediction: 80% that the “sandwich problem” cannot be solved in a domain-generalizable way without major conceptual progress on (outer) alignment. Though I would not be surprised if attempts to solve the sandwich problem failed in ways which directly led to at least some conceptual progress, and I do still expect empirical work on the problem is likely to be valuable for that reason. (Though I also expect that a lot of people will read the description of the sandwich problem and fail to understand the requirement which makes it interesting: namely, that the experts be completely and totally absent from the training process, and in particular no data from experts should be involved in the training process.) If I thought I had a generalizable method capable of solving the sandwich problem, it would probably be the highest-priority thing on my agenda.
Thanks for the comment! Just want to explicitly pull out and endorse this part:
the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process
I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the “sandwich” problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also take steps toward finding actual algorithms in the course of doing one of the sandwich problems).
I also broadly agree with you that “things looking good to humans without actually being good” is a major problem to watch out for. But I don’t think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback. (E.g., one of the papers I link in the post is a mainstream ML paper amplifying a weak training signal into a better one.)
But I don’t think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback.
I partially agree with this; alignment is a bottleneck to value for GPT, and actually aligning it would likely produce some very impressive stuff. My disagreement is that it’s a lot easier to make something which looks impressive than something which solves a Hard problem (like the sandwich problem), and therefore most impressive-looking “solutions” will probably circumvent the key part of the problem. And if the Hard problem is indeed hard enough to not be solved by anyone, the most impressive-looking results will be those which look good without actually solving it.
I guess the crux here is “And if the Hard problem is indeed hard enough to not be solved by anyone.” I don’t think that’s the default/expected outcome. There hasn’t been that much effort on this problem in the scheme of things, and I think we don’t know where it falls in the range from “pretty easy” to “very hard” right now.
Ah… I think we have an enormous amount of evidence on very-similar problems.
For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn’t know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.
In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the “sandwich problem” would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don’t think we have a good solution in practice; I’d expect the expert business-owner to usually come up with a much better contract.
This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn’t understand what the designer wants), versus a product designer who’s also a fluent coder and familiar with the code base. I’ve experienced this one first-hand; the expert product designer is way better. Or, consider a well-intentioned mortgage salesman, who wants to get their customer the best mortgage for them, and the customer who understands the specifics of their own life but knows nothing about mortgages. Will they end up with as good a mortgage as a customer who has expertise in mortgages themselves? Probably not. (I’ve seen this one first-hand too.)
One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can’t write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there’s hope in the AI case (e.g. that’s a hope behind iterated amplification).
How does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.
Yeah, sorry, that’s right, I was speaking pretty loosely. You’d still have the same hope: maybe a team of 2^100 copies of the business owner could draft a contract just as well as, or better than, an already expert business-owner. I just personally find it easier to think about “benefits of a human thinking for a long time” and then “does HCH get the same benefits as humans thinking for a long time” and then “does iterated amplification get the same benefits as HCH”.
Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don’t see any reason at all to expect it to do anything remotely similar to that.
The intuition for it is something like this: suppose I’m trying to make a difficult decision, like where to buy a house. There are hundreds of cities I’d be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood.
If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of “holistic judgment of neighborhood X”, and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
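The clone-army picture can be sketched in code. This is only a toy illustration under strong assumptions: the feature scores are made-up numbers, the "holistic judgment" is a plain average, and every name here (`examine_feature`, `holistic_judgment`, `pairwise_bracket`, `pick_neighborhood`) is hypothetical. It shows the shape of the decomposition, not evidence that real decisions factor this cleanly.

```python
from typing import Callable, Dict, List

# Hypothetical data: neighborhood -> feature -> score in [0, 1].
Scores = Dict[str, Dict[str, float]]


def examine_feature(scores: Scores, neighborhood: str, feature: str) -> float:
    """One clone inspects a single feature of a single neighborhood."""
    return scores[neighborhood][feature]


def holistic_judgment(scores: Scores, neighborhood: str) -> float:
    """A coordinating clone aggregates the per-feature reports.

    Assumption: aggregation is a plain average of the reports.
    """
    reports = [examine_feature(scores, neighborhood, f)
               for f in scores[neighborhood]]
    return sum(reports) / len(reports)


def pairwise_bracket(candidates: List[str],
                     judge: Callable[[str], float]) -> str:
    """Single-elimination bracket; each comparison is one small task."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        winners = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            winners.append(a if judge(a) >= judge(b) else b)
        if len(current) % 2 == 1:
            winners.append(current[-1])  # odd one out advances unopposed
        current = winners
    return current[0]


def pick_neighborhood(scores: Scores) -> str:
    """The 'top me': run the bracket over holistic judgments."""
    return pairwise_bracket(list(scores),
                            lambda n: holistic_judgment(scores, n))
```

Note that no single function call ever looks at more than one feature, one neighborhood's reports, or one pair of candidates; the work is spread across many small judgments.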
Yeah, in the context of a larger alignment scheme, it’s assuming that in particular the problem of answering the question “How good is the AI’s proposed action?” will factor down into sub-questions of manageable size.
I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis).
That’s what I have in mind. If all goes well you can think of it like “a human thinking a long time.” We don’t know if all will go well.
It’s also not really clear what “a human thinking 10,000 years” means. HCH is kind of an operationalization of that, but there’s a presumption of alignment in the human-thinking-a-long-time that we don’t get for free here. (Of course you also wouldn’t get it for free if you somehow let a human live for 10,000 years...)
Well, Paul’s original post presents HCH as the specification of a human’s enlightened judgment.
For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.
And if we follow the links to Paul’s previous post about this concept, he does describe his ideal implementation of considered judgment (what will become HCH) using the intuition of thinking for a decent amount of time.
To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.
So it looks to me like “HCH captures the judgment of the human after thinking for a long time” is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question to which I don’t know the answer.
A line of thought about this that I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision the human would give after thinking for a long time), whereas we need to argue for them in HCH.
I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info:
Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from???
… I don’t really know. My guess is that I picked it up from reading giant comment threads between Paul and other people.
I don’t see any reason at all to expect it to do anything remotely similar to that.
Tbc it doesn’t need to be literally true. The argument needed for safety is something like “a large team of copies of non-expert agents could together be as capable as an expert”. I see the argument “it’s probably possible for a team of agents to mimic one agent thinking for a long time” as mostly an intuition pump for why that might be true.
“As capable as an expert” makes more sense. Part of what’s confusing about “equivalent to a human thinking for a long time” is that it’s picking out one very particular way of achieving high capability, but really it’s trying to point to a more-general notion of “HCH can solve lots of problems well”. Makes it sound like there’s some structural equivalence to a human thinking for a long time, which there isn’t.
HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.
(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
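A minimal sketch of the "bureaucracy" structure, assuming a task that factors perfectly (summing a list). The per-person limit `attention`, the `Stats` record, and the function `hch_sum` are all hypothetical names invented for this illustration; real HCH is an idealized, unbounded recursion over human question-answering, not a Python function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stats:
    nodes: int = 0          # how many "people" were consulted in total
    max_node_work: int = 0  # most items any one person handled directly


def hch_sum(numbers: List[int], attention: int, stats: Stats) -> int:
    """Answer "what is the sum?" when no one person may look at more
    than `attention` items (assumed limit; must be at least 2)."""
    if attention < 2:
        raise ValueError("attention must be >= 2 for the recursion to shrink")
    stats.nodes += 1
    if len(numbers) <= attention:
        # Small enough for one person to answer directly.
        stats.max_node_work = max(stats.max_node_work, len(numbers))
        return sum(numbers)
    # Too big: split into at most `attention` sub-questions and
    # hand each one to an underling, who may delegate further.
    chunk = -(-len(numbers) // attention)  # ceiling division
    sub_answers = [hch_sum(numbers[i:i + chunk], attention, stats)
                   for i in range(0, len(numbers), chunk)]
    # This person only combines the underlings' answers.
    stats.max_node_work = max(stats.max_node_work, len(sub_answers))
    return sum(sub_answers)
```

Total work across the tree can be large while `max_node_work` never exceeds `attention`, which is the point about nobody thinking for a long time. Summation is chosen precisely because it factors trivially; whether hard questions factor like this is the open issue.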
Yes sorry — I’m aware that in the HCH procedure no one human thinks for a long time. I’m generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could “effectively replicate the benefits you could get from having a human thinking a long time,” in terms of the role that it plays in an overall scheme for alignment. This isn’t guaranteed to work out, of course. My position is similar to Rohin’s above:
I just personally find it easier to think about “benefits of a human thinking for a long time” and then “does HCH get the same benefits as humans thinking for a long time” and then “does iterated amplification get the same benefits as HCH”.
I expect any version of “align narrowly superhuman models” which evaluates the success of the project entirely by human feedback to be completely and totally doomed, at-best useless and at-worst actively harmful to the broader project of alignment
There are plenty of problems where evaluating a solution is way, way easier than finding the solution. I’m doubtful that the model could somehow produce a “looks good to a human but doesn’t work” solution to “what is a room-temperature superconductor?”. I agree that for biological problems the issue is much more concerning, and certainly for any kind of societal problem, but as long as we stay close to math, physics and chemistry, “looks good to a human” and “works” are pretty closely related to each other.
Hm, interesting, I’m actually worried about a totally different implication of “you get what you can measure.”
E.g.:
“If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide—are the humans allowed to say “hold on, I don’t want that,” or are we just going to accept that as what peak performance looks like? So anyhow I’m pessimistic about sandwiching for moral questions.”
I’m curious if the upvote disparity means I’m the minority position here :P
I think one argument running through a lot of the sequences is that the parts of “human values” which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as “moral questions”. Like, these examples from your comment below:
Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?
If an AGI is hung up on these sorts of questions, then we’ve already mostly-won. That’s already an AI which is unlikely to wipe out the human species as a side-effect of maximizing the number of paperclips in the universe. It’s already an AI which is unlikely to induce a heart attack in its user in hopes that the user falls onto the positive feedback button. It’s already an AI which is unlikely to flood a room in order to fill a cauldron with water.
The vast majority of human values are not things we typically think of as “moral questions”; they’re things which are so obvious that we usually don’t even think of them until they’re pointed out. But they’re still value judgements, and we can’t expect an AGI to share those value judgements by default. If we’re down to the sorts of things people usually think of as moral questions, then the vast majority of human values have already been solved.
Given that this is LW, and this was a major takeaway of the sequences (or at least it was for me), I’d guess that’s probably a fairly common background assumption.
I’d say “If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human ‘moral experts’ are going to disagree about], then we’ve already mostly-won” is an accurate correlation, but doesn’t stand up to optimization pressure. We can’t mostly-win just by fine-tuning a language model to do moral discourse. I’d guess you agree?
Anyhow, my point was more: You said “you get what you can measure” is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said “you get what you measure” is a problem because humans can disagree when their values are ‘measured’ without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
We can’t mostly-win just by fine-tuning a language model to do moral discourse.
Uh… yeah, I agree with that statement, but I don’t really see how it’s relevant. If we tune a language model to do moral discourse, then won’t it be tuned to talk about things like Terri Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like “they said they want fusion power, but they probably also want it to not be turn-into-bomb-able”.
Or are you using “moral discourse” in a broader sense?
You said “you get what you can measure” is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said “you get what you measure” is a problem because humans can disagree when their values are ‘measured’ without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
I disagree with the exact phrasing “fact of the matter for whether decisions are good or bad”; I’m not supposing there is any “fact of the matter”. It’s hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want.
Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.
I’d say “If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human ‘moral experts’ are going to disagree about], then we’ve already mostly-won” is an accurate correlation, but doesn’t stand up to optimization pressure. We can’t mostly-win just by fine-tuning a language model to do moral discourse. I’d guess you agree?
English sentences don’t have to hold up to optimization pressure; our AI designs do. If I say “I’m hungry for pizza after I work out” and you reply “that doesn’t hold up to optimization pressure, I can imagine universes where you’re not hungry for pizza”, then, okay, but that misses the point. There’s an implicit notion here of “if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won.”
Perhaps this notion isn’t obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer.
Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say “this seems true in the main, although I can imagine situations where it’s not.” Maybe this is what you meant, in which case I agree.
I see, so it’s basically assuming that problems factor.
Yes, I explicitly agree with this, which is why the first thing in my previous response was
My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.
HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.
(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
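To make the "infinite bureaucracy" picture concrete, here is a toy sketch. It is only an illustration under loud assumptions: the task (summing a list), the per-node budget `MAX_DIRECT`, and the names `hch_sum` and `Stats` are all made up for this example and are not part of any HCH definition. The point it shows is just that no single node does more than a bite-size step, while the total work across all recursive underlings can be large.

```python
from dataclasses import dataclass

MAX_DIRECT = 2  # hypothetical budget: how many numbers one node may add itself


@dataclass
class Stats:
    nodes: int = 0           # total number of "people" consulted
    max_local_work: int = 0  # largest amount of work any single node did


def hch_sum(numbers, stats):
    """Toy bureaucracy: each node does a bite-size step or delegates."""
    stats.nodes += 1
    if len(numbers) <= MAX_DIRECT:
        # Small enough to answer directly in a short time.
        stats.max_local_work = max(stats.max_local_work, len(numbers))
        return sum(numbers)
    # Otherwise factor the question into two sub-questions for underlings.
    mid = len(numbers) // 2
    left = hch_sum(numbers[:mid], stats)
    right = hch_sum(numbers[mid:], stats)
    # Combining two sub-answers is itself a bite-size step.
    stats.max_local_work = max(stats.max_local_work, 2)
    return left + right


stats = Stats()
total = hch_sum(list(range(1, 1001)), stats)
print(total, stats.nodes, stats.max_local_work)
```

Summing is a task that factors trivially, which is exactly what makes this a toy: the open question for HCH is whether hard cognitive work can be broken up this cleanly at all.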
Yes sorry — I’m aware that in the HCH procedure no one human thinks for a long time. I’m generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could “effectively replicate the benefits you could get from having a human thinking a long time,” in terms of the role that it plays in an overall scheme for alignment. This isn’t guaranteed to work out, of course. My position is similar to Rohin’s above:
There are plenty of problems where evaluating a solution is way way easier than finding the solution. I’m doubtful that the model could somehow produce a “looks good to a human but doesn’t work” solution to “what is a room-temperature superconductor?”. I agree that for biological problems the issue is much more concerning, and certainly for any kind of societal problem, but as long as we stay close to math, physics and chemistry, “looks good to a human” and “works” are pretty closely related to each other.
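A toy illustration of the "evaluating is easier than finding" asymmetry, using subset sum. The choice of problem and the function names `verify` and `search` are mine, not the commenter's, and the sketch says nothing about whether a real-world design (a superconductor, a reactor) is actually easy for a human to evaluate. It only shows the formal version of the gap: checking a proposed answer takes one pass over it, while brute-force search may try up to 2^n candidates.

```python
from itertools import combinations


def verify(numbers, target, candidate):
    """Cheap check: candidate indices are distinct, valid, and sum to target."""
    return (
        len(set(candidate)) == len(candidate)
        and all(0 <= i < len(numbers) for i in candidate)
        and sum(numbers[i] for i in candidate) == target
    )


def search(numbers, target):
    """Brute-force search: tries up to 2**len(numbers) index subsets."""
    checked = 0
    for size in range(len(numbers) + 1):
        for candidate in combinations(range(len(numbers)), size):
            checked += 1
            if verify(numbers, target, candidate):
                return candidate, checked
    return None, checked


numbers = [3, 34, 4, 12, 5, 2]
solution, checked = search(numbers, 9)
print(solution, checked, verify(numbers, 9, solution))
```

Note that `verify` here is a complete specification of success. The disagreement in this thread is about domains where the human's check is not complete in that sense.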
Hm, interesting, I’m actually worried about a totally different implication of “you get what you can measure.”
E.g.:
“If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide. Are the humans allowed to say ‘hold on, I don’t want that,’ or are we just going to accept that as what peak performance looks like? So anyhow I’m pessimistic about sandwiching for moral questions.”
I’m curious if the upvote disparity means I’m in the minority here :P
I think one argument running through a lot of the sequences is that the parts of “human values” which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as “moral questions”. Like, these examples from your comment below:
If an AGI is hung up on these sorts of questions, then we’ve already mostly-won. That’s already an AI which is unlikely to wipe out the human species as a side-effect of maximizing the number of paperclips in the universe. It’s already an AI which is unlikely to induce a heart attack in its user in hopes that the user falls onto the positive feedback button. It’s already an AI which is unlikely to flood a room in order to fill a cauldron with water.
The vast majority of human values are not things we typically think of as “moral questions”; they’re things which are so obvious that we usually don’t even think of them until they’re pointed out. But they’re still value judgements, and we can’t expect an AGI to share those value judgements by default. If we’re down to the sorts of things people usually think of as moral questions, then the vast majority of human values have already been solved.
Given that this is LW, and this was a major takeaway of the sequences (or at least it was for me), I’d guess that’s probably a fairly common background assumption.
I’d say “If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human ‘moral experts’ are going to disagree about], then we’ve already mostly-won” is an accurate correlation, but doesn’t stand up to optimization pressure. We can’t mostly-win just by fine-tuning a language model to do moral discourse. I’d guess you agree?
Anyhow, my point was more: You said “you get what you can measure” is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said “you get what you measure” is a problem because humans can disagree when their values are ‘measured’ without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
Uh… yeah, I agree with that statement, but I don’t really see how it’s relevant. If we tune a language model to do moral discourse, then won’t it be tuned to talk about things like Terri Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like “they said they want fusion power, but they probably also want it to not be turn-into-bomb-able”.
Or are you using “moral discourse” in a broader sense?
I disagree with the exact phrasing “fact of the matter for whether decisions are good or bad”; I’m not supposing there is any “fact of the matter”. It’s hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want.
Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.
English sentences don’t have to hold up to optimization pressure; our AI designs do. If I say “I’m hungry for pizza after I work out” and you reply “that doesn’t hold up to optimization pressure; I can imagine universes where you’re not hungry for pizza”, then it’s like… okay, but that misses the point? There’s an implicit notion here of “if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won.”
Perhaps this notion isn’t obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer.
Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say “this seems true in the main, although I can imagine situations where it’s not.” Maybe this is what you meant, in which case I agree.