I wrote some thoughts that look like they won’t get posted anywhere else, so I’m just going to paste them here with minimal editing:
They (ARC) seem to imagine that for all the cases that matter, there’s some ground-truth-of-goodness judgment the human would make if they knew the facts (in a fairly objective way that can be measured by how well the human does at predicting things), and so our central challenge is to figure out how to tell the human the facts (or predict what the human would say if they knew all the facts).
In contrast, I don’t think there’s some “state of knowing all the facts” the human can be in. There are multiple ways of “giving the human the facts” that can lead to different judgments in even mildly interesting cases. And we’re stuck with this indeterminacy—trying to get rid of it seems to me like saying “no, I mean just use the true model of the axioms.”
I think one intuitive response to this is to say “okay, so let’s put a measure over ways to inform humans, and then sample or take some sort of average.” But I think this is trying too hard to plow straight ahead while assuming humans are on average sensible/reliable. Instead, there are going to be some cases where humans tend to converge on a small number of sensible answers, and there we can go ahead and do some kind of averaging or take the mode. There are other cases where humans have important higher-order preferences about how they want certain information to be processed, and would call the naive average biased or bad. And there are still other cases where humans don’t converge well at all, where we want the AI not to just plow ahead with an average but to notice that humans are being incompetent and not put much weight on their opinion.
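As a toy illustration of that branching (my own sketch, not anything ARC proposes; the function name, threshold, and scores are all made up, and the higher-order-preferences case is omitted for brevity): given the judgments a human would reach under several different ways of being shown the facts, pool them, but only put real weight on the pooled verdict when the judgments actually converge.

```python
import statistics

def pool_judgments(judgments, spread_threshold=0.2):
    """Toy aggregation over ways of informing a human (hypothetical).

    judgments: one score in [0, 1] per way of presenting the facts.
    Returns (pooled_verdict, weight): an average judgment plus how much
    weight the AI should place on it.
    """
    spread = max(judgments) - min(judgments)
    pooled = statistics.mean(judgments)
    if spread <= spread_threshold:
        # Humans converge on a small range of sensible answers:
        # a simple average (or the mode) is fine, and we can trust it.
        return pooled, 1.0
    # Humans don't converge: don't just plow ahead with the average;
    # report it, but put little weight on the human opinion.
    return pooled, 0.1

# Convergent case vs. non-convergent case.
print(pool_judgments([0.80, 0.85, 0.90]))  # -> (0.85, 1.0)
print(pool_judgments([0.10, 0.50, 0.95]))  # -> (~0.52, 0.1)
```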
Generally we are asking for an AI that doesn’t give an unambiguously bad answer, and if there’s any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn’t unambiguously bad and we’re fine if the AI gives it.
There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it’s catastrophic for our AI not to make the “correct” judgment. I’m not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples.
For example, note that ELK is never trying to answer any questions of the form “how good is this outcome?”; I certainly agree that there can also be ambiguity about questions like “did the diamond stay in the room?”, but it’s a fairly different situation. The most relevant sections are “Narrow elicitation and why it might be sufficient,” which gives a lot of examples of where we think we can/can’t tolerate ambiguity, and to a lesser extent “Avoiding subtle manipulation,” which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.
When you say “some case in which a human might make different judgments, but where it’s catastrophic for the AI not to make the correct judgment,” what I hear is “some case where humans would sometimes make catastrophic judgments.”
I think such cases exist, and they’re a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are reasons to expect optimization pressure pushing the plans an AI proposes toward the limits of human judgment.
I think the problem you’re getting at here is real: the path-dependency of what a human believes on how they came to believe it, holding everything else fixed (e.g., what the beliefs refer to). But I also think ARC’s ELK problem is not claiming this isn’t a real problem; rather, it is bracketing (deferring) it for as long as possible, because there are cases where ELK fails that don’t have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).