Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment—which of course most people can’t do because they project the category boundary onto the environment
Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient representations of systems that strongly couple to human behavior to include human values as somewhat explicit variables. I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
At lower confidence, I also think human expected-value-trajectory-under-additional-somewhat-coherent-reflection would show up explicitly in the thoughts of AIs that try to predict systems strongly coupled to humans. I think this because humans seem to change their values enough over time in a sufficiently coherent fashion that this is a useful concept to have. E.g., when watching my cousin grow up, I find it useful and possible to have a notion in advance of what they will come to value when they are older and think more about what they want.
I do not think there is much reason by default for the representations of these human values and human value trajectories to be particularly related to the AI’s values in a way we like. But that they are in there at all sure seems like it’d make some research easier, compared to the counterfactual. For example, if you figure out how to do good interpretability, you can look into an AI and get a decent mathematical representation of human values and value trajectories out of it. This seems like a generally useful thing to have.
If you separately happen to have developed a way to point AIs at particular goals, perhaps also downstream of you having figured out how to do good interpretability[1], then having explicit access to a decent representation of human values and human expected-value-trajectories-under-additional-somewhat-coherent-reflection might be a good starting point for research on making superhuman AIs that won’t kill everyone.
[1] By ‘good interpretability’, I don’t necessarily mean interpretability at the level where we understand a forward pass of GPT-4 so well that we can code our own superior LLM by hand in Python like a GOFAI. It might need to be better interpretability than that. This is because an AI’s goals, by default, don’t need to be explicitly represented objects within the parameter structure of a single forward pass.
“I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.”
Sure, but the sort of thing that people actually optimize for (revealed preferences) tends to be very different from what they proclaim to be their values. This is a point not often raised in polite conversation, but to me it’s a key reason for the thing people call “value alignment” being incoherent in the first place.
I kind of expect that things-people-call-their-values-that-are-not-their-revealed-preferences would be a concept that a smart AI that predicts systems coupled to humans would think in as well. It doesn’t matter whether these stated values are ‘incoherent’ in the sense of not being in tune with actual human behavior; they’re useful for modelling humans because humans use them to model themselves, and these self-models couple to their behavior, even if they don’t couple in the sense of being the revealed preferences in an agentic model of the humans’ actions.
Every time a human tries and mostly fails to explain what things they’d like to value if only they were more internally coherent and thought harder about things, a predictor trying to forecast their words and future downstream actions has a much easier time of it if it has a crisp operationalization of the endpoint the human is failing to operationalize.
An analogy: If you’re trying to predict what sorts of errors a diverse range of students might make while trying to solve a math problem, it helps to know what the correct answer is. Or if there isn’t a single correct answer, what the space of valid answers looks like.
Oh, sure, I agree that an ASI would understand all of that well enough, but even if it wanted to, it wouldn’t be able to give us either all of what we think we want or what we would endorse in some hypothetical enlightened way, because neither of those things constitutes a coherent framework that robustly generalizes far out-of-distribution for human circumstances, even for one person, never mind the whole of humanity.
The best we could hope for is that some-true-core-of-us-or-whatever would generalize in such a way, and that the AI recognizes this and propagates it while sacrificing the inessential contradictory parts. But given that our current state of moral philosophy is hopelessly out of its depth relative to this, to the extent that people rarely even acknowledge these issues, trusting that AI would get this right seems like a desperate gamble to me, even granting that we somehow could make it want to.
Of course, it doesn’t look like we would get to choose not to be subjected to a gamble of this sort even if more people were aware of it, so maybe it’s better for them to remain in blissful ignorance for now.