I think human values have a very simple and theoretically predictable basis: they’re derived from a grab-bag of evolved behavioral, cognitive and sensory heuristics which had a good cost/performance ratio for maximizing our evolutionary fitness (mostly on the savannah). So the basics of some of them are really easy to figure out: e.g. “Don’t kill everyone!” can be trivially derived from Darwinian first principles (and would equally apply to any other sapient species). So I think modelling human values to low (but hopefully sufficient for avoiding X-risk) accuracy is pretty simple. E.g. if there were a guide for alien zookeepers (who were already familiar with Terran biochemistry) on how to keep humans, how long would that need to be for the humans to mostly survive in captivity? I’m guessing a single textbook could do a good job of this, maybe even just a long chapter in a textbook.
However, I think there is a lot more complexity in the finer/subtler details, much of which is biological in nature, starting with the specific grab-bag of heuristics that evolution happened to land on and their tuning, then with even more sociological/cultural/historical complexity layered on top. So where I think the complexity ramps up a lot is if you want to do a really good job of modelling human values accurately in all their detail, as we would clearly prefer our ASIs to do. If you look through the Dewey Decimal system, roughly half the content of any general-purpose library is devoted to sub-specialities of “how to make humans happy”. However, LLMs are good at learning large amounts of complex, nuanced information. So an LLM knowing how to make humans happy in a lot of detail is not that surprising: in general, modern LLMs display detailed knowledge of this material.
The challenging part is ensuring that an LLM-powered agent cares about making humans happy more than, say, a typical human autocrat does. Base model LLMs are “distilled” from many humans, so they absorb humans’ capacity for consideration for others, and also humans’ less aligned traits like competitiveness and ambition. The question then is how to control which of these dominates, and how reliably, in agents powered by an instruct-trained LLM.
I think the key crux is that this, in my view, is basically unnecessary:
> However, I think there is a lot more complexity in the finer/subtler details, most of which is biological in nature, starting with the specific grab-bag of heuristics that evolution happened to land on and their tuning, with even more sociological/cultural/historical complexity layered on top. So where I think the complexity ramps up a lot is if you want to do a really good job of modelling human values accurately, as we would clearly prefer our ASIs to do.
@Steven Byrnes talks about how the mechanisms used in human brains might be horrifically complicated, while their function is simple enough that you can code it quite well and robustly for AIs. Where I differ from @Steven Byrnes is that I believe this basically also works for the things that give humans their values, like the social learning parts of our brains.
Thus it’s a bit of a conditional claim: either the mechanism used in human brains is also simple, or we can simplify it radically, preserving the core function while discarding the (in my view) unnecessary complexity. That’s the takeaway I have from LLMs learning human values.
Link and quote below:
https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_
> In other words, the brain’s implementation of that thing can be super-complicated, but the input-output relation cannot be that complicated—at least, the useful part of the input-output relation cannot be that complicated.
> The crustacean stomatogastric ganglion central pattern generators discussed above are a great example: their mechanisms are horrifically complicated, but their function is simple: they create a rhythmic oscillation. Hey, you need a rhythmic oscillation in your AGI? No problem! I can do that in one line of Python.
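To make the quoted point concrete, here is a minimal sketch of a rhythmic oscillator in Python. It is my own illustration rather than Byrnes’s code: the function name `oscillator` and the `period` parameter are assumptions chosen for the example, and a sine wave is just one simple choice of rhythm.

```python
import math


def oscillator(t: float, period: float = 1.0) -> float:
    """Return a rhythmic signal in [-1, 1] at time t.

    Hypothetical helper for illustration: the whole "function" of the
    pattern generator is captured by one sine expression.
    """
    return math.sin(2 * math.pi * t / period)


# Sample one full cycle at quarter-period steps.
samples = [oscillator(step / 4) for step in range(5)]
```

The body of the function really is one line; all of the biological complexity lives in the mechanism, not in the input-output relation.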
Also, a question about this quote: what capability/compute level is assumed in this thought experiment?
> E.g. if there were a guide for alien zookeepers (ones already familiar with Terran biochemistry) on how to keep humans, how long would it need to be for the humans to mostly survive?
I basically agree, for three reasons:
1. The level of understanding of and caring about human values required to not kill everyone and be able to keep many humans alive is actually pretty low (especially on the knowledge side).
2. That’s also basically sufficient to motivate wanting to learn more about human values, and being able to, so the Value Learning process then kicks in: a competent and caring alien zookeeper would want to learn more about their charges’ needs.
3. We have entire libraries half of whose content is devoted to “how to make humans happy”, and we already fed most of them into our LLMs as training material. On a factual basis, knowing how to make humans happy in quite a lot of detail (and for a RAG agent, looking up details they don’t already have memorized) is clearly well within their capabilities. The part that concerns me is the caring side, and that’s not conceptually complicated: roughly speaking, the question is how to ensure an agent’s selfless caring for humans is consistently a significantly stronger motivation than various bad habits like ambition, competitiveness, and power-seeking that it picked up from us during the “distillation” of the base model, and/or learnt during RL training.
> Also, a question about this quote: what capability/compute level is assumed in this thought experiment?
> E.g. if there were a guide for alien zookeepers (ones already familiar with Terran biochemistry) on how to keep humans, how long would it need to be for the humans to mostly survive?
ASI, or high AGI: capable enough that we’ve lost control and alignment is an existential risk.
> ASI, or high AGI: capable enough that we’ve lost control and alignment is an existential risk.
Then the answer is probably kilobytes to megabytes; at any rate, the guide for alien zookeepers can be very short, and the rest can be learned from data.
I like your point that humans aren’t aligned, and while I’m more optimistic about human alignment than you are, I agree that the level of human alignment currently is not enough to make a superintelligence safe if it only had human levels of motivation/reliability.
Weirdly enough, I think getting aligned superintelligence is both harder and easier than you do. I’m defining alignment as you do: we could deploy a superintelligence into the world that at least cares totally for humans, and that doesn’t need restraints on its power such as law enforcement or a government of superintelligences.
The thing that makes alignment harder is that I believe achieving FOOM for AIs, while unlikely, isn’t obviously impossible. I suspect a whole lot of algorithmic progress will be made right around the cusp when AIs start to automate research without humans in the loop. At that point the only real bottlenecks are power and physical interfaces like robotics, and if these are easy or very easy to solve, I see fast FOOM as being very plausible.
The thing that makes alignment easier is that, currently, alignment generalizes further than capabilities, which is good for us. For deep reasons, it’s looking like influencing an AI’s values through its data is far easier than giving it great capabilities such as being an autonomous researcher, which means we could get by on smaller quantities of data, assuming very high sample efficiency:
> In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
Link below:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
I think that we agree on a lot, and only really disagree, if we disagree at all, on how much data is necessary for a good outcome.
> I like your point that humans aren’t aligned, and while I’m more optimistic about human alignment than you are, I agree that the level of human alignment currently is not enough to make a superintelligence safe if it only had human levels of motivation/reliability.
The most obvious natural experiments about what humans do when they have a lot of power with no checks-and-balances are autocracies. While there are occasional examples (such as Singapore) of autocracies that didn’t work out too badly for the governed, they’re sadly few and far between. The obvious question then is whether “humans who become autocrats” are a representative random sample of all humans, or if there’s a strong selection bias here. It seems entirely plausible that there are at least some selection effects in the process of becoming an autocrat. A couple of percent of all humans are sociopaths, so if there were a sufficiently strong (two orders of magnitude or more) selection bias, then this might, for example, be a natural experiment about the alignment properties of a set of humans consisting mostly of sociopaths, in which case it usually going badly would be unsurprising.
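As a rough sanity check on that arithmetic, here is a small sketch. It assumes a 2% base rate (my reading of “a couple of percent”) and models the selection bias as a single hypothetical `selection_ratio`: how many times more likely a sociopath is than a non-sociopath to end up as an autocrat. The function name and the simple odds model are illustrative assumptions, not an empirical estimate.

```python
def fraction_after_selection(base_rate: float, selection_ratio: float) -> float:
    """Fraction of the selected group that has the trait.

    base_rate: share of the general population with the trait (assumed 0.02 below).
    selection_ratio: relative chance of being selected for someone with the
        trait versus someone without it (a hypothetical modelling parameter).
    """
    with_trait = base_rate * selection_ratio  # weighted share with the trait
    without_trait = 1.0 - base_rate           # weighted share without it
    return with_trait / (with_trait + without_trait)


# Compare no bias, one, two, and three orders of magnitude of bias.
for ratio in (1, 10, 100, 1000):
    share = fraction_after_selection(0.02, ratio)
    print(f"selection ratio {ratio:>4}x -> {share:.1%} of autocrats")
```

Under these assumptions, a 10x bias still leaves sociopaths a minority among autocrats, while a 100x bias makes them roughly two thirds, which is why a bias of two orders of magnitude or more is needed for “mostly”.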
The thing that concerns me is the aphorism “Power corrupts, and absolute power corrupts absolutely”. There does seem to be a strong correlation between how long someone has had a lot of power and an increasing likelihood of them using it badly. That’s one of the reasons for term limits in positions like president: humans seem to pretty instinctively not trust a leader after they’ve been in a position of a lot of power with few checks-and-balances for roughly a decade. The histories of autocracies tend to reflect them getting worse over time, on decade time-scales. So I don’t think the problem here is just from sociopaths. I think the proportion of humans who wouldn’t eventually be corrupted by a lot of power with no checks-and-balances may be fairly low, comparable to the proportion of honest senior politicians, say.
How much of this argument applies to ASI agents powered by LLMs “distilled” from humans is unclear — it’s much more obviously applicable to uploads of humans that then get upgraded to super-human capabilities.
IMO, there are fairly strong arguments that there is a pretty bad selection effect, with people who aim to get into power generally being more Machiavellian/sociopathic than other people. At least part of the problem is also that the parts of your brain that care about other people get damaged when you gain power, which is obviously not good.
But still, I agree with you that an ASI that can entirely run society while only being as aligned as humans are to very distant humans likely leaves us in a very bad state, possibly enough to be an S-risk or X-risk. (I currently see S-risk as more probable than X-risk for ASI if we only had human-level alignment to others.)