[I’m going to assume that (a) our >= AGI is actually capable of doing STEM well (without falling into well-known traps like Goodharting or extrapolating out of distribution, which are well covered in the literature), and (b) it is already at least as well aligned as CEV would be if you fed all of the information about complex and fragile human values contained in GPT-5 into GPT-6 along with a prompt defining something like CEV, let it reason accurately for a long time while also debating with humans, finally had it write out a few petabytes of results (something >> the size of the human genome), and then used that as a basis for something along the lines of Constitutional AI, possibly via DPO. That is still not well enough aligned to trust such a system to operate freely in society, but probably good enough as an initial starting point for this purpose to converge from.]
I think this problem can be somewhat ameliorated if your expert is an expert in two (related) topics:
1. alignment research, and
2. modelling, understanding, and communicating with humans in a principled and non-manipulative way. Preferably humans who, by human standards, are pretty smart and know a lot about alignment research.
If there’s one thing LLMs are extremely good at, it’s modelling human behavior. We also have a huge amount of research and literature on what humans want: Economics, Ergonomics, Design, Anthropology, Data Science for Advertising, Psychology, Culinary Science, Medicine, Sociology, etc. (basically all of the soft sciences). A set of sufficiently smart Experts should be able to pick up the entire literature on all of that and run with it. The tricky part, IMO, is “given ways to make humans want X, which of them are and are not legitimate to apply, for different X?” (which varies with how bad and how tempting not-X is for humans). For more on this, see my sequence on AI, Alignment, and Ethics.
Once it knows what humans want, and what is and isn’t OK to do when manipulating them for their own good, then the question is how to build a software system that optimizes the parts of that we’re currently confident of while steering clear of the parts we’re not yet certain of (i.e. a value-learning, approximately-Bayesian version of AIXI). That’s a question in Bayesianism and software engineering sufficiently similar to “How do you construct a competent STEM researcher?” that we will clearly have to understand it already before we get to this point.
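To make the “optimize what we’re confident of, avoid what we’re not” idea concrete, here is a toy sketch in Python. It is not AIXI and not a proposal for a real system: the three hand-written candidate “value functions,” the Boltzmann-rational choice model, the posterior-mass cutoff, and the deferral threshold are all illustrative assumptions. The point is only the structure: maintain a Bayesian posterior over hypotheses about human values, and act only where the choice looks good under every hypothesis that still has significant posterior mass, deferring to the human otherwise.

```python
import math

# Hypothesis space: toy candidate "human value functions" over a few options.
# These are illustrative stand-ins; a real system would learn a far richer model.
hypotheses = {
    "h_comfort": {"tea": 1.0, "coffee": 0.4, "espresso": 0.1},
    "h_energy":  {"tea": 0.3, "coffee": 0.8, "espresso": 1.0},
    "h_ritual":  {"tea": 0.9, "coffee": 0.7, "espresso": 0.2},
}
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def update(posterior, choice, options, beta=3.0):
    """Bayesian update from one observed human choice, assuming a
    Boltzmann-rational human (softmax over utilities, inverse temperature beta)."""
    new = {}
    for h, p in posterior.items():
        u = hypotheses[h]
        z = sum(math.exp(beta * u[o]) for o in options)
        likelihood = math.exp(beta * u[choice]) / z
        new[h] = p * likelihood
    total = sum(new.values())
    return {h: p / total for h, p in new.items()}

def safe_action(posterior, options, threshold=0.5):
    """Act only where we're confident: choose the option with the best
    worst-case value across all hypotheses retaining nontrivial posterior
    mass; if even that worst case is below threshold, defer to the human."""
    live = [h for h, p in posterior.items() if p > 0.05]
    best, best_worst = None, float("-inf")
    for o in options:
        worst = min(hypotheses[h][o] for h in live)
        if worst > best_worst:
            best, best_worst = o, worst
    return best if best_worst >= threshold else "defer_to_human"

# Observe a few human choices, update beliefs, then decide whether to act.
posterior = prior
for choice in ["coffee", "coffee", "tea"]:
    posterior = update(posterior, choice, ["tea", "coffee", "espresso"])
print(safe_action(posterior, ["tea", "coffee", "espresso"]))
```

With these particular numbers the hypotheses still disagree enough that no option clears the confidence threshold, so the system defers rather than optimizing through its own uncertainty, which is the behavior the paragraph above is asking for.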
So I’m basically thinking we start off with AI-assisted Alignment while our experts aren’t that much smarter than our human researcher clients, and gradually transition over to Value Learning + some debate/amplification as the differential gets bigger and the problems you describe become harder to surmount.