Wei Dai
Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.
But we could have said the same thing of SBF, before the disaster happened.
Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human.
Could you explain your thinking behind this?
Dealing with moral uncertainty is just part of expected utility maximization.
It’s not, because some moral theories are not compatible with EU maximization, and of the ones that are, it’s still unclear how to handle uncertainty between them.
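To illustrate the second half of that: even among theories that do fit the EU framework, the naive “expected choiceworthiness” calculation (a sketch of the standard move, not something the parent comment proposed) is

$$\mathrm{EC}(a) \;=\; \sum_i P(T_i)\,u_i(a),$$

where each theory $T_i$’s utility function $u_i$ is only defined up to a positive affine transformation $u_i \mapsto \alpha_i u_i + \beta_i$. Without a principled way to fix the scales $\alpha_i$ (the intertheoretic comparison problem), rescaling one theory’s $u_i$ can flip which action maximizes $\mathrm{EC}$, so “just take the expectation” underdetermines the answer.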
the inductive bias doesn’t precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that’s exactly what you’d expect for any approximately Bayesian setup.
I can certainly understand that as you scale both architectures, they both make fewer mistakes on distribution. But do they also generalize out of training distribution more similarly? If so, why? Can you explain this more? (I’m not getting your point from just “approximately Bayesian setup”.)
They needed a giant image classification dataset which I don’t think even existed 5 years ago.
This is also confusing/concerning for me. Why would it be necessary or helpful to have such a large dataset to align the shape/texture bias with humans?
Do you know if it is happening naturally from increased scale, or only correlated with scale (people are intentionally trying to correct the “misalignment” between ML and humans of shape vs texture bias by changing aspects of the ML system like its training and architecture, and simultaneously increasing scale)? I somewhat suspect the latter due to the existence of a benchmark that the paper seems to target (“humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias”).
In either case, it seems kind of bad that it has taken a decade or two to get to this point from when adversarial examples were first noticed, and it’s unclear whether other adversarial examples or “misalignment” remain in the vision transformer. If the first transformative AIs don’t quite learn the right values due to having a different inductive bias from humans, it may not matter much that 10 years later the problem would be solved.
Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?
It’s confusing to me as well, perhaps because different people (or even the same person at different times) emphasize different things within the same approach, but here’s one post where someone said, “It is important that the overseer both knows which action the distilled AI wants to take as well as why it takes that action.”
Did SBF or Mao Zedong not have a pointer to the right values, or did they have the right pointer but make mistakes due to computational issues (i.e., would they have avoided causing the disasters that they did if they were smarter and/or had more time to think)? Both seem possible to me, so I’d like to understand how the QACI approach would solve (or rule out) both of these potential problems:
1. If many humans don’t have pointers to the right values, how do we make sure QACI gets its pointer from humans who do?
2. How do we make sure the AI will not make some catastrophic mistake while it’s not smart enough to fully understand the values we give it, while still being confident enough in its guesses of what to do in the short term to do useful things?
Moral uncertainty is an area in philosophy with ongoing research, and assuming that AI will handle it correctly by default seems unsafe, similar to assuming that AI will have the right decision theory by default.
I see that Tamsin Leake also pointed out 2 above as a potential problem, but I don’t see anything that looks like a potential solution in the QACI table of contents.
Katja Grace notes that image synthesis methods have no trouble generating photorealistic human faces.
They’re terrible at hands though (which has ruined many otherwise good images for me). That post used Stable Diffusion 1.5, but even the latest SD 3.0 (with versions 2.0, 2.1, XL, Stable Cascade in between) is still terrible at it.
Don’t really know how relevant this is to your point/question about fragility of human values, but thought I’d mention it since it seems plausibly as relevant as AIs being able to generate photorealistic human faces.
Adversarial examples suggest to me that by default ML systems don’t necessarily learn what we want them to learn:
They put too much emphasis on high-frequency features, suggesting a different inductive bias from humans.
They don’t handle contradictory evidence in a reasonable way, i.e., they give a confident answer when high-frequency features (pixel-level details) and low-frequency features (overall shape) point to different answers.
The evidence from adversarial training suggests to me that AT is merely patching symptoms (e.g., making the ML system de-emphasize certain specific features) and not fixing the underlying problem. At least this is my impression from watching this video on Adversarial Robustness, specifically the chapters on Adversarial Arms Race and Unforeseen Adversaries.
Aside from this, it’s also unclear how to apply AT to your original motivation:
A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn’t safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.
because in order to apply AT we need a model of what “attacks” the adversary is allowed to do (in this case the “attacker” is a superintelligence trying to optimize the universe, so we have to model it as being allowed to do anything?) and also ground-truth training labels.
For this purpose, I don’t think we can use the standard AT practice of assuming that any data point within a certain distance of a human-labeled instance, according to some metric, has the same label as that instance. Suppose we instead let the training process query humans directly for training labels (i.e., how good some situation is) on arbitrary data points: that’s slow/costly if the process isn’t very sample efficient (which modern ML isn’t), and also scary given that human implementations of human values may already have adversarial examples. (The “perceptual wormholes” work and other evidence suggest that humans also aren’t 100% adversarially robust.)
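To make that standard assumption concrete, here’s a minimal sketch of what I mean by the usual AT recipe (PGD-style adversarial training in PyTorch; `model`, `optimizer`, and the ε / step-size values are illustrative placeholders, not anything from the post I’m responding to). Note that the perturbed input simply inherits the clean input’s label `y`; that’s exactly the assumption that doesn’t seem available when the “attacker” is a superintelligence optimizing the universe:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Search for a worst-case perturbation of x within an L-infinity ball
    of radius eps, while assuming the original label y still applies."""
    x_adv = (x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One training step on adversarially perturbed inputs only."""
    model.eval()                      # freeze batch-norm stats during the attack
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)  # label y is assumed unchanged
    loss.backward()
    optimizer.step()
    return loss.item()
```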
My own thinking is that we probably need to go beyond adversarial training for this, along the lines of solving metaphilosophy and then using that solution to find/fix existing adversarial examples and correctly generalize human values out of distribution.
I’m confused about how heterogeneity in data quality interacts with scaling. Surely training an LM on scientific papers would give different results from training it on web spam, but data quality is not an input to the scaling law… This makes me wonder whether your proposed forecasting method might have some kind of blind spot in this regard, for example failing to take into account that AI labs have probably already fed all the scientific papers they can into their training processes. If future LMs train on additional data that have little to do with science, could that keep reducing overall cross-entropy loss (as scientific papers become a smaller fraction of the overall corpus) but fail to increase scientific ability?
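(For reference, the kind of scaling law I have in mind is the Chinchilla-style parametric fit

$$L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is the parameter count and $D$ the number of training tokens. Data enters only as a count; there is no term for what the tokens are about.)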
Thank you for detailing your thoughts. Some differences for me:
1. I’m also worried about unaligned AIs as competitors to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means; an unaligned AI could compete with us (and with others who have better values from our perspective) in the race to manipulate them.
2. I’m perhaps less optimistic than you about commitment races.
3. I have some credence on max good and max bad not being close to balanced, which additionally pushes me towards the “unaligned AI is bad” direction.
ETA: Here’s a more detailed argument for 1, which I don’t think I’ve written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence it in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
Why do you think these values are positive? I’ve been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018, that these values could well be negative. I’m very uncertain, but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.
If something is both a vanguard and limited, then it seemingly can’t stay a vanguard for long. I see a few different scenarios going forward:
1. We pause AI development while LLMs are still the vanguard.
2. The data limitation is overcome with something like IDA or Debate.
3. LLMs are overtaken by another AI technology, perhaps based on RL.
In terms of relative safety, it’s probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I’m not getting very optimistic about AI safety just yet.
The argument is that with 1970s tech the Soviet Union collapsed; however, with 2020 computer tech (not needing GenAI) it would not.
I note that China is still doing market economics, and nobody is trying (or even advocating, AFAIK) some very ambitious centrally planned economy using modern computers, so this seems like pure speculation? Has someone actually made a detailed argument about this, or at least gotten the agreement of some people with reasonable economics intuitions?
I’ve arguably lived under totalitarianism (depending on how you define it), and my parents definitely have, and they told me many stories about it. I think AGI increases the risk of totalitarianism, and I support a pause in part to have more time to figure out how to make the AI transition go well in that regard.
Even if someone made a discovery decades earlier than it otherwise would have been made, the long-term consequences of that may be small or unpredictable. If your goal is to “achieve high counterfactual impact in your own research” (presumably in a predictably positive direction), you could potentially do that in certain fields (e.g., AI safety) even if you only counterfactually advance the science by a few months or years. I’m a bit confused about why you’re asking people to think in the direction outlined in the OP.
Some of my considerations for college choice for my kid, which I suspect others may also want to think more about or discuss:
status/signaling benefits for the parents (This is probably a major reason many parents push their kids into elite schools. How much do you endorse it?)
sex ratio at the school and its effect on the local “dating culture”
political/ideological indoctrination by professors/peers
workload (having more/less time/energy to pursue one’s own interests)
I added this to my comment just before I saw your reply: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?
I mostly offer this in the spirit of “here’s the only way I can see to reconcile subjective anticipation with UDT at all”, not “here’s something which makes any sense mechanistically or which I can justify on intuitive grounds”.
Ah I see. I think this is incomplete even for that purpose, because “subjective anticipation” to me also includes “I currently see X, what should I expect to see in the future?” and not just “What should I expect to see, unconditionally?” (See the link earlier about UDASSA not dealing with subjective anticipation.)
ETA: Currently I’m basically thinking: use UDT for making decisions, use UDASSA for unconditional subjective anticipation, and remain confused about conditional subjective anticipation as well as about how UDT and UDASSA are disconnected from each other (i.e., the subjective anticipation from UDASSA not feeding into decision making). Would love to improve upon this, but your idea currently feels worse than this...
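(To spell out the UDASSA part: roughly, an observer-moment $x$ gets measure under the universal prior,

$$m(x) \;\propto\; \sum_{p\,:\,U(p)=x} 2^{-\ell(p)},$$

where $U$ is some fixed universal machine and $\ell(p)$ is the length of program $p$. This gives a measure over observer-moments, i.e., unconditional anticipation, but says nothing about which moment follows which, which is why conditional anticipation is still left open.)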
As you would expect, I strongly favor (1) over (2) over (3), with (3) being far, far worse for ‘eating your whole childhood’ reasons.
Is this actually true? China has (1) (affirmative action via “Express and objective (i.e., points and quotas)”) for its minorities and different regions, and from what I can tell the college admissions “eating your whole childhood” problem over there is way worse. Of course that could be despite (1), not because of it, but it does make me question whether (3) (“Implied and subjective (‘we look at the whole person’).”) is actually far worse than (1) on this score.
Intuitively this feels super weird and unjustified, but it does make the “prediction” that we’d find ourselves in a place with high marginal utility of money, as we currently do.
This is particularly weird because your indexical probability then depends on what kind of bet you’re offered. In other words, our marginal utility of money differs from our marginal utility of other things, so which one do you use to set your indexical probability? This seems like a non-starter to me… (ETA: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?)
By “acausal games” do you mean a generalization of acausal trade?
Yes, I didn’t want to just say “acausal trade” in case threats/war are also a big thing.
This was all kinda rambly but I think I can summarize it as “Isn’t it weird that ADT tells us that we should act as if we’ll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don’t have a story for why these things are related but it does seem like a suspicious coincidence.”
I’m not sure this is a valid interpretation of ADT. Can you say more about why you interpret ADT this way, maybe with an example? My own interpretation of how UDT deals with anthropics (and I’m assuming ADT is similar) is “Don’t think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over.”
This seems to “work”, but anthropics still feels mysterious, i.e., we want an explanation of “why are we who we are / where we’re at”, and “just don’t think about it” is an unsatisfying answer. UDASSA does give an explanation of that (but is also unsatisfying because it doesn’t deal with anticipations, and is also disconnected from decision theory).
I would say that under UDASSA, it’s perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).
Yikes, I’m not even comfortable maximizing my own CEV. One crux may be that I think a human’s values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn’t have trusted his future self.)
My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human’s reflection process which may be mistaken or selfish or skewed.
Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn’t RL environments for AI cause the same or perhaps a different set of irrationalities?
Also, how does RL fit into QACI? Can you point me to where this is discussed?