PhD student at MIT (ProbComp / CoCoSci), working on probabilistic programming for agent understanding and value alignment.
xuan
While I’ve focused on death here, I think this is actually much more general: there are a lot of irreversible decisions that people make (and that artificial agents might make) between potentially incommensurable choices. Here’s a nice example from Elizabeth Anderson’s “Value in Ethics & Economics” (Ch. 3, p. 57) on the question of how one should live one’s life, to which I think irreversibility applies:
Similar incommensurability applies, I think, to what kind of society we collectively want to live in, given that path dependency makes many choices irreversible.
Interesting argument! I think it goes through—but only under certain ecological / environmental assumptions:
That decisions / trades between goods are reversible.
That there are multiple opportunities to make such trades / decisions in the environment.
But this isn’t always the case! Consider:
Both John and David prefer living over dying.
Hence, John would not trade (John Alive, David Dead) for (John Dead, David Alive), and vice versa for David.
This is already a case of weakly incomplete preferences which, while technically reducible to a complete order over “indifference sets”, doesn’t seem well described by a utility function! In particular, it seems really important to represent the fact that neither person would trade their life for the other’s life, even though both (John Alive, David Dead) and (John Dead, David Alive) lie in the same “indifference / incommensurability set”.
(I think it’s better to call it an “incommensurability set”: just because two elements in a lattice share a least upper bound, it doesn’t mean they are themselves comparable).
Now let’s try and make the preferences strongly incomplete:
John prefers living freely over imprisonment, and imprisonment to dying.
Even if David was dead, he would prefer that John be alive over John being imprisoned.
Apart from the fact that you can’t reverse death (at least with current technology), this is similar to the pizza scenario: The system as a whole prefers:
(John Free, David Alive) > (John Free, David Dead) > (John Imprisoned, David Dead) > Both Dead
(John Free, David Alive) > (John Imprisoned, David Alive) > (John Dead, David Alive) > Both Dead
No preferences between options of the form (X, David Dead) and (John Dead, Y).
If John and David could contract to go from (John Imprisoned, David Dead) to (John Dead, David Alive) and then to (John Alive, David Dead) when those trades are offered, that would result in an improvement in achieving preferred outcomes on average. But of course, they can’t because death is irreversible!
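A minimal Python sketch of the preference structure above, assuming outcomes are encoded as (John’s state, David’s state) tuples and that only the two stated chains, plus transitivity, define the order. The names `CHAINS`, `incomparable` and `accepts_trade` are illustrative and not taken from any library.

```python
from itertools import combinations

# Outcomes are (John's state, David's state), copied from the two chains above.
CHAINS = [
    [("Free", "Alive"), ("Free", "Dead"), ("Imprisoned", "Dead"), ("Dead", "Dead")],
    [("Free", "Alive"), ("Imprisoned", "Alive"), ("Dead", "Alive"), ("Dead", "Dead")],
]

def strict_order(chains):
    """Return all (better, worse) pairs implied by the chains and transitivity."""
    better = set()
    for chain in chains:
        # Earlier entries in a chain are strictly preferred to later ones.
        for i, j in combinations(range(len(chain)), 2):
            better.add((chain[i], chain[j]))
    # Transitive closure across chains (a > b and b > c implies a > c).
    changed = True
    while changed:
        changed = False
        for a, b in list(better):
            for c, d in list(better):
                if b == c and (a, d) not in better:
                    better.add((a, d))
                    changed = True
    return better

BETTER = strict_order(CHAINS)

def incomparable(a, b):
    """True when neither outcome is strictly preferred to the other."""
    return a != b and (a, b) not in BETTER and (b, a) not in BETTER

def accepts_trade(current, offered):
    """The pair only accepts a trade to a strictly preferred outcome."""
    return (offered, current) in BETTER

# A trade between incomparable outcomes is refused in both directions.
print(incomparable(("Imprisoned", "Dead"), ("Dead", "Alive")))
print(accepts_trade(("Imprisoned", "Dead"), ("Dead", "Alive")))
print(accepts_trade(("Dead", "Alive"), ("Imprisoned", "Dead")))
```

Under this encoding, options of the form (X, David Dead) and (John Dead, Y) are incomparable rather than indifferent, which is exactly what a single utility function over outcomes cannot express.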
Not sure if this is the same as the awards contest entry, but EJT also made this earlier post (“There are no coherence theorems”) arguing that certain Dutch Book / money pump arguments against incompleteness fail!
Very interesting work! This is only a half-formed thought, but the diagrams you’ve created very much remind me of similar diagrams used to display learned “topics” in classic topic models like Latent Dirichlet Allocation (Figure 8 from the paper is below):
I think there’s possibly something to be gained by viewing what the MLPs and attention heads are learning as something like “topic models”—and it may be the case that some of the methods developed for evaluating topic interpretability and consistency will be valuable here. A couple of references:
Reading Tea Leaves: How Humans Interpret Topic Models (Chang et al. 2009)
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality (Lau, Newman & Baldwin, 2014)
Great to know, and good to hear!
Regarding causal scrubbing in particular, it seems to me that there’s a closely related line of research by Geiger, Icard and Potts that it doesn’t seem like TAISIC is engaging with deeply? I haven’t looked too closely, but it may be another example of duplicated effort / rediscovery:
The importance of interventions
Over a series of recent papers (Geiger et al. 2020, Geiger et al. 2021, Geiger et al. 2022, Wu et al. 2022a, Wu et al. 2022b), we have argued that the theory of causal abstraction (Chalupka et al. 2016, Rubinstein et al. 2017, Beckers and Halpern 2019, Beckers et al. 2019) provides a powerful toolkit for achieving the desired kinds of explanation in AI. In causal abstraction, we assess whether a particular high-level (possibly symbolic) model H is a faithful proxy for a lower-level (in our setting, usually neural) model N in the sense that the causal effects of components in H summarize the causal effects of components of N. In this scenario, N is the AI model that has been deployed to solve a particular task, and H is one’s probably partial, high-level characterization of how the task domain works (or should work). Where this relationship between N and H holds, we say that H is a causal abstraction of N. This means that we can use H to directly engage with high-level questions of robustness, fairness, and safety in deploying N for real-world tasks.
Source: https://ai.stanford.edu/blog/causal-abstraction/
Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn’t engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas. While I don’t work on interpretability per se, I see similar things happening with value learning / inverse reinforcement learning approaches to alignment.
Fascinating evidence!
I suspect this may be because RLHF elicits a singular scale of “goodness” judgements from humans, instead of a plurality of “goodness-of-a-kind” judgements. One way to interpret language models is as *mixtures* of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.
On this interpretation, what RL from human feedback does is shift/concentrate the distribution over conversational goals into a smaller range: the range of goals consistent with human feedback so far. And if humans are asked to give only a singular “goodness” rating, the distribution will shift towards only goals that do well on those ratings, perhaps dramatically so! We lose goal diversity, which means less gibberish, but also less of the plurality of realistic human goals.
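One way to write down the mixture interpretation, as a sketch only. The symbols are assumptions introduced here: $g$ is a latent conversational goal, $w_{1:T}$ the generated words, $r(g)$ a scalar rating of how well goal $g$ does under human feedback, and $\beta > 0$ a temperature. The second line borrows the standard form of the optimum of KL-regularized reward maximization and applies it at the level of goals purely for illustration.

```latex
p(w_{1:T}) \;=\; \sum_{g} p(g) \prod_{t=1}^{T} p(w_t \mid w_{<t},\, g)

p_{\text{RLHF}}(g) \;\propto\; p(g)\, \exp\!\big( r(g) / \beta \big)
```

With a single scalar $r$, small $\beta$ concentrates $p_{\text{RLHF}}(g)$ on the few goals with the highest rating, which is the loss of goal diversity described above.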
If the above is true, one corollary is that we should expect to see less mode collapse if one finetunes a language model on ratings elicited using a diversity of instructions (e.g. is this completion interesting? helpful? accurate?), and perhaps use some kind of imitation-learning inspired objective to mimic that distribution, rather than PPO (which is meant to only optimize for a singular reward function instead of a distribution over reward functions).
Apologies for the belated reply.
Yes, the summary you gave above checks out with what I took away from your post. I think it sounds good on a high level, but still too vague / high-level for me to say much in more detail. Values/ethics are definitely a system (e.g., one might think that morality was evolved by humans for the purposes of co-operation), but at the end of the day you’re going to have to make some concrete hypothesis about what that system is in order to make progress. Contractualism is one such concrete hypothesis, and folding ethics under the broader scope of normative reasoning is another way to understand the underlying logic of ethical reasoning. Moral naturalism is another way of going “beyond human values”, because it argues that statements about ethics can be reduced to statements about the natural world.
Hopefully this is helpful food for thought!
Hmm, I’m not sure I fully understand the concept of “X statements” you’re trying to introduce, though it does feel similar in some ways to contractualist reasoning. Since the concept is still pretty vague to me, I don’t feel like I can say much about it, beyond mentioning several ideas / concepts that might be related:
- Immanent critique (a way of pointing out the contradictions in existing systems / rules)
- Reasons for action (especially justificatory reasons)
- Moral naturalism (the meta-ethical position that moral statements are statements about the natural world)
Because the rules are meant for humans, with our habits and morals and limitations, and our explicit understanding of them only works because they operate in an ecosystem full of other humans. I think our rules/norms would fail to work if we tried to port them to a society of octopuses, even if those octopuses were to observe humans to try to improve their understanding of the object-level impact of the rules.
I think there’s something to this, but I think perhaps it only applies strongly if and when most of the economy is run by or delegated to AI services? My intuition is that for the near-to-medium term, AI systems will mostly be used to aid / augment humans in existing tasks and services (e.g. the list in the section on Designing roles and norms), for which we can either use existing laws and norms, or extensions of them. If we are successful in applying that alignment approach in the near-to-medium term, as well as in addressing the associated governance problems, then it seems to me that we can much more carefully control the transition to a mostly-automated economy as well, giving us leeway to gradually adjust our norms and laws.
No doubt, that’s a big “if”. If the transition to a mostly/fully-automated economy is sharper than laid out above, then I think your concerns about norm/contract learning are very relevant (but also that the preference-based alternative is more difficult still). And if we did end up with a single actor like OpenAI building transformative AI before everyone else, my recommendation would still be to adopt something like the pluralistic approach outlined here, perhaps by gradually introducing AI systems into well-understood and well-governed social and institutional roles, rather than initiating a sharp shift to a fully-automated economy.
While listening to the latest Inside View podcast, it occurred to me that this perspective on AI safety has some natural advantages when translating into regulation that present governments might be able to implement to prepare for the future. If AI governance people aren’t already thinking about this, maybe bother some / convince people in this comment section to bother some?
Yes, it seems like a number of AI policy people at least noticed the tweet I made about this talk! If you have suggestions for who in particular I should get the attention of, do let me know.
But here I would expect people to reasonably disagree on whether an AI system or community of systems has made a good decision, and therefore it seems harder to ever fully trust machines to make decisions at this level.
I hope the above is at least partially addressed by the last paragraph of the section on Reverse Engineering Roles and Norms! I agree with the worry, and to address it I think we could design systems that mostly just propose revisions or extrapolations to our current rules, or highlight inconsistencies among them (e.g. conflicting laws), thereby aiding a collective-intelligence-like democratic process of updating our rules and norms (of the form described in the Collective Governance section), where AI systems facilitate but do not enact normative change.
Note that if AI systems represent uncertainty about the “correct” norms, this will often lead them to make queries to humans about how to extend/update the norms (a la active learning), instead of immediately acting under their best estimate of the extended norms. This could be further augmented by a meta-norm of (generally) requiring consent / approval from the relevant human decision-making body before revising or acting under new rules.
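A toy Python sketch of the query-instead-of-act behaviour described above. Everything here is hypothetical: the two candidate norms, their posterior weights, the action names and the 0.9 confidence threshold are made up for illustration, and a real system would infer the posterior rather than hard-code it.

```python
from typing import Callable, Dict

# A candidate norm maps an action to True (permitted) or False (forbidden).
Norm = Callable[[str], bool]

# Two hypothetical interpretations of the same rule set.
FORBIDDEN_LITERAL = {"delete_records"}
FORBIDDEN_EXTENDED = {"delete_records", "share_user_data"}

NORMS: Dict[str, Norm] = {
    "literal": lambda action: action not in FORBIDDEN_LITERAL,
    "extended": lambda action: action not in FORBIDDEN_EXTENDED,
}

# Assumed posterior over which interpretation is the "correct" norm.
POSTERIOR = {"literal": 0.6, "extended": 0.4}

def decide(action: str, posterior: Dict[str, float], norms: Dict[str, Norm],
           threshold: float = 0.9) -> str:
    """Act or refrain only when enough posterior mass agrees; otherwise ask."""
    p_permitted = sum(p for name, p in posterior.items() if norms[name](action))
    if p_permitted >= threshold:
        return "act"
    if 1.0 - p_permitted >= threshold:
        return "refrain"
    return "query_human"

for action in ["send_reminder", "share_user_data", "delete_records"]:
    print(action, "->", decide(action, POSTERIOR, NORMS))
```

The consent meta-norm could be layered on top by routing every `"query_human"` result, and any proposed change to `NORMS`, through the relevant human decision-making body before anything is enacted.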
All of this is to say, it does feel somewhat unavoidable to me to advance some kind of claim about the precise contents of a superior moral framework for what systems ought to do, beyond just matching what people do (in Russell’s case) or what society does (in this post’s case).
I’m not suggesting that AI systems should simply do what society does! Rather, the point of the contractualist framing is that AI systems should be aligned (in the limit) to what society would agree to after rational / mutually justifiable collective deliberation.
Current democratic systems approximate this ideal to a very rough degree, and I guess I hold out hope that under the right kinds of epistemic and social conditions (freedom of expression, equality of interlocutors, non-deluded thinking), the kind of “moral progress” we instinctively view as desirable will emerge from that form of collective deliberation. So my hope is that rather than specify in great degree what the contents of “superior moral theory” might look like, all we need to align AI systems with is the underlying meta-ethical framework that enables moral change. See Anderson on How Common Sense Can Be Self-Critical for a good discussion of what I think this meta-ethical framework looks like.
Hmm, I’m confused: I don’t think I said very much about inner alignment, and I hope to have implied that inner alignment is still important! The talk is primarily a critique of existing approaches to outer alignment (e.g. why human preferences alone shouldn’t be the alignment target) and is a critique of inner alignment work only insofar as it assumes that defining the right training objective / base objective is not a crucial problem as well.
Maybe a more refined version of the disagreement is about how crucial inner alignment is, vs. defining the right target for outer alignment? I happen to think the latter is more crucial to work on, and perhaps that comes through somewhat in the talk (though it’s not a claim I wanted to strongly defend), whereas you seem to think inner alignment / preventing deceptive alignment is more crucial. Or perhaps both of them are crucial / necessary, so the question becomes where and how to prioritize resources, and you would prioritize inner alignment?
FWIW, I’m less concerned about inner alignment because:
I’m more optimistic about model-based planning approaches that actually optimize for the desired objective in the limit of large compute (so methods more like neurally-guided MCTS, a.k.a. AlphaGo, and less like offline reinforcement learning)
I’m more optimistic about methods for directly learning human interpretable, modular, (neuro)symbolic world models that we can understand, verify, and edit, and that are still highly capable. This reduces the need for approaches like Eliciting Latent Knowledge, and avoids a number of pathways toward inner misalignment.
I’m aware that these are minority views in the alignment community—I work a lot more on neurosymbolic and probabilistic programming methods, and think they have a clear path to scaling and providing economic value, which probably explains the difference.
Agreed that interpreting the law is hard, and the “literal” interpretation is not enough! Hence the need to represent normative uncertainty (e.g. a distribution over multiple formal interpretations of a natural language statement + having uncertainty over what terms in the contract are missing), which I see the section on “Inferring roles and norms” as addressing in ways that go beyond existing “reward modeling” approaches.
Let’s call the above “wilful compliance”, and the fully-fledged reverse engineering approach “enlightened compliance”. It seems like where we might disagree is how far “wilful compliance” alone will take us. My intuition is that essentially all uses of AI will have role-specific / use-specific restrictions on power-seeking associated with them, and these restrictions can be learned (from e.g. human behavior and normative judgements, incl. universalization reasoning) as implied terms in the contracts that govern those uses. This would avoid the computational complexity of literally learning everyone’s preferences / values, and instead leverage the simpler and more politically feasible mechanisms that humans use to cooperate with each other and govern the commons.
I can link to a few papers later that make me more optimistic about something like the approach above!
What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment
On the contrary, I think there exist large, complex, symbolic models of the world that are far more interpretable and useful than learned neural models, even if too complex for any single individual to understand, e.g.:
- The Unity game engine (a configurable model of the physical world)
- Pixar’s RenderMan renderer (a model of optics and image formation)
- The GLEAMviz epidemic simulator (a model of socio-biological disease spread at the civilizational scale)
Humans are capable of designing and building these models, and learning how to build/write them as they improve their understanding of the world. The difficult part is how we can recapitulate that ability: program synthesis is only in its infancy in its ability to do so, but IMO contemporary end-to-end deep learning methods seem unlikely to deliver here if we want both interpretability and usefulness.
Adding some thoughts as someone who works on probabilistic programming, and has colleagues who work on neurosymbolic approaches to program synthesis:
I think a lot of Bayes net structure learning / program synthesis approaches (Bayesian or otherwise) have the issue of uninformative variable names, but I do think it’s possible to distinguish between structural interpretability and naming interpretability, as others have noted.
In practice, most neural or Bayesian program synthesis applications I’m aware of exhibit something like structural interpretability, because the hypothesis space they live in is designed by modelers to have human-interpretable semantic structure. Two good examples of this are the prior over programs that generate handwritten characters in Lake et al (2015), and the PCFG prior over Gaussian Process covariance kernels in Saad et al (2019). See e.g. Figure 6 on how you perform analysis on programs generated by this prior, to determine whether a particular timeseries is likely to be periodic, has a linear trend, has a changepoint, etc.
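To make “structural interpretability” concrete, here is a toy Python sketch of a PCFG-style prior over kernel expressions and a structural query on sampled programs. It is not the grammar or inference procedure from Saad et al (2019): the base kernel names, the production probabilities and the depth limit are assumptions chosen for illustration, and a real system would run this kind of query on posterior samples given data rather than on prior samples.

```python
import random

# Hypothetical base kernels; each name carries human-readable meaning.
BASE_KERNELS = ["Linear", "Periodic", "SquaredExponential", "WhiteNoise"]

def sample_kernel(rng, p_base=0.6, depth=0, max_depth=4):
    """Sample an expression from a toy PCFG: K -> base | (K + K) | (K * K)."""
    if depth >= max_depth or rng.random() < p_base:
        return rng.choice(BASE_KERNELS)
    op = rng.choice(["+", "*"])
    left = sample_kernel(rng, p_base, depth + 1, max_depth)
    right = sample_kernel(rng, p_base, depth + 1, max_depth)
    return (op, left, right)

def contains(expr, base):
    """Structural query: does the expression tree mention a given base kernel?"""
    if isinstance(expr, str):
        return expr == base
    _, left, right = expr
    return contains(left, base) or contains(right, base)

def prob_of_feature(samples, base):
    """Fraction of sampled programs whose structure contains `base`."""
    return sum(contains(s, base) for s in samples) / len(samples)

rng = random.Random(0)
samples = [sample_kernel(rng) for _ in range(1000)]
print("fraction with a periodic component:", prob_of_feature(samples, "Periodic"))
```

Because the hypothesis space is built from named, meaningful primitives, a question like “is this time series likely periodic?” reduces to a syntactic check over sampled programs, regardless of how the programs’ internal variables are named.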
Regarding uninformative variable names, there’s ongoing work on using natural language to guide program synthesis, so as to come up with more language-like conceptual abstractions (e.g. Wong et al 2021). I wouldn’t be surprised if these approaches could also be extended to come up with informative variable and function names / comments. A related line of work is that people are starting to use LLMs to deobfuscate code (e.g. Lachaux et al 2021), and I expect the same techniques will work for synthesized code.
For these reasons, I’m more optimistic about the interpretability prospects of learning approaches that generate models or code that look like traditional symbolic programs, relative to end-to-end deep learning approaches. (Note that neural networks are also “symbolic programs”, just written with a more restricted set of [differentiable] primitives, and typically staying within a set of widely used program structures [i.e. neural architectures]).
The more difficult question IMO is whether this interpretability comes at the cost of capabilities. I think this is possibly true in some domains (e.g. learning low-level visual patterns and cues), but not others (e.g. learning the compositional structure of furniture-like objects).
I haven’t seen compelling (to me) examples of people going successfully from psychology to algorithms without stopping to consider anything whatsoever about how the brain is constructed.
Some recent examples, off the top of my head!
One reason it’s tricky to make sense of psychology data on its own, I think, is the interplay between (1) learning algorithms, (2) learned content (a.k.a. “trained models”), (3) innate hardwired behaviors (mainly in the brainstem & hypothalamus). What you especially want for AGI is to learn about #1, but experiments on adults are dominated by #2, and experiments on infants are dominated by #3, I think.
I guess this depends on how much you think we can make progress towards AGI by learning what’s innate / hardwired / learned at an early age in humans and building that into AI systems, vs. taking more of a “learn everything” approach! I personally think there may still be a lot of interesting human-like thinking and problem solving strategies that we haven’t figured out how to implement as algorithms yet (e.g. how humans learn to program, and edit + modify programs and libraries to make them better over time), and that adult and child studies would be useful for characterizing what we might even be aiming for, even if ultimately the solution is to use some kind of generic learning algorithm to reproduce it. I also think there’s a fruitful in-between of (1) and (3), which is to ask, “What are the inductive biases that guide human learning?”, which I think you can make a lot of headway on without getting to the neural level.
This was a great read! I wonder how much you’re committed to “brain-inspired” vs “mind-inspired” AGI, given that the approach to “understanding the human brain” you outline seems to correspond to Marr’s computational and algorithmic levels of analysis, as opposed to the implementational level (see link for reference). In which case, some would argue, you don’t necessarily have to do too much neuroscience to reverse engineer human intelligence. A lot can be gleaned by doing classic psychological experiments to validate the functional roles of various aspects of human intelligence, before examining in more detail their algorithms and data structures (perhaps this time with the help of brain imaging, but also carefully designed experiments that elicit human problem solving heuristics, search strategies, and learning curves).
I ask because I think “brain-inspired” often gets immediately associated with neural networks, and not say, methods for fast and approximate Bayesian inference (MCMC, particle filters), which are less the AI zeitgeist nowadays, but still very much how cognitive scientists understand the human mind and its capabilities.
https://onlinelibrary.wiley.com/doi/full/10.1111/tops.12137
It seems to me that it’s not right to assume the probability of opportunities to trade is zero?
Suppose both John and David are alive on a desert island right now (but slowly dying), and there’s a chance that a rescue boat will arrive that will save only one of them, leaving the other to die. What would they contract to? Assuming no altruistic preferences, presumably neither would agree to only the other person being rescued.
It seems more likely here that bargaining will break down, and one of them will kill off the other, resulting in an arbitrary resolution of who ends up on the rescue boat, not a “rational” resolution.