I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don’t currently know of any good writeup. Linkpost for the first part is here; this linkpost is for the second part.
Compared to the first part, the second part has less material which has not been written up already, although it does do a better job tying it all into the bigger picture than any already-written source. I will link to relevant posts in the outline below.
Major pieces in part two:
Potentially allows models of worlds larger than the data structure representing the model, including models of worlds in which the model itself is embedded.
Can’t brute-force evaluate the whole model; must be a lazy data structure with efficient methods for inference
The Pointers Problem: the inputs to human values are latent variables in humans’ world models
This is IMO the single most important barrier to alignment
Other aspects of the “type signature of human values” problem (just a quick list of things which I’m not really the right person to talk about)
Abstraction (a.k.a. ontology identification)
Three roughly-equivalent models of natural abstraction
Summary (around 1:30:00 in video)
I ended up rushing a bit on the earlier parts, in order to go into detail on abstraction. That was optimal for the group I was presenting to at the time I presented, but probably not for most people reading this. Sorry.
Here’s the video:
Again, big thanks to Rob Miles for editing! (Note that the video had some issues—don’t worry, the part where the camera goes bonkers and adjusts the brightness up and down repeatedly does not go on for very long.) The video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.
This is great. One question it raises for me is: Why is there a common assumption in AI safety that values are a sort of existent (i.e., they exist) and latent (i.e., not directly observable) phenomenon? I don’t think those are unreasonable partial definitions of “values,” but they’re far from the only ones, and it’s not at all obvious that they pick out the values with which we want to align AI. Philosophers Iason Gabriel (2020) and Patrick Butlin (2021) have pointed out some of the many definitions of “values” that we could use for AI safety.
I understand that just picking an operationalization and sticking to it may be necessary for some technical research, but I worry that the gloss reifies these particular criteria, and may even reify semantic issues (e.g., which latent phenomena do we want to describe as “values”?; a sort of verbal dispute à la Chalmers) incorrectly as substantive issues (e.g., how do we align an AI with the true values?).
Some thoughts on this question that you mention briefly in the talk.
I think that evolution selects for functional decision theory (FDT). More specifically, it selects for the best policy over a lifetime, not the best action in a given situation. I don’t mean that we actually cognitively calculate FDT, but that there is evolutionary pressure to act as if we follow FDT.
Example: Revenge
By revenge I mean burning some of your utility just to get back at someone who hurt you.
Revenge is equivalent to the transparent Newcomb’s problem. You can see that Omega has predicted that you will two-box, i.e. the box that could have held lots of money is empty. What do you do? You can one-box anyway, counterfactually making this situation less likely but giving up the smaller reward too (take revenge), or you can accept that you can’t change the past, cut your losses, and just take the smaller reward (no revenge).
The way this is evolutionarily encoded in humans is not a tendency to think about counterfactual situations. Instead we get mad at Omega for withholding the larger reward, and we one-box out of spite, to make Omega’s prediction wrong, forcing a lose-lose outcome. But it is still FDT in practice.
Taking revenge can also be justified causally, if it’s about upholding your reputation so no one crosses you again. Humans definitely do this calculation too. But it seems like most humans have a revenge drive that is stronger than what CDT would recommend, which is why I think this example backs up my claim that evolution selects for FDT.
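Here’s a minimal sketch of that payoff comparison, just to make the policy-level argument concrete. The dollar amounts, the predictor accuracy `ACC`, and the assumption that Omega fills the big box exactly when it predicts the revenge disposition are illustrative choices of mine, not anything from the original problem statement:

```python
# Toy transparent-Newcomb / revenge payoff comparison (all numbers are illustrative).
SMALL = 1_000        # small box: always filled
BIG = 1_000_000      # big box: filled iff Omega predicts the "revenge" disposition
ACC = 0.99           # assumed accuracy of Omega's prediction

def expected_payoff(revenge: bool, acc: float = ACC) -> float:
    """Expected payoff of a fixed disposition against a predictor of accuracy `acc`.

    revenge=True  -> one-box even when the big box is visibly empty
    revenge=False -> take the sure small reward when the big box is empty
    When the big box is full, both dispositions take both boxes.
    """
    p_big_full = acc if revenge else (1 - acc)   # Omega fills the big box iff it predicts revenge
    payoff_if_full = BIG + SMALL                 # both boxes taken
    payoff_if_empty = 0 if revenge else SMALL    # revenge burns the small reward
    return p_big_full * payoff_if_full + (1 - p_big_full) * payoff_if_empty

print("revenge disposition:   ", expected_payoff(True))    # ~ 0.99 * 1,001,000 = 990,990
print("no-revenge disposition:", expected_payoff(False))   # ~ 0.01 * 1,001,000 + 0.99 * 1,000 = 11,000
```

At the decision node itself (big box empty), CDT says take the $1,000; but the disposition that burns the $1,000 almost never ends up at that node, which is what matters for the policy-level comparison that selection acts on.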
My claim is of a similar type as Caspar’s Doing what has worked well in the past leads to evidential decision theory. It’s a statement about the resulting policy, not about the reasoning steps of the agent. Caspar describes an agent that does the action that has worked well in the past. Evolution is a process that selects the policy that has worked well in the past, which should effectively give you some version of FDT-son-of-EDT.
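A toy sketch of the “selects the policy that has worked well in the past” part: a discrete replicator update where each policy’s share grows in proportion to its fitness, with fitness set to the assumed expected payoffs from the sketch above. Treating raw payoffs as fitness is of course a simplification:

```python
# Toy replicator dynamics: share of the "revenge" policy in a population whose
# fitness is taken to be the expected payoff against the predictor (from above).
def replicator(share_revenge: float, f_revenge: float, f_no: float, generations: int = 10) -> float:
    """Discrete replicator update: each type's share grows in proportion to its fitness."""
    p = share_revenge
    for _ in range(generations):
        mean_fitness = p * f_revenge + (1 - p) * f_no
        p = p * f_revenge / mean_fitness
    return p

# Starting from a 1% revenge minority (expected payoffs assume 99% predictor accuracy),
# the revenge policy takes over within a few generations.
print(replicator(0.01, f_revenge=990_990.0, f_no=11_000.0))   # ~ 1.0
```

That’s the sense in which the selected policy ends up looking like it one-boxes (takes revenge), even though no agent in the population is running an explicit decision theory.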
There are situations where FDT behaves differently depending on when and why it was created (e.g.). I think I could figure out how this would play out in the context of evolution, but it would take some more thinking. If you think I’m on the right track and can convince me that this is useful, I’ll give it a try.
This argument sounds roughly right to me, though I’m not sure FDT is exactly the right thing. If two organisms were functionally identical but had totally different genomes, then FDT would decide as though they’re one unit, whereas I think evolution would select for deciding as though only the genetically-identical organisms are a unit? I’m not entirely sure about that.
I agree that it’s not exactly FDT. I think I actually meant updateless decision theory (UDT), but I’m not sure, because I have some uncertainty about exactly what others mean by UDT.
I claim that mutations + natural selection (evolution) selects for agents that act according to the policy they would have wanted to pre-commit to at the time of their birth (last mutation).
Yes, there are some details around who I recognize as a copy of me. In classical FDT this would be anyone who is running the same program (whatever that means). In evolution this would be anyone who carries the same genes. Both of these concepts are complicated by the fact that “same program” and “same genes” are scalar (or something more complicated?) rather than Boolean.
Edit: I’m not sure I agree with what I just said. I believe something in this direction, but I want to think some more. For example, people with similar genes probably don’t cooperate because of decision theory (my decision to cooperate with you is correlated with your decision to cooperate with me), but because of shared goals (we both want to spread our shared genes).
Is there a transcript of this anywhere? Does this have new content on top of the links, or is it “merely” a presentation containing the content in the links?
No transcript yet. There’s only a few minor things which I would consider new content, though the links do not explain much about how it all connects to the bigger picture.
My immediate reaction[1] to:
...is “beware impossibility results from computability theory”. Especially with multiple agents.
(I haven’t yet watched said video, and am unlikely to.)
One thing I can’t help but think is:
So obviously, as you mention, the whole thing about taking the infinite limits etc. is meant to be a hypothetical stand-in for doing things at a large scale. And similarly, obviously we only have finite observations, so there's an idealization there too.
But this makes me think that perhaps some useful insights about the range of applicability for abstractions could be derived by thinking about convergence rates. E.g. if the information in an abstraction is spread among $n$ variables, then each “layer” (e.g. Markov blanket, resampling) could be expected to introduce noise on the scale of $n^{-1/2}$, so that seems to suggest that the abstraction is only valid up to a distance of around $\frac{1}{n^{-1/2}} = n^{1/2}$.
I’m not really sure how to translate this into practical use, because it seems like it would require some conversion factor between variable count and distance. I guess maybe one could translate it into a comparative rule, like “Getting $X$ times more observations of a system should allow you to understand it in a $\sqrt{X}$ times broader setting”, but this is probably a mixture of being too well-known, too abstract, or too wrong to be useful.
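Spelling out the arithmetic behind that (notation is mine: $k$ is the number of layers, $\sigma$'s are noise scales, the signal scale is normalized to $1$, and I'm implicitly treating the per-layer noise as adding up roughly linearly over $k$ layers rather than as $\sqrt{k}$):

$$\sigma_{\text{layer}} \sim n^{-1/2}, \qquad \sigma_{\text{total}}(k) \approx k \cdot \sigma_{\text{layer}} \sim k\, n^{-1/2}$$

$$\sigma_{\text{total}}(k) \sim 1 \;\Rightarrow\; k_{\max} \sim n^{1/2}, \qquad n \to X n \;\Rightarrow\; k_{\max} \to \sqrt{X}\, k_{\max}$$

which recovers both the $n^{1/2}$ range above and the $\sqrt{X}$ comparative rule.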
But regardless of whether I’ll come up with any uses for this, I’d be curious if you or anyone else has any ideas here.
Yeah, the general problem of figuring out approximations and bounds for finite versions of these theorems has been a major focus for me over the past couple weeks, and will likely continue to be a major focus over the next month. Useful insights have already come out of that, and I expect more will come.
Thought: at one point, you talk about taking a general (not necessarily causal, infinite, or anything like that) distribution and applying the resampling process to this, leading to capturing the redundant information.
But does that necessarily work? I’d think that if e.g. your distribution was a multivariate Gaussian with nondeterministic correlations, you’d get regression to the mean, such that the limit of the resampling process just makes you end up with the mean. But this means that there’s no information left in the limiting variable.
I think what goes wrong is that if you start resampling the multivariate Gaussian, you end up combining two effects: blurring it (which is what you want, to abstract out nonlocal stuff), and dissipating it to the mean (which is what you don’t want). As long as you haven’t removed all the variance in the dissipation, the blur will still capture the information you want. But as you take the limit, the variance goes to zero and that prevents it from carrying any information.
In the Gaussian case specifically, you can probably solve that by just continually rescaling as you take the limit to keep the variance high, but I don’t know if there is a solution for e.g. discrete variables.
Wait, no, it’s resampling, not regression. So you introduce noise along the way, which means that if you only have a finite set of imperfectly correlated variables, the mutual information should drop to zero.
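For what it’s worth, here’s a minimal numerical check of this in the simplest case I could think of: a bivariate standard Gaussian with correlation rho < 1, resampled Gibbs-style (redraw X given Y, then Y given X). The specific rho, sample size, and sweep schedule are arbitrary choices for illustration:

```python
import numpy as np

# Bivariate standard Gaussian with imperfect correlation rho, resampled Gibbs-style:
# alternately redraw X given Y and Y given X, and track correlation with the original X.
rng = np.random.default_rng(0)
rho, n_samples, n_sweeps = 0.9, 200_000, 20

x0 = rng.standard_normal(n_samples)
y = rho * x0 + np.sqrt(1 - rho**2) * rng.standard_normal(n_samples)

for sweep in range(1, n_sweeps + 1):
    # Conditionals of a standard bivariate Gaussian: X | Y=y ~ N(rho*y, 1 - rho^2), and symmetrically.
    x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n_samples)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n_samples)
    if sweep in (1, 5, 10, 20):
        print(f"sweep {sweep:2d}: corr(x, x0) = {np.corrcoef(x, x0)[0, 1]:.4f}")

# The correlation decays roughly like rho**(2 * sweep): with rho < 1, the mutual
# information with the original sample goes to zero in the limit.
```

Unless the correlation is deterministic (rho = 1), nothing about the starting point survives the limit, which matches the worry about all the information dissipating.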