The Big Picture Of Alignment (Talk Part 2)
I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not at all polished, but it contains a lot of material for which I don't currently know of any good writeup. The linkpost for the first part is here; this linkpost is for the second part.
Compared to the first part, more of the second part's material has already been written up elsewhere, though the talk does a better job of tying it all into the bigger picture than any existing writeup. I will link to relevant posts in the outline below.
Major pieces in part two:

- Programs as a compressed representation for large (potentially infinite) probabilistic causal models with symmetry (see the first sketch after this list)
  - Potentially allows models of worlds larger than the data structure representing the model, including models of worlds in which the model itself is embedded
  - Can't brute-force evaluate the whole model; must be a lazy data structure with efficient methods for inference
- The Pointers Problem: the inputs to human values are latent variables in humans' world models (see the second sketch after this list)
  - This is IMO the single most important barrier to alignment
  - Other aspects of the "type signature of human values" problem (just a quick list of things which I'm not really the right person to talk about)
- Abstraction (a.k.a. ontology identification) (see the third sketch after this list)
  - Three roughly-equivalent models of natural abstraction
- Summary (around 1:30:00 in video)
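To make the first bullet a bit more concrete, here's a quick toy sketch (illustrative only, not the formalism from the talk): a few lines of Python stand in for an infinite causal chain by reusing the same update rule at every index (that's the symmetry), and evaluation is lazy, so a query only ever touches finitely many variables even though the model is infinite. The names `step` and `sample_x` are just made up for the example.

```python
import random
from functools import lru_cache

def step(x_prev, rng):
    """The same conditional P(X_t | X_{t-1}) at every index -- the symmetry
    which lets a few lines of code stand in for an infinite causal graph."""
    return x_prev + rng.gauss(0, 1)

@lru_cache(maxsize=None)  # memoize: each variable is computed at most once
def sample_x(t, seed=0):
    """Lazily sample X_t in the infinite chain X_0 -> X_1 -> X_2 -> ...
    Only ancestors of the queried variable ever get evaluated."""
    rng = random.Random(f"{seed}:{t}")  # deterministic per-variable noise
    if t == 0:
        return rng.gauss(0, 1)
    return step(sample_x(t - 1, seed), rng)

# A query about X_5 evaluates only X_0 through X_5; the rest of the
# (infinite) model never gets brute-forced into memory.
print(sample_x(5))
```

The point is just that the program is a much smaller object than the model it represents, which is what lets it describe worlds larger than itself (including worlds containing the program).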
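For the Pointers Problem bullet, here's a toy sketch of one narrow facet of the claim (again illustrative; `human_value` and `proxy_value` are made-up names): the thing the human actually cares about is a latent variable inside a world model, not an observable, so a value function defined over observables can come apart from the real thing.

```python
import random

def sample_world(rng):
    """Toy world model: one latent variable plus one noisy observation of it."""
    coffee_actually_good = rng.random() < 0.5             # latent variable
    looks_good = (coffee_actually_good if rng.random() < 0.9
                  else not coffee_actually_good)          # noisy observable
    return {"latent": coffee_actually_good, "observed": looks_good}

def human_value(world):
    # The input to the human's values is the *latent* -- a variable which
    # lives inside the world model, not in the sensory data.
    return 1.0 if world["latent"] else 0.0

def proxy_value(world):
    # A value function defined over observables instead.
    return 1.0 if world["observed"] else 0.0

rng = random.Random(0)
worlds = [sample_world(rng) for _ in range(10_000)]
disagreement = sum(human_value(w) != proxy_value(w) for w in worlds) / len(worlds)
print(f"proxy disagrees with the latent-based value in {disagreement:.0%} of worlds")
```

The hard part, which this sketch does not show, is figuring out what those latent variables even correspond to in the AI's own model of the world.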
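And for the abstraction section, here's a toy illustration of the "information relevant at a distance" idea (illustrative only; it doesn't spell out any of the three formulations from the talk): a far-away measurement of 10,000 low-level variables depends on them only through a one-number summary, so that summary is the natural abstraction of the micro-state.

```python
import random
import statistics

def low_level_state(rng, n=10_000, temp=5.0):
    """Micro-state: individual 'molecule' velocities."""
    return [rng.gauss(temp, 1.0) for _ in range(n)]

def abstract_summary(state):
    """Candidate abstraction: one number summarizing 10,000 numbers."""
    return statistics.fmean(state)

def faraway_reading(state, rng):
    """A distant thermometer interacts weakly with every molecule, so it
    depends on the micro-state only through the mean, plus noise which
    drowns out any remaining micro-level detail."""
    return abstract_summary(state) + rng.gauss(0, 0.3)

rng = random.Random(0)
state = low_level_state(rng)
readings = [faraway_reading(state, rng) for _ in range(1_000)]
print(abstract_summary(state), statistics.fmean(readings))
# The two numbers agree closely: everything the far-away reading "knows"
# about the 10,000 low-level variables is carried by the one-number summary.
```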
I ended up rushing a bit through the earlier parts in order to go into detail on abstraction. That was optimal for the group I originally presented to, but probably not for most people reading this. Sorry.
Here’s the video:
Again, big thanks to Rob Miles for editing! (Note that the video had some issues—don’t worry, the part where the camera goes bonkers and adjusts the brightness up and down repeatedly does not go on for very long.) The video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.