I agree that ultimately AI systems will have an understanding of the world built up using deliberate cognitive steps (in addition to plenty of other complications), not all of which are imitated from humans.
The ELK document mostly focuses on the special case of ontology identification, i.e. ELK for a directly learned world model. The rationale is: (i) it seems like the simplest case, (ii) we don’t know how to solve it, (iii) it’s generally good to start with the simplest case you can’t solve, (iv) it looks like a really important special case, which may appear as a building block or just directly require the same techniques as the general problem.
We briefly discuss the more general situation of learned optimization here. I don’t think that discussion is particularly convincing; it just describes how we’re thinking about it.
On the bright side, it seems like our current approach to ontology identification (based on anomaly detection) would have a very good chance of generalizing to other cases of ELK. But that’s not clear, and it puts more strain on the notion of explanation we are using.
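(To give a concrete, if toy, picture of what “anomaly detection” on a model’s internals can mean: the sketch below fits a simple reference distribution to hidden activations collected on trusted inputs and flags inputs whose activations look unusual, with the hope being that things like sensor tampering show up as such anomalies. This is a generic illustration under my own assumptions, not ARC’s actual technique; the function names, shapes, and threshold are made up.)

```python
# Illustrative sketch only -- NOT the actual approach, just the generic shape of
# "anomaly detection on internals": fit a Gaussian to activations on trusted inputs,
# then score new inputs by Mahalanobis distance from that reference distribution.
import numpy as np

def fit_reference(activations: np.ndarray):
    """Fit a Gaussian to hidden activations collected on trusted (non-tampered) inputs."""
    mean = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False) + 1e-6 * np.eye(activations.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(activation: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Squared Mahalanobis distance of a new activation vector from the reference."""
    diff = activation - mean
    return float(diff @ cov_inv @ diff)

# Hypothetical usage with stand-in data in place of real hidden activations.
rng = np.random.default_rng(0)
trusted = rng.normal(size=(1000, 16))          # activations from trusted episodes
mean, cov_inv = fit_reference(trusted)
new = rng.normal(size=16)                      # activations from a new episode
flagged = anomaly_score(new, mean, cov_inv) > 40.0  # threshold chosen for illustration
```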
At the end of the day, I strongly suspect the key question is whether we can make something like a “probabilistic heuristic argument” about the reasoning learned by ML systems, explaining why they predict (e.g.) images of human faces. We need arguments detailed enough to distinguish between sensor tampering (or lies) and real anticipated faces: they may be able to treat some claims as unexplained empirical black boxes, but they can’t have black boxes so broad that they would include both sensor tampering and real faces.
If such arguments exist and we can find them, then I suspect we’ve dealt with the hard part of alignment. If they don’t exist, or we can’t find them, then I don’t think we really have a plausible angle of attack on ELK. I think a very realistic outcome is that it’s a messy empirical question; in that case our contribution could be viewed as clarifying an important goal for “interpretability,” but success will ultimately come down to a bunch of empirical research.
Thanks!
Thinking about it more, I think my take (cf. Section 4.1) is kinda like “Who knows, maybe ontology-identification will turn out to be super easy. But even if it is, there’s this other different problem, and I want to start by focusing on that”.
And then maybe what you’re saying is kinda like “We definitely want to solve ontology-identification, even if it doesn’t turn out to be super easy, and I want to start by focusing on that”.
If that’s a fair summary, then godspeed. :)
(I’m not personally too interested in learned optimization because I’m thinking about something closer to actor-critic model-based RL, which sorta has “optimization” but it’s not really “learned”.)