I’m glad you liked the post, thanks for the comment. :)
I think deep learning might be practically hopeless for the purpose of building controllable AIs; where by controllable I mean here something like “can even be pointed at some specific objective, let alone a ‘good’ objective”. Consequently, I kinda wish more alignment researchers would at least set a 2h timer and try really hard (for those 2h) to come up—privately—with some approach to building AIs that at least passes the bar of basic, minimal engineering sanity. (Like “design the system to even have an explicit control mechanism”, and “make it possible to change the objective/destination without needing to understand or change the engine”.)
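To make that last parenthetical concrete, here is a toy sketch of what "objective separated from engine" would even mean; everything in it (the plan function, the objective callables) is hypothetical and purely illustrative, not a proposal for an actual architecture:

```python
# Toy sketch of "objective separated from engine": the search/planning
# machinery never needs to change when we swap the objective.
from typing import Callable, Iterable

State = tuple  # hypothetical: whatever encodes a predicted world state

def plan(candidate_actions: Iterable[str],
         simulate: Callable[[str], State],
         objective: Callable[[State], float]) -> str:
    """Pick the action whose simulated outcome scores best under `objective`.

    `simulate` is the "engine" (world model / dynamics); `objective` is an
    explicit control input we can replace at runtime without touching it.
    """
    return max(candidate_actions, key=lambda a: objective(simulate(a)))

# Swapping the destination is just passing a different function:
#   plan(actions, simulate, objective=minimize_energy)
#   plan(actions, simulate, objective=reach_goal_region)
```

A trained end-to-end network gives you nothing like this: there is no slot where the objective lives, which is the basic engineering-sanity failure I mean.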
I don’t have strong takes on what training procedures and architectures that actually work outside the deep learning paradigm would look like, but naively it feels like any system with complex objectives will still involve high-dimensional interfaces for interacting with those objectives, interfaces we won’t fully understand.
Within the deep learning paradigm, GPTs seem like the archetype for something like this, as you said: you can train a powerful world model that doesn’t have an objective in any relevant sense and then apply whatever conditional you want (like a simulacrum with a specific objective). But because you’re interfacing with a very high-dimensional space to impart high-dimensional desires, the non-formalism seems more like a feature than a bug.
The closest (that I’m aware of) we can get to doing anything like “load a new objective at runtime” is by engineering prompts for LLMs; but that provides a rather underwhelming level of control.
I think that, done right, it actually provides us a decent amount of control; but it’s often pretty unintuitive how to exert that control, especially with any real precision, because you need a really strong feel for the prior the model has learned and for what kinds of posteriors a given conditional would produce.
(It’s a slightly different problem then, though, because you’re not swapping in a new objective for the model; rather, you’re swapping between simulacra with different goals.)
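To spell out the prior/posterior framing I’m leaning on here (my notation, nothing formal from the post):

```latex
% Hedged sketch of the framing above, in my own notation (not from the post).
% The pretrained model defines a prior over token sequences x:  p_\theta(x).
% A prompt c leaves \theta untouched and just selects the conditional
\[
  p_\theta(x \mid c) \;=\; \frac{p_\theta(c, x)}{p_\theta(c)} .
\]
% "Exerting control" = choosing c so that p_\theta(x | c) concentrates on the
% simulacrum/behaviour you want; the engine (\theta) never changes, which is
% why this is swapping simulacra rather than swapping the model's objective.
```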
What do you think; does that seem worth thinking about?
I think there are a few separate ideas here worth mentioning. I disagree that deep learning is practically hopeless for building training procedures that actually result in some goal we want; I think it’s really hard, but there are plausible paths to success. Related to modularity, for example, there’s some work currently being done on modularizing neural networks conceptually from the ground up, converting them into forms with modular computational components (unlike current neural networks, where it’s hard to call a neuron or a weight the smallest unit of optimization). The holy grail of this would plausibly involve a modularized component for “objective”, if that’s present in the model at all.
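Purely as an illustration of the shape I have in mind (not a description of how that modularity work actually factors networks; the module names here are made up), something like:

```python
# Illustrative sketch only: a network factored so that "objective" is an
# explicit, swappable module rather than something diffused across weights.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """The 'engine': learns representations/dynamics, has no goal of its own."""
    def __init__(self, obs_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

class ObjectiveHead(nn.Module):
    """The hoped-for modularized 'objective': scores states, can be swapped."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.score(state)

# Swapping the objective would mean replacing one small module while leaving
# the (frozen) world model alone:
world_model = WorldModel(obs_dim=16, hidden_dim=64)
objective_a = ObjectiveHead(hidden_dim=64)
objective_b = ObjectiveHead(hidden_dim=64)
value = objective_b(world_model(torch.randn(1, 16)))
```

The open question, of course, is whether anything playing the role of ObjectiveHead exists inside models trained the way we currently train them.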
I expect that for better or worse, deep learning will probably be how we get to AGI, so I’m sceptical that thinking about new approaches to building AI outside it would yield object-level progress; it might be pretty useful in terms of illuminating certain ideas though, as a thought exercise.
In general? I think that going down this line of thought (if you aren’t very pessimistic about deep learning) would plausibly land you on interesting approaches to hard parts of the problem; I can see someone arriving at the kind of modularity approach above this way, for example. So it seems worth thinking about in absolute terms; in relative terms, though, I’m not sure how it compares to other generators.
Thanks for the thoughtful response, and for the link to the sequence on modularity (hadn’t seen that before). Will digest this.