Christopher Olah

Karma: 544

Christopher Olah Nov 2, 2019, 9:18 PM
LW: 21 AF: 10
AF
in reply to: evhub’s comment on: Chris Olah’s views on AGI safety
One thing I’d add, in addition to Evan’s comments, is that the present ML paradigm and Neural Architecture Search are formidable competitors. It feels like there’s a big gap in effectiveness, where we’d need to make lots of progress for “principled model design” to be competitive with them in a serious way. The gap causes me to believe that we’ll have (and already have had) significant returns on interpretability before we see capabilities acceleration. If it felt like interpretability was accelerating capabilities on the present margin, I’d be a bit more cautious about this type of argumentation.

(To date, I think the best candidate for a capabilities success case from this approach is Deconvolution and Checkerboard Artifacts. I think it’s striking that the success was less about improving a traditional benchmark, and more about getting models to do what we intend.)

Christopher Olah Nov 2, 2019, 9:12 PM
LW: 39 AF: 15
AF
in reply to: Ben Pace’s comment on: Chris Olah’s views on AGI safety
I think that’s a fair characterization of my optimism.

I think the classic response to me is “Sure, you’re making progress on understanding vision models, but models with X are different and your approach won’t work!” Some common values of X are not having visual features, recurrence, RL, planning, really large size, and language-based. I think that this is a pretty reasonable concern (more so for some Xs than others). Certainly, one can imagine worlds where this line of work hits a wall and ends up not helping with more powerful systems. However, I would offer a small consideration in the other direction: In 2013 I think no one thought we’d make this much progress on understanding vision models, and in fact many people thought really understanding them was impossible. So I feel like there’s some risk of distorting our evaluation of tractability by moving the goal post in these conversations.

I’m not surprised by other people feeling like they have less traction. I feel like the first three or so years I spent trying to understand the internals neural networks involved a lot of false starts with approaches that ended up being dead ends (eg. visualizing really small networks, or focusing on dimensionality reduction). DeepDream was very exciting, but it retrospect I feel like it took me another two or so years to really digest what it meant and how one could really use it as a scientific tool. And this is with the benefit of amazing collaborators and multiple very supportive environments.

One final thing I’d add is that, if I’m honest, I’m probably more motivated by aesthetics than optimism. I’ve spent almost seven years obsessed with the question of what goes on inside neural networks and I find the crazy partial answers we learn every year tantalizingly beautiful. I think this is pretty normal for early research directions; Kuhn talks about it a fair amount in The Structure of Scientific Revolutions.
What links here?
- Modeling the impact of safety agendas by Ben Cottier (Nov 5, 2021, 7:46 PM; 51 points)

Christopher Olah Nov 2, 2019, 8:54 PM
LW: 22 AF: 11
AF
in reply to: Ofer’s comment on: Chris Olah’s views on AGI safety

I’m curious what’s Chris’s best guess (or anyone else’s) about where to place AlphaGo Zero on that diagram

Without the ability to poke around at AlphaGo—and a lot of time to invest in doing so—I can only engage in wild speculation. It seems like it must have abstractions that human Go players don’t have or anticipate. This is true of even vanilla vision models before you invest lots of time in understanding them (I’ve learned more than I ever needed to about useful features for distinguishing dog species from ImageNet models).

But I’d hope the abstractions are in a regime where, with effort, humans can understand them. This is what I expect the slope downwards as we move towards “alien abstractions” to look like: we’ll see abstractions that are extremely useful if you can internalize them, but take more and more effort to understand.

Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?

Yes, I believe that RL agents have a much wider range of accident concerns than supervised / unsupervised models.

Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it’s safe?

Gurkenglas provided a very eloquent description that matches why I believe this. I’ll continue discussion of this in that thread. :)