Imaginary John: Well, uh, these days I’m mostly focusing on using my flimsy non-mastered grasp of the common-concept format to try to give a descriptive account of human values, because for some reason that’s where I think the hope is. So I’m not actually working too much on this thing that you think takes a swing at the real problem (although I do flirt with it occasionally).
That’s not actually what I spend most of my time on, it’s just a thing which came up in conversation with Eliezer that one time. I’ve never actually spent much time on a descriptive account of human values; I generally try to work on things which are bottlenecks to a wide variety of strategies (i.e. convergent hard subproblems), not things which are narrowly targeted to a single strategy.
What I’m actually spending most of my time on right now is figuring out how abstractions end up represented in cognitive systems, and how those representations correspond to structures (presumably natural abstractions) in the environment. In particular, I’d like to say things about convergent representations, such that we can both (a) test the claims on a wide variety of existing systems, and (b) have theorems saying that the claims extend to new kinds of systems.
… which, amusingly, looks like a much more ambitious version of interpretability work.
I know this is a necro bump, but could you describe the ambitious interp work you have in mind?
Perhaps something like: a probe that detects helpfulness with >90% accuracy and works on other models without retraining, once we calibrate on a couple of unrelated concepts.
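To make the question concrete, here is a toy sketch of that kind of probe transfer. Everything in it is an assumption for illustration, not anyone's actual method: the two "models" are synthetic, their activations are noisy random rotations of one shared latent space, "helpfulness" is just the sign of the first latent coordinate, and the calibration step is a least-squares linear map fit on unlabeled paired activations (a stand-in for calibrating on a couple of unrelated concepts). The names `run_demo`, `rot_a`, `rot_b`, and `b_to_a` are hypothetical.

```python
import numpy as np


def run_demo(seed=0, d=8, n_train=2000, n_cal=200, n_test=2000, noise=0.05):
    """Toy probe transfer between two synthetic 'models' (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Assumption: both models encode the same latent features, each through
    # its own random orthogonal rotation.
    rot_a = np.linalg.qr(rng.normal(size=(d, d)))[0]
    rot_b = np.linalg.qr(rng.normal(size=(d, d)))[0]

    def activations(z, rot):
        # Rotated latents plus a little independent noise.
        return z @ rot + noise * rng.normal(size=z.shape)

    # 1. Train a linear probe on model A only (least squares on +/-1 labels).
    z_train = rng.normal(size=(n_train, d))
    y_train = np.sign(z_train[:, 0])  # toy "helpfulness" label
    w = np.linalg.lstsq(activations(z_train, rot_a), y_train, rcond=None)[0]

    # 2. Calibrate: fit a linear map from B's activations to A's, using
    #    paired activations on shared inputs. No helpfulness labels are used.
    z_cal = rng.normal(size=(n_cal, d))
    b_to_a = np.linalg.lstsq(
        activations(z_cal, rot_b), activations(z_cal, rot_a), rcond=None
    )[0]

    # 3. Evaluate the unchanged probe on fresh inputs.
    z_test = rng.normal(size=(n_test, d))
    y_test = np.sign(z_test[:, 0])
    acts_a = activations(z_test, rot_a)
    acts_b = activations(z_test, rot_b)

    def accuracy(x):
        return float(np.mean(np.sign(x @ w) == y_test))

    return {
        "same_model": accuracy(acts_a),           # probe on the model it was trained on
        "uncalibrated": accuracy(acts_b),         # probe applied to B as-is
        "calibrated": accuracy(acts_b @ b_to_a),  # probe applied to B after mapping into A's space
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy the transfer works almost by construction, since the two representations really are linearly related and the calibration inputs span the whole latent space. The interesting empirical question is whether real models are close enough to that regime for anything like it to hold.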