One of my favorite things about you, John, is that you are excellent at prompting me to direct my attention towards important questions which lead to promising insights. Thanks for that!
I answered your questions, originally in my private notes, but partway through I decided to post them as a comment.
Imagine that your field achieved perfection—the ultimate theory, perfect understanding, building The Thing.
What has been achieved in the idealized version of the field, which has not yet been achieved today? What are the main barriers between here and there?
Detailed understanding of what it means for one agent to be aligned with another agent, or a group of agents.
We can easily check the functional properties of a causal process and argue that it satisfies theorems saying that, with high probability (WHP), it veers towards highly desirable states.
Like, I could point out the ways in which DRL fails at a glance, without appealing to instrumental convergence in particular?
Or maybe that’s part of the theory?
These theorems are intuitively obviously correct and correspond to original-intuitive-reality.
They are so correct that it’s easy to persuade people to use the aligned approach.
We understand what agents are, and how people fit into that picture, and the theory retrodicts all past problems with governments, with corporations, with principal-agent relationships.
We know how to take any reasonable training process and make it an aligned training process, with minimal alignment tax (<5%)
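(As a rough operationalization of the tax; this is my gloss, not a canonical definition:)

$$\text{tax} \;=\; 1 - \frac{\text{performance}(\text{aligned training process})}{\text{performance}(\text{unaligned baseline})} \;<\; 0.05$$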
Barriers:
We don’t know what agents are.
We don’t know what alignment means
We don’t know how to prove the right kinds of theorems
We don’t know if our concepts are even on the right track
They probably aren’t, except insofar as they spur the right language. It feels more like “how do I relate KL div and optimality probability” rather than “let’s prove theorems about retargetability”
Often, in hindsight, a field turns out to have been bottlenecked on the development of some new measurement method, ranging from physical devices like the thermometer to abstract ideas like Shannon’s entropy and information channel capacity.
In what places does it look like your field is bottlenecked on the ability to usefully measure something? What are the main barriers to usefully measuring those things?
Bottlenecks
Abstraction
Uninterpretable
Not sure where to draw “category boundaries”
Alignment
Don’t know what alignment really is
Or how to measure ground truth
Needs a vocabulary of concepts we don't have yet
Power-seeking
Unclear what actually gets trained, and how to measure power-seeking with respect to a distribution over goals
Would need the agent's current beliefs to use the current formalism (sketched at the end of this list)
Also, the current formalism doesn't capture game-theoretic aspects like logical blackmail
Intent seems more important
But this is bottlenecked on “what is going on in the agent’s mind”
Interpretability
I’m not sure. Networks are big and messy.
Capability
Actually I bet this is natural in terms of the right alignment language
Compute efficiency in terms of ability to bring about cognitively important outcomes
Seems strictly harder than “capability”
Time remaining until TAI
Uncertainty about how AI works, how minds work, and what the weighted-edge distance is on the lattice of AI discoveries
Barriers
Conceptual roadblocks
But what?
(Filled in above)
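(For reference, the "current formalism" flagged under power-seeking above: the average-optimal-value notion of POWER, written from memory and modulo details:)

$$\mathrm{POWER}_{\mathcal{D}}(s, \gamma) \;=\; \frac{1-\gamma}{\gamma}\, \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) - R(s) \right]$$

Hence the bottleneck: it's defined relative to a known environment and a distribution $\mathcal{D}$ over reward functions, not relative to what a trained agent actually believes or wants.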
What are the places where your field is flailing around in the dark, trying desperate ideas and getting unhelpful results one after another? What are the places where it feels like the problem is formulated in the wrong language, and a shift to another frame might be required to ask the right question or state the right hypothesis?
Flailing:
IDA
ELK
Everything theoretical feels formulated wrong, except maybe logical induction / FFS / John's work
This is important!
(Also I wouldn't be surprised if I'd say Vanessa's work is not flailing, if I could understand it)
Retargetability → instrumental convergence seems like an important piece, but not the whole thing; part of it is phrased correctly
AUP (attainable utility preservation) was flailing
Did I get impact wrong? Or is reward maximization wrong?
I think I got impact right philosophically, but not the structure of how to get one agent to properly care about impact on other agents.
I just found a good trick (penalize the agent for impact on other goals it could have pursued; sketched below), which works really well in a range of cognitively available situations (physical proximity) but which breaks down under tons of optimization pressure
And the “don’t gain power for your own goal” seems like it should be specifiable and non-hacky, but I don’t actually see how to do it right.
But note that getting impact right for-real wouldn’t save the world AFAICT
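(The trick, for concreteness; this is the AUP penalty, written from memory and modulo scaling details:)

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{N} \sum_{i=1}^{N} \Big| Q^{*}_{R_i}(s,a) \;-\; Q^{*}_{R_i}(s, \varnothing) \Big|,$$

penalizing shifts in the agent's ability to pursue each auxiliary goal $R_i$, relative to taking the no-op $\varnothing$. This structure works under the optimization pressure we tested; I don't expect it to survive extreme pressure.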
Basically everything else
What happened when I shifted to the retargetability frame?
I don't think I did that until recently, actually; the original post was too anchored on instrumental convergence over outcome sets, missing the elegant functional statement.
And my shift to this frame still feels incomplete.
Corrigibility still feels like it should work in the right language and grounding
Sometimes, we have a few different models, each of which works really well in different places. Maybe it feels like there should be some model which unifies them all, which could neatly account for all these phenomena at once—like the unification of electricity, magnetism and optics in the 19th century.
Are there different models in your field which feel like they point to a not-yet-known unified model?
I guess different decision theories? Not super familiar
Not coming up with as many thoughts here, because I feel like our “partial models” are already contradicted and falsified on their putative domains of applicability, so what good would a unification do? More concisely stated wrongness?
One of the main ways we notice (usually implicit) false assumptions in our models is when they come into conflict with some other results, patterns or constraints. This may look like multiple models which cannot all be true simultaneously, or it may look like one model which looks like it cannot be true at all yet nonetheless keeps matching reality quite well. This is a hint to reexamine the assumptions under which the models are supposedly incompatible/impossible, and especially look for any hidden assumptions in that impossibility argument.
Are there places in your field where a few models look incompatible, or one model looks impossible, yet nonetheless the models match reality quite well?
Tempted to say "no", because the last phrase in the last sentence doesn't seem true here.
Here was one. The instrumental convergence theorems required a rather precise environmental symmetry, which seemed weird. But now I have a new theory which relates abstract functional properties of how events come about, to those events coming out similarly for most initial conditions. And that doesn’t have anything to do with environmental / outcome-level symmetries. So at first the theory was right in its domain of applicability, but the domain seemed absurdly narrow on a few dimensions.
It was narrow not because I missed how to prove sufficiently broad theorems about MDPs, but because I was focusing on the wrong details and missing the broader concept underlying everything I'd observed.
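(Loosely, the functional statement: let $f(X \mid \theta)$ be the probability that the decision-making procedure with parameter $\theta$ realizes an outcome in the set $X$. Suppose there are $n$ parameter permutations $\phi_1, \dots, \phi_n$ which each retarget "prefers $A$" into disjoint regions of "prefers $B$". Then, stated from memory:)

$$\Pr_{\theta \sim \mathcal{D}}\big[f(B \mid \theta) > f(A \mid \theta)\big] \;\ge\; n \cdot \Pr_{\theta \sim \mathcal{D}}\big[f(A \mid \theta) > f(B \mid \theta)\big]$$

for every permutation-invariant distribution $\mathcal{D}$ over parameters. The conditions are on how $f$ responds to retargeting, not on symmetries of the environment's outcome sets.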
I guess the impossibility of value learning seems correct but spiritually inapplicable to the problem we want to solve, but I don’t quite know how to articulate that.
A few months back I wrote about how corrigibility is often impossible under reward maximization. Reward maximization seems pretty useful for motivating agents, but it's so, so broken for nontrivial kinds of motivation.
The space of possible physical laws or theorems or principles is exponentially vast. Sometimes, the hard part is to figure out what the relevant factors are at all. For instance, to figure out how to reproducibly culture a certain type of cell, a biologist might need to provide a few specific signal molecules, a physical medium with the right elasticity or density, a particular temperature range, and/or some other factors which nobody even thought to test yet.
Are there places in your field where nobody even knows what key factors must be controlled for some important outcome to robustly occur?
In a robotics task, how would we ensure test-time agents did anywhere between 2 and 10 jumping jacks in an episode?
What factors would you control there? We don't know how to "target" along these dimensions; at least, it would take more effort than I think it should (naive spec sketched below)
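(To make the gap concrete, here's the naive approach; `count_jumping_jacks` is a hypothetical helper, a pose-sequence classifier we don't know how to make robust:)

```python
from typing import Sequence

def count_jumping_jacks(trajectory: Sequence) -> int:
    """Hypothetical pose-sequence classifier. We don't know how to build a
    version that an adversarially-optimized policy can't fool."""
    raise NotImplementedError

def episode_return(trajectory: Sequence) -> float:
    """Naive spec: return 1 iff the episode contains 2 to 10 jumping jacks."""
    n = count_jumping_jacks(trajectory)
    return 1.0 if 2 <= n <= 10 else 0.0
```

Writing the spec down is the easy part; the open question is which training factors to control so the learned policy robustly satisfies it at test time, instead of whatever proxy the optimizer finds.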
Are there places in your field where some concept seems very central to understanding, but nobody knows its True Name yet?
Corrigibility
Corrigibility
Corrigibility
Alignment
I remember thinking there was a concept like this related to logical time, but I forget what it was
At the social level, what are the barriers to solving the main problems in the previous two questions? Why aren’t they already solved? Why isn’t progress being made, or made faster?
Bad incentives — I think people working on e.g. value learning should not do that
I think the marginal return from more researcher hours on deconfusion outweighs the return from empirical wisdom in applying known-broken paradigms—gut instinct
People don't know the feel of a True Name-having object, or forget how important the discovery of the name is.
I do this sometimes.
Or it’s really hard
Or we need more kinds of minds looking at this problem
But I don’t know what this means
Are there places where your field as a whole, or you personally, pursue things which won’t really help with the main problems (but might kind of “look like” they address the problems)?
Field as a whole
Value learning seems doomed; why are people still working on it?
There are non-core researchers working in ways which don’t make sense to me, but I don’t remember who they are or what they were working on
They’re more field-adjacent
I used to be more excited about IDA-style insights but now I feel queasy about skipping over the hard parts of a problem without really getting insights about how alignment works
This is a lesson which I took too long to learn, where I was too tolerant of finding "clever" ways to box my uncertainty. There's a Sequences article about this, but I don't want to go find it right now.
What kind of person can I become who would notice this error on my own, before making it, before hearing that this is a failure mode?
Anyways.
Me
I think impact measurement is doomed at least in the current paradigm
In the hindsight of a perfected field, I think impact regularization would be a thing you could do robustly, though not for a decisive act
I’m basically done working on impact measurement though
I'm finishing up my PhD right now, but I think I'm doing pretty well on this axis now. I used to have a bad case of pica.
Pick someone you know, or a few people, who are smart and have good judgment. What would their answers to these questions be?
John, I think you would not strongly disagree with most anything I said, but I feel like you would say that corrigibility isn’t as pragmatically important to understand. Or, you might say that True-Name-corrigibility is actually downstream of the True-Name-alignment concepts we need, and it’s the epiphenomenon. I don’t know. This prediction is uncertain and felt more like I queried my John model to give a mental speech which is syntactically similar to a real-John-speech, rather than my best prediction of what you would say.