[Question] How is ARC planning to use ELK?

jacquesthibsDec 15, 2022, 8:11 PM

24 points

AI Eliciting Latent Knowledge Alignment Research Center (ARC)MATS Program

Let’s say we arrive at a worst-case solution for ELK, how are we planning to use it? My initial guess was that ELK is meant to help make IDA viable so that we may be able to use it for some automated alignment-type approach. However, this might not be it. Can someone clarify this? Thanks.

jacquesthibsDec 15, 2022, 8:11 PM

24 points

5 comments1 min readLW link

AI Eliciting Latent Knowledge Alignment Research Center (ARC)MATS Program

paulfchristiano Dec 16, 2022, 5:10 PM
13 points
0

ARC’s hope is to train AI systems with the loss function “How much do we like the actions proposed by this system?”, in order to produce AI systems that take actions we like.
If human overseers know everything the model knows then this is just RLHF. However we are concerned about cases where the model understands something that the humans do not. In that case, we hope to use ELK to elicit key information that will help us understand the consequences of the AI’s action so that we can decide whether they are good.
(The same difficulty would arise if you were trying to train AI systems to evaluate actions and then searching against those evaluations, which is the case we discuss in the ELK report.)
What links here?
- Highlights and Prizes from the 2021 Review Phase by Raemon (Jan 23, 2023, 9:41 PM; 38 points)
- Orpheus16's comment on ARC’s first technical report: Eliciting Latent Knowledge by paulfchristiano (Jan 16, 2023, 7:38 AM; 19 points)
- the gears to ascension Dec 17, 2022, 9:07 AM
  3 points
  1
  Parent
  
  what happens if this finds a way to satisfy values that the human actually has, but would not have if they had been able to do ELK on their own brain? eg, for example, I’m pretty sure I don’t want to want some things I want, and I’m worried about s-risks from the scaled version of locking in networks of conflicting things people currently truly want but truly wouldn’t want to truly want. eg, I’m pretty sure mine are milder than this, but some people truly want to hurt others in ways the other doesn’t want order to get ahead, and would resist any attempt to remove hurting others. given that these people can each have their own ai amplifier, what tool can be both probabilistically-verifiably trustable but also help both ai and human mutually discover ways to be aligned with others that neither could have discovered on their own?
  - paulfchristiano Dec 17, 2022, 4:33 PM
    7 points
    3
    Parent
    
    I’ll be happy if AI gives people time/space/safety to figure out what they want while taking actions in the world that preserve option value.
    The kind of AI alignment solution we’re working on isn’t a substitute for people deciding how they want to reflect and develop and decide what they value. The idea is that if AI is going to be part of that process, then the timing and nature of AI involvement should be decided by people rather than by “we need to deploy this AI now in order to remain competitive and accept whatever affects that has on our values.”
    You could imagine AI solutions that try to replace the normal process of moral deliberation and reconciliation (rather than simply being a tool to help it), but I’ve never seen a proposal along those lines that didn’t seem really bad to me.
lberglund Dec 16, 2022, 4:40 PM
2 points
0

The ELK report has a section called “Indirect normativity: defining a utility function” that sketches out a proposal for using ELK to help align AI. Here’s an excerpt:
Suppose that ELK was solved, and we could train AIs to answer unambiguous human-comprehensible questions about the consequences of their actions. How could we actually use this to guide a powerful AI’s behavior? For example, how could we use it to select amongst many possible actions that an AI could take?
The natural approach is to ask our AI “How good are the consequences of action A?” but that’s way outside the scope of “narrow” ELK as described in Appendix: narrow elicitation.
Even worse: in order to evaluate the goodness of very long-term futures, we’d need to know facts that narrow elicitation can’t even explain to us, and to understand new concepts and ideas that are currently unfamiliar. For example, determining whether an alien form of life is morally valuable might require concepts and conceptual clarity that humans don’t currently have.
We’ll suggest a very different approach:
1. I can use ELK to define a local utility function over what happens to me over the next 24 hours. More generally, I can use ELK to interrogate the history of potential versions of myself and define a utility function over who I want to delegate to—my default is to delegate to a near-future version of myself because I trust similar versions of myself, but I might also pick someone else, e.g. in cases where I am about to die or think someone else will make wiser decisions than I would.
2. Using this utility function, I can pick my “favorite” distribution over people to delegate to, from amongst those that my AI is considering. If my AI is smart enough to keep me safe, then hopefully this is a pretty good distribution.
3. The people I prefer to delegate to can then pick the people they want to delegate to, who can then pick the people they want to delegate to, etc. We can iterate this process many times, obtaining a sequence of smarter and smarter delegates.
4. This sequence of smarter and smarter delegates will gradually come to have opinions about what happens in the far future. Me-of-today can only evaluate the local consequences of actions, but me-in-the-future has grown enough to understand the key considerations involved, and can thus evaluate the global consequences of actions. Me-of-today can thus define utilities over “things I don’t yet understand” by deferring to me-in-the-future.
Charlie Steiner Dec 15, 2022, 11:43 PM
2 points
−1

I would instead say that ELK is a component of getting good human feedback, and the more ambitious the ELK is (e.g. requiring the human to actually understand something correctly for it to count as reported), the more it’s sort of trying to do all the work needed to get good human feedback (which may involve a lot of subtle work), and making it so that you only need a very simple wrapper around it (e.g. approval-directedness, or RLHF, or human-guided self-modification, or automated alignment research) to get good outcomes.

No comments.