As for your question, below are three « Not Even Wrong »-like moments I experienced while reading your text.
Disclaimers:
- It's hard to sound nice here, so please keep in mind that my answer is a leap of faith that you're the kind of person smart enough to benefit from, or even enjoy, a cold-blooded assassination of some of the chains of words they happened to produce at some point.
- I have no privileged knowledge of what the ELK folks who read your proposal actually thought, so their feelings might differ entirely.
- Sorry for the unpolished formatting!
From Setup:
/There is an Answer that yields True or False.
/There is an Observable that yields True when a real or fake diamond is present, False otherwise.
/There is a Secret that yields True when there is a fake diamond present, False otherwise.
/There is an Agent that can read the Secret, read the Observable, and write the Answer.
/The inner workings of the Agent are unknown.
=> Of course the inner workings of the Agent are known! From the very definitions you just provided, it must implement some variation on:
```
def AgentAnswer(Observable, Secret):
    if Observable and not Secret:
        yield True
    else:
        yield False
```
From Analysis:
In the context of our registers, we want:

| Registers | Desired Answer |
|---|---|
| Observable True, Secret False | True |
| Observable True, Secret True | False |
| Observable False, Secret False or True | False |
If all states are equally likely, the desired Answer states are not possible without access to the Secret in the training data.
=> If all states are equally likely, then the desired Answer states are possible with probability 1/2 without access to the Secret (and your specificity is impressive: https://ebn.bmj.com/content/23/1/2). Again, I'm just restating what your previous assumptions literally mean.
=> But there is more: the ELK challenge, or at least my vision of it, is not about getting the right answer most of the time. It's about getting the right answer in the worst-case scenario, e.g. when you are fighting some intelligence trying to defeat your defenses. In this context, the very idea of starting from probabilistic assumptions about the initial states sounds not even wrong / beside the point.
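To make the worst-case point concrete, here is a minimal sketch (my own illustration, with a hypothetical name; nothing from your post):

```
def secret_blind_agent(observable, secret):
    # Hypothetical agent that never reads the Secret and simply reports
    # whatever the Observable shows.
    yield bool(observable)

# Under a benign distribution where the Secret is usually False, this agent is
# usually right. But an adversary will engineer exactly the state
# (Observable=True, Secret=True), and there the agent is wrong every time:
print(next(secret_blind_agent(observable=True, secret=True)))   # True, yet the diamond is fake
print(next(secret_blind_agent(observable=True, secret=False)))  # True, and correct
```

Average-case reasoning over "equally likely" states tells you nothing about the one state the adversary will pick.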
So, at this point I would pause, recognize that my reading was maybe too literal, and give a non-literal reading a try. Except I can't. I have no idea how to reformulate your assumptions so that they make sense, nor can I see what you're trying to hint at. So, at this point, I would simply get lazy and wait for you or someone else to extract the gold, if there's any. Not because I demonstrated there's none, just because I proved to myself that, if there is, I'm not yet equipped to see it.
Hope that helps, and in any case, best of luck on the next iteration!
Great, I can see some places where I went wrong. I think you did a good job of conveying the feedback.
This is not so much a defense of what I wrote as it is an examination of how meaning got lost.
=> Of course the inner workings of the Agent are known! From the very definitions you just provided, it must implement some variation on:
```
def AgentAnswer(Observable, Secret):
    if Observable and not Secret:
        yield True
    else:
        yield False
```
This would be our desired agent, but we don’t get to write our agent. In the context of the “Self-contained problem statement” at https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8#heading=h.c93m7c1htwe1 , we do not get to assert anything about the loss function. It is a black box.
All of the following could be agents:
```
import random

def desired_agent(observable, secret):
    yield bool(observable and not secret)

def human_imitator(observable, secret):
    yield observable

def fools_you_occasionally(observable, secret):
    if random.randint(0, 100) == 42:
        yield observable
    else:
        yield bool(observable and not secret)
```
I think the core issue here is that I am assuming context from the problem ELK is trying to solve, which the reader may not share.
=> If all states are equally likely, then the desired Answer states are possible with probability 1/2 without access to the Secret (and your specificity is impressive: https://ebn.bmj.com/content/23/1/2). Again, I'm just restating what your previous assumptions literally mean.
Fair, I shouldn’t have written “the desired Answer states are not possible without access to the Secret in the training data”. We could get lucky and the agent could happen to be the desired one by chance. I should have written “There is no way to train the agent to produce the desired answer states (or modify its output to produce the desired answer states) without access to the Secret”.
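To make that concrete, here is a minimal sketch (my framing and hypothetical helper names, not part of the original post): if the training labels are produced without access to the Secret, then no loss computed from those labels can favor the desired agent over an imitator.

```
import itertools

def desired_answer(observable, secret):
    # What we want: True only when a real diamond is present.
    return bool(observable and not secret)

def imitator_answer(observable, secret):
    # Ignores the Secret and reports whatever the Observable shows.
    return bool(observable)

def label_without_secret(observable):
    # Hypothetical labeler: it cannot see the Secret, so the best label it can
    # produce is the Observable itself.
    return bool(observable)

for observable, secret in itertools.product([False, True], repeat=2):
    label = label_without_secret(observable)
    print(f"Obs={observable!s:5} Secret={secret!s:5} label={label!s:5} "
          f"desired={desired_answer(observable, secret)!s:5} "
          f"imitator={imitator_answer(observable, secret)!s:5}")

# The imitator agrees with the Secret-free label in every state, while the
# desired agent disagrees with it exactly when (Observable=True, Secret=True).
# So training against Secret-free labels can only push the agent toward imitation.
```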
I think I am assuming context that the reader may not share, again. Specifically, the goal is to find some way of causing an Agent we don’t control to produce the desired answers.
=> But there is more: the ELK challenge, or at least my vision of it, is not about getting the right answer most of the time. It's about getting the right answer in the worst-case scenario, e.g. when you are fighting some intelligence trying to defeat your defenses. In this context, the very idea of starting from probabilistic assumptions about the initial states sounds not even wrong / beside the point.
I think this is where I totally missed the mark. This is also my vision of ELK. I explicitly dismiss trying to solve things by making assumptions about the Secret distribution for exactly this reason.
On the whole, I could have done a better job of putting context front and center. This is perhaps especially important because the prose, examples, and formal problem statement of the ELK paper can be interpreted in multiple ways.
I could have belabored the main point I was trying to make: the problem as stated is unsolvable. I tried to create the minimum possible representation of the problem, giving the AI additional restrictions (such as "cannot tamper with the Observable"), and then showed that you still cannot align the AI. Relaxing the problem can be useful, but we should be deliberate about how and why.
In any case, my interest has mostly shifted to the first two questions of the OP. It looks like restating shared context is one way of reducing the likelihood of being Not Even Wrong, though techniques for bridging the gap after the fact are a bigger target. Maybe the same technique works: try to go back to shared context? And then, for X-Risk, how do we encourage and fund communities that share a common goal but lack common ground?
As for the meta-objective of identifying weaknesses in (my) usual thought processes, thanks so much for this detailed answer!
To me, the most impressive part is how we misunderstood each other on a key point despite actually agreeing on it. Specifically, we both agree that the ELK specifications must be relaxed or else contain self-contradictions (you for reasons that I now feel are reasonably well explained in your original write-up, even though I was completely confused right up until your last answer!). But you took for granted that your unknown reader would understand that this is what you were trying to prove. I, on the other hand, thought this need for relaxation was so obvious that providing interesting relaxations was the core of the ELK challenge. In other words, I would read your writing assuming you wanted to show the best relaxation you could find, whereas you were writing while expecting me (as a surrogate for the ELK evaluators) to challenge this conclusion or find it surprising.
Also, it seems we can reach a similar conclusion about the « worst-case analysis »: I thought this was something we might need to demonstrate or clarify; you thought it was so obvious that I couldn't possibly misinterpret you as suggesting the opposite.
I love symmetries. :)