Great, I can see some places where I went wrong. I think you did a good job of conveying the feedback.
This is not so much a defense of what I wrote as it is an examination of how meaning got lost.
=> Of course the inner workings of the Agent are known! From the very definitions you just provided, it must implement some variation on:
```
def AgentAnswer(Observable, Secret):
    if Observable and not Secret:
        yield True
    else:
        yield False
```
This would be our desired agent, but we don’t get to write our agent. In the context of the “Self-contained problem statement” at https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8#heading=h.c93m7c1htwe1 , we do not get to assert anything about the loss function. It is a black box.
All of the following could be agents:
```
import random

def desired_agent(observable, secret):
    yield bool(observable and not secret)

def human_imitator(observable, secret):
    yield observable

def fools_you_occasionally(observable, secret):
    # One time in 101, answer like the human imitator; otherwise answer like the desired agent.
    if random.randint(0, 100) == 42:
        yield observable
    else:
        yield bool(observable and not secret)
```
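As a quick sanity check (my own sketch, using only the toy agents defined above, not anything from the ELK write-up), enumerating every (observable, secret) state shows that the desired agent and the human imitator come apart only on the one state where the Secret actually matters:

```
from itertools import product

# Compare the two deterministic agents on every state. They disagree
# only when observable is True and secret is True.
for observable, secret in product([False, True], repeat=2):
    desired = next(desired_agent(observable, secret))
    imitator = next(human_imitator(observable, secret))
    print(observable, secret, desired, imitator, desired == imitator)
```

That single disagreement state is the whole game: any data that never exercises it looks identical to both agents.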
I think the core issue here is that I am assuming context from the problem ELK is trying to solve, which the reader may not share.
=> If all states are equally likely, then the desired Answer states are possible with probability 1⁄2 without access to the Secret (and your specificity is impressive: https://ebn.bmj.com/content/23/1/2 ). Again, I'm just restating what your previous assumptions literally mean.
Fair, I shouldn’t have written “the desired Answer states are not possible without access to the Secret in the training data”. We could get lucky and the agent could happen to be the desired one by chance. I should have written “There is no way to train the agent to produce the desired answer states (or modify its output to produce the desired answer states) without access to the Secret”.
I think I am assuming context that the reader may not share, again. Specifically, the goal is to find some way of causing an Agent we don’t control to produce the desired answers.
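To make that concrete, here is a minimal sketch of my own, with an assumed labelling rule: the labeller sees only the Observable, so the Secret never varies in the training data. Under that assumption, both the desired agent and the human imitator fit the training data perfectly, so no training signal can prefer one over the other:

```
# Toy training set (assumption for illustration: the labeller only sees
# the Observable, so the Secret is fixed to False and the label is just
# the Observable itself).
training_set = [(obs, False, obs) for obs in (False, True)]

def empirical_loss(agent):
    # 0/1 loss of an agent against these Secret-free labels.
    return sum(next(agent(obs, secret)) != label
               for obs, secret, label in training_set)

# Both candidates fit this training data perfectly, so the training
# signal cannot prefer the desired agent over the human imitator.
print(empirical_loss(desired_agent))   # 0
print(empirical_loss(human_imitator))  # 0
```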
=> But there is more: the ELK challenge, or at least my vision of it, is not about getting the right answer most of the time. It's about getting the right answer in the worst-case scenario, e.g. when you are fighting some intelligence trying to defeat your defenses. In this context, the very idea of starting from probabilistic assumptions about the initial states sounds not even wrong / missing the point.
I think this is where I totally missed the mark. This is also my vision of ELK; it is exactly why I explicitly dismiss trying to solve things by making assumptions about the Secret distribution.
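To illustrate the worst-case framing with the same toy agents (again my own sketch, not part of the ELK statement): it does not matter how improbable a state is under some assumed distribution, because an adversary who controls the state can always pick the one where a Secret-ignoring agent is wrong:

```
def worst_case_error(agent):
    # Return a state on which the agent's answer differs from the
    # desired answer (observable and not secret), if such a state exists.
    for observable in (False, True):
        for secret in (False, True):
            if next(agent(observable, secret)) != (observable and not secret):
                return (observable, secret)
    return None

# An adversary who chooses the state can always trip up an agent that
# ignores the Secret, no matter how unlikely that state is assumed to be.
print(worst_case_error(human_imitator))  # (True, True)
print(worst_case_error(desired_agent))   # None -- but only because it reads the Secret
```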
On the whole, I could have done a better job of putting the context front and center. This is, perhaps, especially important because the prose, examples, and formal problem statement of the ELK paper can be interpreted in multiple ways.
I could have belabored the main point I was trying to make: The problem as stated is unsolvable. I tried to create the minimum possible representation of the problem, giving the AI additional restrictions (such as “cannot tamper with the Observable”), and then showed that you still cannot align the AI. Relaxing the problem can be useful, but we should be deliberate about how we relax it and why.
In any case, my interest has mostly shifted to the first two questions of the OP. It looks like restating shared context is a way of reducing the likelihood of being Not Even Wrong, though techniques for bridging the gap after the fact are a bigger target. Maybe the same technique works? Try to go back to shared context? And then, for X-Risk, how do we encourage and fund communities that share a common goal without common ground?
As for the meta-objective of identifying weaknesses in (my) usual thought processes, thanks so much for this detailed answer!
To me the most impressive part is how we misunderstood each other on a key point despite actually agreeing on it. Specifically, we both agree that ELK specifications must either be relaxed or contain self-contradictions (you for reasons that I now feel are reasonably well explained in your original writing, even though I was completely confused right up until your last answer!). But you took for granted that your unknown reader would understand that this was what you were trying to prove. I, on the other hand, thought this need for relaxation was so obvious that providing interesting relaxations was the core of the ELK challenge. In other words, I would read your writing assuming you wanted to show the best relaxation you could find, whereas you would write expecting me (as a surrogate for the ELK evaluators) to challenge this conclusion or find it surprising.
Also, it seems we can draw a similar conclusion about the “worst-case analysis”: I thought this was something we might need to demonstrate or clarify; you thought it was so obvious that I couldn’t possibly misinterpret you as suggesting the opposite.
I love symmetries. :)