If I did not already have native philosophical abilities on par with the overseer's, I couldn't give answers to any philosophical questions that the overseer would find helpful, unless I had the superhuman ability to create a model of the overseer, including his philosophical abilities, from scratch.
I don't quite understand the juxtaposition with the white-box metaphilosophical algorithm. If we could make a simple algorithm that exhibited weak philosophical ability, couldn't the RL learner also use such a simple algorithm to find weak philosophical answers (which would in turn receive a reasonable payoff from us)?
Is the idea that by writing the white-box algorithm we are providing key insights about what metaphilosophy is, insights that an AI can't extract from a discussion with us or from inspection of our philosophical reasoning? At a minimum it seems like we could teach such an AI how to do philosophy, and this would be no harder than writing an algorithm (I grant that it may not be much easier).
It seems to me that we need to understand metaphilosophy well enough to be able to write down a white-box algorithm for it before we can be reasonably confident that the AI will correctly solve every philosophical problem it eventually comes across. If we just teach an AI how to do philosophy without an explicit understanding of it in the form of an algorithm, how do we know that the AI has fully learned it (and not some subtly wrong version of doing philosophy)?
Once we are able to write down a white-box algorithm, wouldn't it be safer to implement, test, and debug the algorithm directly, as part of an AI designed from the start to take advantage of it, rather than indirectly having an AI learn it (and then presumably verifying that its internal representation of the algorithm is correct and that there aren't any potentially bad interactions with the rest of the AI)? Even the latter could reasonably be called white-box, since you are actually looking inside the AI and making sure that it has the right stuff inside. I was mainly arguing against a purely black-box approach, where we start to build AIs while having little understanding of metaphilosophy, and therefore can't look inside the AI to see if it has learned the right thing.