In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn’t seem to work quite the same way.
Note that practical artificial RL agents make decisions using the value function (indeed that is its entire point), rather than directly computing the expected future discounted reward—as computing that measure is generally intractable.
The details of the various RL mechanisms in the human brain are complex and are still very much an active area of study, but if anything the evidence rather strongly indicates that value function approximation is a necessary component of the biological solution for the same reasons we employ it in our artificial systems (and perhaps more so, because neurons are slow and thus fast planning search is much less of an option).
Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake.
Any practical RL agent exhibits the same behavior: once it learns that something leads to rewards, it encodes that as a valuable something in its value function—value is just approximate expected discounted future reward.
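A minimal sketch of that dynamic in code (the chain of states, the rewards, and the constants below are invented purely for illustration): under a simple tabular TD(0) update, a state that merely predicts reward ends up with high estimated value, and the agent can then act on those value estimates directly instead of recomputing expected future reward.

```python
# Toy tabular TD(0) sketch (states, rewards, and constants are made up for illustration).
# "lever" never yields reward itself, but because it reliably precedes reward it ends up
# with high estimated value, and decisions can then be driven by V alone.
next_state = {"start": "lever", "lever": "food", "food": "end", "end": "end"}
reward_on_entry = {"lever": 0.0, "food": 1.0, "end": 0.0}   # reward received on entering a state

V = {s: 0.0 for s in ["start", "lever", "food", "end"]}     # learned value estimates
alpha, gamma = 0.1, 0.9                                     # learning rate, discount factor

for episode in range(500):
    s = "start"
    while s != "end":
        s2 = next_state[s]
        # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
        V[s] += alpha * (reward_on_entry[s2] + gamma * V[s2] - V[s])
        s = s2

print(V)  # V["lever"] converges near 1.0 and V["start"] near 0.9: reward-predicting states become "valuable"
```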
It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward… when the reward function is defined over the old model.
The challenge then shifts to specifying, in terms of a poor initial world model, an initial utility function that somehow develops into the correct long-term utility function when blown up to superintelligence.
For example, a utility function that assigns high value to “stay in the box” is of course probably a very bad idea due to perverse instantiations.
In the concept learning approach—if I understand it correctly—we define the reward/utility function through a manual mapping of concepts → utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.
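For concreteness, here is a minimal sketch of what such an example-defined reward function might look like (the primitive concepts and utility labels below are invented, and a linear least-squares fit stands in for whatever learning method would actually be used): the designers label a handful of concept-activation vectors with utilities, and the fitted function then assigns utilities to unlabeled concept states.

```python
# Toy sketch of a reward function defined by (concept activations -> utility) examples.
# The concept features and utility labels are invented for illustration.
import numpy as np

# Each row: activations of primitive concepts, e.g. [human_smiling, human_harmed, task_done]
concept_examples = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])
utilities = np.array([1.0, -1.0, 0.5, 0.0])  # designer-assigned utilities for each example

# Fit a linear reward model by least squares: reward(c) ~= w . c
w, *_ = np.linalg.lstsq(concept_examples, utilities, rcond=None)

def reward(concept_activation):
    """Reward defined entirely by the hand-labelled example set above."""
    return float(np.dot(w, concept_activation))

print(reward(np.array([0.5, 0.0, 1.0])))  # utility assigned to a previously unseen concept state
```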
One issue with this type of mapping (versus say the IRL alternative) is that it requires the designers to determine in advance some key hyperparameters of a correct/safe reward function, such as its overall distribution over time.
The other, bigger issue is the distinction between concepts that represent actual concrete beliefs about the current state of the world vs. imagined beliefs about the world or abstract beliefs about potential future world states. We want the reward function to be high only for concept sequence inputs that correspond to internal representations of the AI actually observing and believing that it did something ‘good’, not situations where the AI just imagines a good outcome. This actually gets tricky pretty quickly, because essentially it involves mapping out what amount to simulated outcome states in the AI’s mind.
You can’t just have the AI imagine a nice world in the abstract and then hard code that concept to high reward. You actually need the AI to concretely experience a nice world internally and map those concept sequences to high reward.
In the case of the DQN Atari agent, this isn’t a problem, because the agent exists entirely in a simulation that provides correct training data fully covering the relevant domain of the reward function.
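To make the observed-vs-imagined distinction above concrete, here is a purely hypothetical illustration (the provenance tag and the concept names are my own invention, not something the architecture is claimed to provide): the reward function fires only on concept activations the agent takes to be grounded in actual observation, not on the same concepts activated inside a planning rollout or daydream.

```python
# Toy illustration of rewarding only observation-grounded concept activations,
# not imagined ones. The "provenance" tag and concept names are hypothetical.
from dataclasses import dataclass

@dataclass
class ConceptActivation:
    concept: str      # e.g. "humans_flourishing"
    strength: float   # activation level in [0, 1]
    provenance: str   # "observed" (grounded in sensory evidence) or "imagined" (planning rollout)

GOOD_CONCEPTS = {"humans_flourishing": 1.0}   # hypothetical concept -> utility weights

def reward(activations):
    """Only activations the agent believes describe the actual world contribute reward."""
    total = 0.0
    for a in activations:
        if a.provenance == "observed" and a.concept in GOOD_CONCEPTS:
            total += GOOD_CONCEPTS[a.concept] * a.strength
    return total

# An imagined good outcome earns nothing; an observed one does.
print(reward([ConceptActivation("humans_flourishing", 0.9, "imagined")]))  # 0.0
print(reward([ConceptActivation("humans_flourishing", 0.9, "observed")]))  # 0.9
```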
The challenge then shifts to specifying, in terms of a poor initial world model, an initial utility function that somehow develops into the correct long-term utility function when blown up to superintelligence.
Agreed, this is a key problem.
In the concept learning approach—if I understand it correctly—we define the reward/utility function through a manual mapping of concepts → utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.
I’ve intentionally left vague the exact mechanism of how to define the initial utility function, since I don’t feel like I have a good answer to it. An IRL approach sounds like it’d be one possible way of doing it, but I haven’t had the chance to read more about it yet.
The main scenario I had implicitly in mind had something resembling a “childhood” for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered “good” or “bad”, so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback. Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts. Of course this essentially assumes a slow takeoff and an environment where there is time to give the AI an extended childhood, so it might very well be that this is unfeasible. (Another potential problem with it is that the AI’s values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)
Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the “right” judgments. One problem there would be finding an appropriate and large enough dataset, plus again the fact that different humans would have differing judgments, making the set noisy. (But maybe that could be leveraged to one’s advantage, too, so that the AI would only be sure about the kinds of moral values that were nearly universally agreed upon.)
The main scenario I had implicitly in mind had something resembling a “childhood” for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered “good” or “bad”, so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback.
This really is the most realistic scenario for AGI in general, given the generality of the RL architecture. Of course, there are many variations—especially in how the training environment and utility feedback interact.
Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts
If we want the AI to do human-ish labor tasks, a humanoid body makes lots of sense. It also makes sense for virtual acting, interacting with humans in general, etc. A virtual humanoid body has many advantages—with instantiation in a physical robot as a special case.
(Another potential problem with it is that the AI’s values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)
Yep, kind of unavoidable unless somebody releases the first advanced AGI for free. Even then, most people wouldn’t invest the time to educate it and instill their own values.
Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the “right” judgments.
So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility: how do you translate that into the AI’s reward/utility function? You’d need to somehow also map encodings of imagined moral scenarios back and forth between encodings of observation histories.
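As an entirely toy version of the first half of that pipeline (invented scenarios and labels, with scikit-learn standing in for whatever model would actually be used): training on (scenario description, judgment) pairs gives you a function from English text to a utility score, but nothing in it says how to recognize those scenarios in the agent's own encodings of its observation history, which is exactly the missing translation step.

```python
# Toy sketch: learn a mapping from English moral-scenario descriptions to a utility score.
# The scenarios and labels are invented; this yields a text->utility function only,
# not anything defined over the agent's observation histories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

scenarios = [
    "a person helps a stranger carry groceries",
    "a person steals medicine from a hospital",
    "a person tells a comforting lie",
    "a person donates to disaster relief",
]
judgments = [1.0, -1.0, 0.2, 1.0]   # invented "sentiment/utility" labels

moral_model = make_pipeline(TfidfVectorizer(), Ridge())
moral_model.fit(scenarios, judgments)

# This scores text descriptions of scenarios; it says nothing about how to recognize
# those scenarios in the agent's own internal encodings of what it has observed.
print(moral_model.predict(["a person helps an injured animal"]))
```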
This really is the most realistic scenario for AGI in general, given the generality of the RL architecture.
Of course, “gradually training the AGI’s values through an extended childhood” gets tricky if it turns out that there’s a hard takeoff.
So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility: how do you translate that into the AI’s reward/utility function? You’d need to somehow also map encodings of imagined moral scenarios back and forth between encodings of observation histories.
I was thinking that the task of training the AI to classify human judgments would then lead to it building up a model of human values, similar to the way that training a system to do word prediction builds up a language / world model. You make a good point about the need to then ground those values further; I haven’t really thought about that part.
Of course, “gradually training the AGI’s values through an extended childhood” gets tricky if it turns out that there’s a hard takeoff.
Yes. Once you get the AGI up to roughly human child level, presumably autodidactic learning could take over. Reading and understanding text on the internet is one key activity that could likely be sped up by a large factor.
So—then we need ways to speed up human interaction and supervision to match.