With all of the above in mind, a quick survey of some of the things that you just said, with my explanation for why each one would not (or probably would not) be as much of an issue as you think:
As humans, we have a good idea of what “giving choices to people” vs. “forcing them to do something” looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the “forceful” category (even though it can be done with only text).
For a massive-weak-constraint system, psychological manipulation would be automatically understood to be in the forceful category, because the concept of “psychological manipulation” is defined by a cluster of features that involve intentional deception, and since the “friendliness” concept would ALSO involve a cluster of weak constraints, it would include the extended idea of intentional deception. It would have to, because intentional deception is connected to doing harm, which is connected with unfriendly, etc.
Conclusion: that is not really an “edge” case in the sense that someone has to explicitly remember to deal with it.
Very likely, the concept space will be very complicated and difficult for humans to understand.
We will not need to ‘understand’ the AGI’s concept space too much, if we are both using massive weak constraints, with convergent semantics. This point I addressed in more detail already.
This seems pretty similar to Paul’s idea of a black-box human in the counterfactual loop. I think this is probably a good idea, but the two problems here are (1) setting up this (possibly counterfactual) interaction in a way that approves a large class of good plans and rejects almost all bad plans (see the next section), and (2) having a good way to predict the outcome of this interaction usually without actually performing it. While we could say that (2) will be solved by virtue of the superintelligence being a superintelligence, in practice we’ll probably get AGI before we get uploads, so we’ll need some sort of semi-reliable way to predict humans without actually simulating them. Additionally, the AI might need to self-improve to be anywhere near smart enough to consider this complex hypothetical, and so we’ll need some kind of low-impact self-improvement system. Again, I think this is probably a good idea, but there are quite a lot of issues with it, and we might need to do something different in practice. Paul has written about problems with black-box approaches based on predicting counterfactual humans here and here. I think it’s a good idea to develop both black-box solutions and white-box solutions, so we are not over-reliant on the assumptions involved in one or the other.
What you are talking about here is the idea of simulating a human to predict their response. Now, humans already do this in a massive way, and they do not do it by making gigantic simulations, but just by doing simple modeling. And, crucially, they rely on the massive-weak-constraints-with-convergent-semantics (you can see now why I need to coin the concise term “Swarm Relaxation”) between the self and other minds to keep the problem manageable.
That particular idea—of predicting human response—was not critical to the argument that followed, however.
What language will people’s questions about the plans be in? If it’s a natural language, then the AI must be able to translate its concept space into the human concept space, and we have to solve a FAI-complete problem to do this.
No, we would not have to solve a FAI-complete problem to do it. We will be developing the AGI from a baby state up to adulthood, keeping its motivation system in sync all the way up, and looking for deviations. So, in other words, we would not need to FIRST build the AGI (with potentially dangerous alien semantics), THEN do a translation between the two semantic systems, THEN go back and use the translation to reconstruct the motivation system of the AGI to make sure it is safe.
Much more could be said about the process of “growing” and “monitoring” the AGI during the development period, but suffice it to say that this process is extremely different if you have a Swarm Relaxation system vs. a logical system of the sort your words imply.
We should also definitely be wary of a decision rule of the form “find a plan that, if explained to humans, would cause humans to say they understand it”.
This hits the nail on the head. This comes under the heading of a strong constraint, or a point-source failure mode. The motivation system of a Swarm Relaxation system would not contain “decision rules” of that sort, precisely because they could have large, divergent effects on the behavior. If motivation is, instead, governed by large numbers of weak constraints, then in this case your decision rule would be seen to be a type of deliberate deception, or manipulation, of the humans. And that contradicts a vast array of constraints that are consistent with friendliness.
Again, it’s quite plausible that the AI’s concept space will contain some kind of concept that distinguishes between these different types of optimization; however, humans will need to understand the AI’s concept space in order to pinpoint this concept so it can be integrated into the AI’s decision rule.
Same as previous: with a design that does not use decision rules that are prone to point-source failure modes, the issue evaporates.
To summarize: much depends on an understanding of the concept of a weak constraint system. There are no really good readings I can send you (I know I should write one), but you can take a look at the introductory chapter of McClelland and Rumelhart that I gave in the references to the paper.
Also, there is a more recent reference to this concept, from an unexpected source. Yann LeCun has been giving some lectures on Deep Learning in which he came up with a phrase that could have been used two decades ago to describe exactly the sort of behavior to be expected from SA systems. He titles his lecture “The Unreasonable Effectiveness of Deep Learning”. That is a wonderful way to express it: swarm relaxation systems do not have to work (there really is no math that can tell you that they should be as good as they are), but they do. They are “unreasonably effective”.
There is a very deep truth buried in that phrase, and a lot of what I have to say about SA is encapsulated in it.
Okay, thanks a lot for the detailed response. I’ll explain a bit about where I’m coming from with understanding the concept learning problem:
I typically think of concepts as probabilistic programs eventually bottoming out in sense data. So we have some “language” with a “library” of concepts (probabilistic generative models) that can be combined to create new concepts, and combinations of concepts are used to explain complex sensory data (for example, we might compose different generative models at different levels to explain a picture of a scene). We can (in theory) use probabilistic program induction to have uncertainty about how different concepts are combined. This seems like a type of swarm relaxation, due to probabilistic constraints being fuzzy. I briefly skimmed through the McClelland chapter and it seems to mesh well with my understanding of probabilistic programming. (I give a toy sketch of this view of concepts just after these bullet points.)
But, when thinking about how to create friendly AI, I typically use the very conservative assumptions of statistical learning theory, which give us guarantees against certain kinds of overfitting but no guarantee of proper behavior on novel edge cases. Statistical learning theory is certainly too pessimistic, but there isn’t any less pessimistic model for what concepts we expect to learn that I trust. While the view of concepts as probabilistic programs in the previous bullet point implies properties of the system other than those implied by statistical learning theory, I don’t actually have good formal models of these, so I end up using statistical learning theory. (The specific kind of guarantee I have in mind is spelled out just after these bullet points.)
I do think that figuring out if we can get more optimistic (but still justified) assumptions is good. You mention empirical experience with swarm relaxation as a possible way of gaining confidence that it is learning concepts correctly. Now that I think about it, bad handling of novel edge cases might be a form of “meta-overfitting”, and perhaps we can gain confidence in a system’s ability to deal with context shifts by having it handle a series of context shifts without overfitting. This is the sort of thing that might work, and more research into whether it does is valuable, but it still seems worth preparing for the case where it doesn’t.
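To make the first bullet concrete, here is a deliberately tiny sketch of that view (my own toy illustration; every model, name, and number is invented, and it is not anyone’s actual system): concepts are small generative functions, composite concepts are built by composing them, and “explaining the data” means scoring samples against soft, probabilistic constraints rather than applying hard rules.

```python
# Toy sketch of "concepts as probabilistic programs" (all models and numbers invented).
import random

def circle():
    # Concept: a generative model of a round blob with an uncertain size.
    return {"shape": "circle", "size": random.gauss(1.0, 0.3)}

def square():
    return {"shape": "square", "size": random.gauss(1.0, 0.3)}

def scene():
    # Composite concept: a scene is explained by composing two sub-concepts.
    return {"left": random.choice([circle, square])(),
            "right": random.choice([circle, square])()}

def likelihood(sample, observed):
    # Soft (weak) constraints: score how well a generated scene explains the data,
    # instead of demanding an exact match.
    score = 1.0
    for side in ("left", "right"):
        score *= 1.0 if sample[side]["shape"] == observed[side] else 0.05
    return score

def explain(observed, n=5000):
    # Likelihood-weighted sampling: the chosen explanation is whichever composition
    # of concepts best satisfies many weak constraints jointly.
    samples = [scene() for _ in range(n)]
    weights = [likelihood(s, observed) for s in samples]
    return max(zip(samples, weights), key=lambda sw: sw[1])[0]

best = explain({"left": "circle", "right": "square"})
print(best["left"]["shape"], best["right"]["shape"])
```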
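And to spell out the kind of guarantee I mean in the second bullet, the textbook uniform-convergence result says that, with probability at least $1-\delta$ over an i.i.d. sample of size $n$ drawn from a distribution $D$, every hypothesis $h$ in a class of VC dimension $d$ satisfies

$$\mathrm{err}_D(h) \;\le\; \widehat{\mathrm{err}}_n(h) + O\!\left(\sqrt{\frac{d\,\ln(n/d) + \ln(1/\delta)}{n}}\right),$$

and the crucial caveat is that the bound is stated only for the same distribution $D$ the sample came from; it promises nothing about inputs drawn from a shifted distribution, which is exactly the novel-edge-case worry.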
Anyway, thanks for giving me some good things to think about. I think I see how a lot of our disagreements mostly come down to how much convergence we expect from different concept learning systems. For example, if “psychological manipulation” is in some sense a natural category, then of course it can be added as a weak (or even strong) constraint on the system. I’ll probably think about this a lot more and eventually write up something explaining reasons why we might or might not expect to get convergent concepts from different systems, and the degree to which this changes based on how value-laden a concept is.
There is a lot of talk that can be given about how that complex union takes place, but here is one very important takeaway: it can always be made to happen in such a way that there will not, in the future, be any Gotcha cases (those where you thought you did completely merge the two concepts, but where you suddenly find a peculiar situation where you got it disastrously wrong). The reason why you won’t get any Gotcha cases is that the concepts are defined by large numbers of weak constraints, and no strong constraints—in such systems, the effect of smaller and smaller numbers of constraints can be guaranteed to converge to zero. (This happens for the same reason that the effect of smaller and smaller sub-populations of the molecules in a gas will converge to zero as the population sizes go to zero.)
I didn’t really understand a lot of what you said here. My current model is something like “if a concept is defined by lots of weak constraints, then lots of these constraints have to go wrong at once for the concept to go wrong, and we think this is unlikely due to induction and some kind of independence/uncorrelatedness assumption”; is this correct? If this is the right understanding, I think I have low confidence that errors in each weak constraint are in fact not strongly correlated with each other.
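To make the stakes of that independence assumption concrete, here is a toy calculation (the numbers are invented purely for illustration): if a concept only “goes wrong” when a majority of its weak constraints err simultaneously, then independent errors make failure astronomically rare, but even a small amount of common-cause correlation changes the picture entirely.

```python
# Toy comparison of independent vs. correlated constraint errors (invented numbers).
import math

def p_fails_independent(n, p):
    # Independent errors: the concept fails only if more than half of its n weak
    # constraints err at once -- a binomial tail probability.
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

def p_fails_correlated(n, p, rho):
    # Crude common-cause model: with probability rho a shared context shift makes
    # every constraint err with probability 0.6; otherwise errors are independent.
    return rho * p_fails_independent(n, 0.6) + (1 - rho) * p_fails_independent(n, p)

n, p = 101, 0.1
print(p_fails_independent(n, p))       # astronomically small (far below 1e-20)
print(p_fails_correlated(n, p, 0.01))  # roughly 0.01: the shared-cause term dominates
```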
I think you have homed in exactly on the place where the disagreement is located. I am glad we got here so quickly (it usually takes a very long time, where it happens at all).
Yes, it is the fact that “weak constraint” systems have (supposedly) the property that they are making the greatest possible attempt to find a state of mutual consistency among the concepts, that leads to the very different conclusions that I come to, versus the conclusions that seem to inhere in logical approaches to AGI. There is really no overstating the drastic difference between these two perspectives: this is not just a matter of two possible mechanisms, it is much more like a clash of paradigms (if you’ll forgive a cliche that I know some people absolutely abhor).
One way to summarize the difference is by imagining a sequence of AI designs, with progressive increases in sophistication. At the beginning, the representation of concepts is simple, the truth values are just T and F, and the rules for generating new theorems from the axioms are simple and rigid.
As the designs get better, various new features are introduced … but one way to look at the progression of features is that constraints between elements of the system get more widespread, and more subtle in nature, as the types of AI become better and better.
An almost trivial example of what I mean: when someone builds a real-time reasoning engine in which there has to be a strict curtailment of the time spent doing certain types of searches in the knowledge base, a wise AI programmer will insert some sanity checks that kick in after the search is curtailed. The sanity checks are a kind of linkage from the inference being examined to the rest of the knowledge that the system has, to see whether the truncated reasoning left the system in a state where it concluded something that is patently stupid. These sanity checks are almost always extramural to the logical process—for which read: they are glorified kludges—but in a real-world system they are absolutely vital. Now, from my point of view, what these sanity checks do is act as weak constraints on one little episode in the behavior of the system.
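A minimal caricature of that pattern, in code (every function, check, and threshold here is invented purely for illustration): a time-boxed search produces a possibly half-baked conclusion, and a handful of individually weak checks then vote on whether to accept it.

```python
# Caricature of a time-boxed inference step followed by weak "sanity checks"
# (all names and thresholds invented for illustration).
import time

def bounded_search(query, knowledge, budget_s=0.05):
    # Pretend real-time search: examine facts until the time budget runs out,
    # so the conclusion may be based on a truncated search.
    deadline = time.monotonic() + budget_s
    best = None
    for fact in knowledge:
        if time.monotonic() > deadline:
            break
        if query in fact:
            best = fact
    return best

SANITY_CHECKS = [
    lambda c: c is not None,                          # did we conclude anything at all?
    lambda c: c is None or len(c) < 200,              # is the conclusion absurdly long?
    lambda c: c is None or "contradiction" not in c,  # does it flag itself as inconsistent?
]

def conclude(query, knowledge):
    candidate = bounded_search(query, knowledge)
    # Each check is individually weak; the aggregate verdict, not any single rule,
    # decides whether the truncated reasoning produced something patently stupid.
    votes = sum(1 for check in SANITY_CHECKS if check(candidate))
    return candidate if votes >= 2 else None

print(conclude("sky", ["the sky is blue", "grass is green"]))
```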
Okay, so if you buy my suggestion that in practice AI systems become better the more they allow the little reasoning episodes to be connected to the rest of the system by weak constraints, then I would like to go one step further and propose the following:
1) As a matter of fact, you can build AI systems (or, parts of AI systems) that take the whole “let’s connect everything up with weak constraints” idea to an extreme, throwing away almost everything else (all the logic!) and keeping only the huge population of constraints, and something amazing happens: the system works better that way. (An old classic example, but one which still has lessons to teach, is the very crude Interactive Activation model of word recognition. Seen in its historical context it was a bombshell, because it dumped all the procedural programming that people had thought was necessary to do word recognition from features, and replaced it with nothing-but-weak-constraints … and it worked better than any procedural program was able to do.) A stripped-down sketch of this style of computation appears just after point 2, below.
2) This extreme attitude to the power of weak constraints comes with a price: you CANNOT have mathematical assurances or guarantees of correct behavior. Your new weak-constraint system might actually be infinitely more reliable and stable than any of the systems for which you could get some kind of mathematical guarantee of correctness or convergence, but you might never be able to prove that fact (except with some general talk about the properties of ensembles).
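Here is the sketch promised under point 1: a heavily stripped-down relaxation loop in the spirit of the Interactive Activation idea (the vocabulary, evidence values, and weights are all invented, and the real model is far richer). Letter-level evidence and word-level units act only as weak excitatory and inhibitory constraints, and the network simply settles on a word; no procedure ever “decides” the answer.

```python
# Stripped-down constraint-relaxation loop in the spirit of Interactive Activation
# (vocabulary, evidence values, and weights invented for illustration).
WORDS = ["cat", "car", "cab"]
letter_evidence = {0: {"c": 0.9}, 1: {"a": 0.9}, 2: {"t": 0.6, "r": 0.4}}  # noisy input

word_act = {w: 0.0 for w in WORDS}
for step in range(50):
    new_act = {}
    for w in WORDS:
        # Excitation: each letter position's evidence weakly supports consistent words.
        support = sum(letter_evidence.get(i, {}).get(ch, 0.0) for i, ch in enumerate(w))
        # Inhibition: competing word units weakly suppress one another.
        inhibition = sum(word_act[v] for v in WORDS if v != w)
        new_act[w] = max(0.0, 0.8 * word_act[w] + 0.1 * support - 0.05 * inhibition)
    word_act = new_act

print(max(word_act, key=word_act.get))   # the weak constraints settle on "cat"
```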
All of that is what is buried in the phrase I stole from Yann LeCun: the “unreasonable effectiveness” idea. These systems are unreasonably good at doing what they do. They shouldn’t be so good. But they are.
As you can imagine, this is such a huge departure from the traditional way of thinking in AI, that many people find it completely alien. Believe it or not, I know people who seem willing to go to any lengths to destroy the credibility of someone who suggests the idea that mathematical rigor might be a bad thing in AI, or that there are ways of doing AI that are better than the status quo, but which involve downgrading the role of mathematics to just technical-support level, rather than primacy.
--
On your last question, I should say that I was only referring to the fact that in systems of weak constraints, there is extreme independence between the constraints, and they are all relatively small, so it is hard for an extremely inconsistent ‘belief’ or ‘fact’ to survive without being corrected. This is all about the idea of “single point of failure” and its antithesis.
I briefly skimmed through the McClelland chapter and it seems to mesh well with my understanding of probabilistic programming.
I think it would not go amiss to read Vikash Mansinghka’s PhD thesis and the open-world generation paper to see a helpful probabilistic programming approach to these issues. In summary: we can use probabilistic programming to learn the models we need, use conditioning/query to condition the models on the constraints we intend to enforce, and then sample the resulting distributions to generate “actions” which are very likely to be “good enough” and very unlikely to be “bad”. We sample instead of inferring the maximum-a-posteriori action or expected action precisely because, as part of the Bayesian modelling process, we assume that the peak of our probability density does not necessarily correspond to an in-the-world optimum.
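For readers who want to see the shape of this in code, here is a minimal caricature (not taken from the thesis or the paper; the toy model and all numbers are invented): define a prior over actions and their uncertain outcomes, condition on the weak constraint “the outcome is good enough”, and then sample the conditioned distribution instead of taking its maximum.

```python
# Minimal caricature of "condition on the constraints, then sample rather than take
# the MAP action" (toy model and numbers invented for illustration).
import random

def prior():
    # Prior over actions and their (uncertain) outcomes.
    action = random.choice(["cautious", "bold"])
    mu, sigma = (0.7, 0.2) if action == "cautious" else (0.8, 1.5)
    return action, random.gauss(mu, sigma)

def sample_good_enough_action(threshold=0.6, max_tries=10000):
    # Rejection sampling = conditioning on the constraint "outcome is good enough",
    # then returning one sample from the conditioned distribution over actions.
    for _ in range(max_tries):
        action, outcome = prior()
        if outcome >= threshold:
            return action
    return "cautious"  # fall back if the constraint is (nearly) unsatisfiable

# Sampling keeps both actions in play, roughly in proportion to how often each is
# actually good enough; an argmax/MAP rule would always pick the single action that
# looks best at the peak, ignoring how much of its mass lies in bad outcomes.
counts = {"cautious": 0, "bold": 0}
for _ in range(2000):
    counts[sample_good_enough_action()] += 1
print(counts)
```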
I agree that choosing an action randomly (with higher probability for good actions) is a good way to create a fuzzy satisficer. Do you have any insights into how to:
create queries for planning that don’t suffer from “wishful thinking”, with or without nested queries. Basically, the problem is that if I want an action conditioned on receiving a high utility (e.g. we place a factor equal to e^(alpha * U) on the utility node U), then we are likely to choose high-variance actions, while inferring that the rest of the model works out in such a way that these actions return high utilities. (I give a toy numerical illustration of this problem just after the next item.)
extend this to sequential planning without nested nested nested nested nested nested queries
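On the first question, here is the toy numerical illustration I mentioned (everything is invented, and it is written only to exhibit the effect): the soft conditioning factor e^(alpha * U) ends up preferring an action whose utility is merely high-variance, because the conditioning is also free to infer that the latent randomness breaks in that action’s favor.

```python
# Toy illustration of the "wishful thinking" problem with utility-conditioned planning:
# a factor exp(alpha * U) favors a high-variance action, because the latent randomness
# is inferred to break in its favor. (All numbers invented for illustration.)
import math, random

ALPHA = 3.0

def rollout(action):
    # Latent outcome model: "safe" has mean utility 0.5 with low variance,
    # "gamble" has mean utility 0.0 with high variance.
    return random.gauss(0.5, 0.1) if action == "safe" else random.gauss(0.0, 2.0)

def conditioned_action_probs(n=200000):
    # Importance weights exp(ALPHA * U) implement the soft "condition on high utility".
    weights = {"safe": 0.0, "gamble": 0.0}
    for _ in range(n):
        action = random.choice(["safe", "gamble"])
        weights[action] += math.exp(ALPHA * rollout(action))
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

print(conditioned_action_probs())   # almost all mass ends up on "gamble"

# Expected utility prefers "safe" (0.5 vs 0.0), but the conditioning prefers "gamble":
# analytically, E[exp(3U)] = exp(3*0.5 + 9*0.01/2) ~= 4.7 for "safe" versus
# exp(0 + 9*4/2) = exp(18) ~= 6.6e7 for "gamble".
```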
That concept spaces can be matched without gotchas is reassuring, and may point in a direction in which AGI can be made friendly. If the concepts are suitably matched in your proposed checking modules. If. And if no other errors are made.