I think it’s good to be careful in going from “I think this person has made a mistake in this instance” to “and that’s a property of who they are.”
Right. I think there’s a lot of layers to this.
Words like “gullible” can be used to describe how a person has acted in a single instance, or their disposition towards some specific subject, or their general disposition in life, or a biological disability.
I think @Steven Byrnes made a good critique of calling Gus gullible in this specific case, though because his argument felt conceptually inverted from yours (he argued that it’s the agreement between John and Gus that makes it unreasonable to call Gus gullible, whereas you emphasized the disagreement when arguing that it is too combative), it ends up a bit tangential to the line of argument you are making. I don’t know of any better alternative word, though, so I will keep using it, with the caveat that I acknowledge it is somewhat unfair to Gus (and, by implication, to you).
(Oh also, another possibility for why Gus might not be gullible in this particular case: maybe he presented this dialogue as a post-hoc rationalization that isn’t his true reason for believing in corrigibility, rather than as a highly legible and easily demonstrable piece of evidence for corrigibility that he picked out of convenience.)
Partly because people’s activities and relationships are extended in time, it is possible (and common) for people to have stable dispositions that are specific to a single subject rather than generalized across all subjects. So for instance, while Gus might not be gullible towards people in general, this doesn’t mean that gullibility was just a one-off mistake he made in this instance, because if he doesn’t change his approach to evaluating the AI, he will keep making the same mistake. But if he does change his approach, then he can just go “Ah, derp, I see what you mean about my mistake, I’ll be less gullible about this in the future”, and I think that can reasonably stop it from being seen as a property of who he is.
There are three important sub-points to this. First, simply becoming more skeptical of what AIs say would be an overly forceful, deferential way to stop being gullible. He presumably had some reason, even if only an implicit heuristic model, for why he thought he could just believe what the AI says. If he just forced himself to stop believing it, he would be blindly destroying that model. It seems better to introspect on what his model says and why he believes it, to identify the “bug” in the model that led to this conclusion. It might be that only a minor change is needed (e.g. maybe just a boundary condition saying that there are some specific circumstances where he can believe it).
Another important sub-point is social pressure. Being labelled as gullible (whether with respect to some specific subject or more generally) can be embarrassing, and it can lead people to unfairly dismiss one’s opinions in general. That creates a desire to dodge the label, even if dodging it means misrepresenting one’s opinions to be more in accordance with consensus, which leads to a whole bunch of problems.
In a culture that is sufficiently informative/accountable/??? (which the rationalist community often is), I find that an effective alternative to caving to that pressure is to request more elaboration. But sometimes the people in the culture can’t provide a proper explanation, and it seems like when you attempted to ask for an explanation, rationalists mostly just did magical thinking. That might not need to be a problem, except that it’s hard to be sure there isn’t a proper explanation, and even harder to establish that as common knowledge. If it can’t become common knowledge that there is confusion in this area, then discourse can get really off the rails, because confusions can’t get corrected. I think there’s a need for some sort of institution or mechanism to solve this, though so far all the attempts I’ve come up with have had unresolvable flaws. In particular, while eliminating social pressure/status-reducing labels might make people more comfortable with questioning widespread assumptions, it would do so simply by eliminating critique in general, which would prevent the coordination and cooperation needed to solve problems out in the world.
Anyway, backing up from the sub-points to the point about general gullibility. Yes, even if Gus is gullible with respect to this specific way of thinking about AI, he might not be gullible in general. Gullibility isn’t very correlated across contexts, so in the general case one shouldn’t infer general gullibility from specific gullibility. But… I’m not actually sure that holds in your case? You’ve mentioned yourself that you have a track record of “Too much deference, too little thinking for myself”, you’ve outright defended general gullibility in the context of this discussion, and while you endorse the risk of AI destroying the world, I have trouble seeing how it is implied by your models, so I suspect your belief in AI x-risk may be a result of deference too (though maybe you just haven’t gotten around to writing up the way your models predict x-risk, or maybe I’ve just missed it). So you might just be generally gullible.
This doesn’t need to imply that you have a biological disability that makes you unalterably gullible, though! After all, I have no reason to think that’s the case. (Some people would say “but twin studies show everything to be heritable!”, but that’s a dumb argument.) Rather, gullibility might be something worth working on (especially because you are otherwise quite a skilled and productive researcher, so it’d be sad if this ended up as a major obstacle for you).
I’m not sure whether I’m the best or the worst person to advise you on gullibility. Like I am kind of an extreme case here, being autistic, having written a defense of some kinds of gullibility, having been tricked in some quite deep ways, and so on. But I have been thinking quite deeply about some of these topics, so maybe I have some useful advice.
Gullibility is mainly an issue when it comes to latent variables, i.e. information that cannot easily be observed precisely, but instead has imperfect correlations with observable variables. There are lots of important decisions whose best choice depends on these variables, so you’ll be searching for information about them, and lots of people will claim to offer it. Because the uncertainty is high, their claims can sway your actions a lot, but often the information they offer is bad or even biased, which can sway your actions in bad ways.
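To make the “high uncertainty means reports sway you a lot” point concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not anything from the dialogue): a single report about a latent quantity, whether honest or biased, ends up dominating a diffuse prior either way.

```python
def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate normal-normal posterior after a single observation."""
    post_var = 1 / (1 / prior_var + 1 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

# Made-up numbers for illustration only.
true_value = 0.2                    # latent variable you can't observe directly
prior_mean, prior_var = 0.0, 4.0    # you start out very uncertain about it

honest_report = true_value + 0.1    # noisy but unbiased informant
biased_report = true_value + 1.5    # informant with an agenda

for label, report in [("honest", honest_report), ("biased", biased_report)]:
    mean, var = normal_update(prior_mean, prior_var, report, obs_var=0.25)
    print(f"{label} report {report:+.2f} -> posterior mean {mean:+.2f} (var {var:.2f})")
```

With a prior variance of 4 and a report variance of 0.25, the posterior mean lands about 94% of the way towards whatever the report says, which is exactly the sense in which a biased informant can move you a lot.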
To avoid errors due to gullibility, it can be tempting to find ways to simply observe these variables directly. But there are often multiple similar, correlated latent variables, and when coming up with a way to observe one of them, one often ends up observing a different one. The way to solve this is to ask “what is the purpose of observing this latent variable?”, and then pick the variable that fits that purpose, with the understanding that different purposes call for different definitions. (This is the operationalization issue John Wentworth walks through in the OP.) In particular, bold causal reasoning (though not necessarily mechanistic reasoning, since we don’t know how future AIs will be implemented) helps because it gets right to the heart of which definition to select.
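A quick sketch of that proxy problem (again my own toy example; the trait names are hypothetical, not from the OP): a test designed to measure one latent trait can end up mostly tracking a correlated neighbour.

```python
# Hypothetical trait names, made up for illustration: "agreeableness"
# (says agreeable things) vs. "deference" (actually defers to shutdown).
# A test meant to measure the second mostly loads on the first.
import random

random.seed(0)

def simulate_agent():
    agreeableness = random.gauss(0, 1)                        # latent A
    deference = 0.3 * agreeableness + random.gauss(0, 1)      # latent B, correlated with A
    test_score = 0.9 * agreeableness + 0.2 * deference + random.gauss(0, 0.3)
    return agreeableness, deference, test_score

samples = [simulate_agent() for _ in range(10_000)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

A, B, S = zip(*samples)
print("corr(test, agreeableness):", round(corr(S, A), 2))  # ~0.94
print("corr(test, deference):    ", round(corr(S, B), 2))  # ~0.46
```

The specific numbers don’t matter; the point is that the score can correlate far more strongly with the trait you didn’t mean to measure, and asking “what do I need this measurement for?” is what tells you whether that matters.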
Another tempting approach to avoiding errors due to gullibility is to look into improving one’s standards for selecting people to trust. Unfortunately, “trustworthiness” is actually a huge class of latent variables, since there are many different contexts in which one might need to trust someone, and it therefore suffers from the same problems as above. Rather than simply trusting or not trusting people, it is better to model people as a mosaic of complex characteristics, many of which can be great virtues in some situations and vices in others. For instance, Eliezer Yudkowsky is very strongly attracted to generally applicable principles, which makes him a repository of lots of deep wisdom, but it also allows him to have poorly informed opinions about lots of topics.
A final point: context and boundary conditions. Similar-sounding ideas often come up in different kinds of situations, working in one kind of situation but not in another. Those in the former kind of situation might say that the ideas work, while those in the latter kind might say that they don’t. If one learned more about both of their situations, one might be able to find out what the boundary conditions of the ideas are.
One complication to this final point: a lot of the time, people just adopt ideas because their community has them, in which case they might not have any particular situation in mind where the ideas do or do not work. In such a case, one kind of needs to track down the origin of the idea to figure out the boundary conditions, which is a giant pain and usually infeasible.