Wei Dai comments on Managing risks while trying to do good

Wei Dai 2 Feb 2024 2:56 UTC
7 points
0
I wrote a post expressing similar sentiments but perhaps with a different slant. To me, apparent human morality along the lines of “heretics deserve eternal torture in hell”, or what was expressed during the Chinese Cultural Revolution, is itself largely a product of status games, and there’s a big chance that these apparent values do not represent people’s true values and instead represent some kind of error (but I’m not sure and would not want to rely on this being true). See also Six Plausible Meta-Ethical Alternatives for some relevant background.

But you’re right that the focus of my post here is on people who endorse altruistic values that seem more reasonable to me, like EAs, and maybe earlier (pre-1949) Chinese supporters of communism who were mostly just trying to build a modern nation with a good economy and good governance, but didn’t take seriously enough the risk that their plan would backfire catastrophically.
- ThomasCederborg 2 Feb 2024 21:34 UTC
  3 points
  0
  I don’t think that they are all status games. If they were, then why did people (for example) include long meditations in private diaries on whether or not they personally deserve to go to hell? While they were focusing on the “who is a heretic?” question, it seems that they were taking for granted the normative position: “if someone is a heretic, then she deserves eternal torture in hell”. But, on the other hand, private diaries are of course sometimes opened while the people who wrote them are still alive (this is not the most obvious thing that someone would like others to read in a stolen diary. But people are not easy to interpret, especially across centuries of distance. Maybe for some people, someone else stealing their diary and reading such meditations would be awesome). And people are not perfect liars, so maybe the act of making such entries is mostly an effective way of getting into an emotional state such that one seems genuine when expressing remorse to other people? So, maybe any reasonable way of extrapolating a diarist like this will lead to a mind that finds the idea of hell abhorrent. There is a lot of uncertainty here. There is probably also a very, very large diversity among the set of humans that have adopted a normative position along these lines (and not just in terms of terminology, and in terms of who counts as a heretic. Also in terms of what it is that was lying underneath the adoption of such normative positions. It would not be very surprising if a given extrapolation procedure leads to different outcomes for two individuals that sound very similar). As long as we agree that any AI design must be robust to the possibility that people mean what they say, then perhaps these issues are not critical to resolve (but, on the other hand, maybe digging into this some more will lead to genuinely important insights).
(I agree that there were probably a great number of people, especially early on, who were trying to achieve things that most people today would find reasonable, but whose actions contributed to destructive movements. Such issues are probably a lot more problematic in politics than in the case where an AI is getting its goal from a set of humans.) (None of my reasoning here is done with EAs in mind.)
  I think that there exists a deeper problem for the proposition that perhaps it is possible to find some version of CEV that is actually safe for human individuals (as opposed to the much easier task of finding a version of CEV such that no one is able to outline a thought experiment, before launch time, that shows why this specific version would lead to an outcome that is far, far worse than extinction). Specifically, I’m referring to the fact that “heretics deserve eternal torture in hell” style fanatics (F1) are just one very specific example of a group of humans that might be granted extreme influence over CEV. In a population of billions, there will exist a very, very large number of “never-explicitly-considered” types of minds. Consider for example a different, tiny group of Fanatics (F2) who (after being extrapolated) have a very strong “all or nothing” attitude and a sacred rule against negotiations (let’s explore what happens in the case where this attitude is related to a religion, and where one in a thousand humans will be part of F2). Unless negotiations deadlock in a very specific way, PCEV will grant F2 exactly zero direct influence. However, let’s explore what happens if another version of CEV is launched that first maps each individual to a Utility function and then maximises the Sum of those functions (USCEV). During the process where a member of this religion, whom we can call Gregg, “becomes the person that Gregg wants to be”, the driving aspect of Gregg’s personality is a burning desire to become a true believer and become morally pure. This includes becoming the type of person that would never break the sacred set of rules: “Never accept any compromise regarding what the world should look like! Never negotiate with heretics! Always take whatever action is most likely to result in the world being organised exactly as is described in the sacred texts!”.
So, the only reasonable way to map extrapolated Gregg to a utility function is to assign maximum utility to the Outcome demanded by the Sacred Texts (OST), and minimum utility to every other outcome. Besides the number of people in F2, the bound on how bad OST can be (from the perspective of the non-believers) and still be the implemented outcome is that USCEV must be able to think up something that is far, far worse (technically, the minimum is not actually the worst possible outcome, but instead the worst outcome that USCEV can think up for each specific non-believer). As long as there is a very large difference between OST and the worst thing that USCEV can think up, then OST will be the selected outcome. Maybe OST will look ok to a non-superintelligent observer. For example, OST could look like a universe where every currently existing human individual, after an extended period of USCEV-guided self-reflection, converges on the same belief system (and all subsequent children are then brought up in this belief system). Or maybe it will be overtly bad, with everyone forced to convert or die. Or maybe it will be a genuine s-risk, for example along the lines of LP.
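A toy numerical sketch of the USCEV argument above, under loudly stated assumptions: each person's utilities are range-normalised to [0, 1] over the set of outcomes considered (so 0 is the worst outcome that can be thought up for that person), there are only three candidate outcomes, and all non-believers share one preference profile. The names `uscev_choice`, `ost_cost` and `worst_cost`, and all the numbers, are hypothetical illustrations and not part of any CEV proposal.

```python
def normalise(raw):
    """Rescale one person's raw scores so the worst outcome is 0 and the best is 1."""
    lo, hi = min(raw.values()), max(raw.values())
    return {outcome: (value - lo) / (hi - lo) for outcome, value in raw.items()}


def uscev_choice(n_believers, n_others, ost_cost, worst_cost):
    """Pick the outcome with the highest sum of normalised utilities.

    ost_cost:   how bad OST is for a non-believer, relative to GOOD (assumed > 0).
    worst_cost: how bad the worst imaginable outcome is for a non-believer
                (assumed > ost_cost).
    """
    # An F2 member: maximum utility for OST, minimum for everything else.
    believer = normalise({"GOOD": 0.0, "OST": 1.0, "WORST": 0.0})
    # A non-believer: GOOD is best, OST is worse, WORST is the floor.
    other = normalise({"GOOD": 0.0, "OST": -ost_cost, "WORST": -worst_cost})
    totals = {
        outcome: n_believers * believer[outcome] + n_others * other[outcome]
        for outcome in believer
    }
    return max(totals, key=totals.get), totals


# One F2 member per thousand people, as in the example above.
mild_floor, _ = uscev_choice(1, 999, ost_cost=1.0, worst_cost=100.0)
deep_floor, _ = uscev_choice(1, 999, ost_cost=1.0, worst_cost=1_000_000.0)
print(mild_floor, deep_floor)
```

In this toy setup OST wins exactly when `worst_cost` exceeds `n_others / n_believers` times `ost_cost`: the further down the worst imaginable outcome sits, the more each non-believer's normalised scale is compressed near the top, and the less OST costs them in summed-utility terms.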
  As far as I can tell, CEV in general, and PCEV in particular, is still the current state of the art in terms of finding an answer to the “what alignment target should be aimed at?” question (and CEV has been the state of the art now for almost two decades). I find this state of affairs strange, and deeply problematic. I’m confused by the relatively low interest in efforts to make further progress on the “what alignment target should be aimed at?” question (I think that, for example, the explanation in the original CEV document from 2004 was a very good explanation for why this question matters. And I don’t think that it is a coincidence that the specific analogy used to make that point was a political revolution (a brief paraphrasing: such a revolution must (i): succeed, and also (ii): lead to a new government that is actually a good government. Similarly, an AI must (i): hit an alignment target, and also (ii): this alignment target must be a good thing to hit)). Maybe I shouldn’t be surprised by this relative lack of interest. Maybe humans are just not great, in general, at reacting to “AI danger”. But it still feels like I’m not seeing, I don’t know, … something (wild speculation by anyone that, at any point, happens to stumble upon this comment, regarding what this … something … might be, is very welcome. Either in a comment, or in a DM, or in an email).
- M. Y. Zuo 2 Feb 2024 7:10 UTC
  1 point
  0
  There just may be systematic overvaluation of what people say instead of what they do, by practically everyone.
  For the average person, who is far from producing genuinely original ideas/insights/arguments/etc., everything they say throughout their entire life might even be worth less, realistically, than a fancy dinner to a random passing reader.
  Conversely, taking a bit of actual effort to buy said reader a fancy dinner probably more than doubles that value, at least in the eyes of the person getting to eat it.
  Of course the opposite pretence needs to be maintained very often in normal day to day life, and after enough times, folks start genuinely believing the opposite. That just the mere prospect of losing a minor verbal status game implies that they must fanatically counter-signal.
  (i.e. the needle of their perception gets harder and harder to move over time)
  Which would explain the observed phenomena throughout history.