(2) Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.
(4) If AGI learns law, it will understand how to interpret vague human directives and societal human values well enough that its actions will not cause states of the world that dramatically diverge from human preferences and societal values.
I think (2) is straightforwardly false: the lethal difficulty is not getting something that understands what humans want; it’s that by default it’s very unlikely to care! The same argument implies that (4) might be true as written, but still irrelevant to x-risk.
Additionally, “Law” is not unique, not always humane or just, and inconsistent between countries or even subnational jurisdictions; consider, for example, reproductive rights. I reject the assertion that law is the only legitimate source of human values, or indeed that law is a source of values at all (I’d describe it as a partial expression of them). Law might indeed turn out to be useful, but I don’t see it as distinguished from other forms of nonfiction, or for that matter from novels, poetry, etc.
I don’t think anyone is claiming that law is “always humane” or “always just” or anything of that nature.
This post is claiming that law is imperfect, but that there is no better synthesized source of human values than democratic law. You note that law is not distinguished from “other forms of nonfiction or for that matter novels, poetry, etc” in this context, but the most likely second-best synthesized source of human values would not be something like poetry—it would be ethics. And there are some critical distinguishing factors between law and ethics (and certainly between law and something like poetry):
There is no unified ethical theory precise enough to be practically useful for AI understanding human preferences and values.
Law, on the other hand, is actionable now in a practical, real-world way.
Ethics does not have any rigorous tests of its theories. We cannot validate ethical theories in any widely agreed-upon manner.
Law, on the other hand, although deeply theoretical and debated by academics, lawyers, and millions of citizens, is constantly formally tested through agreed-upon forums and processes.
There is no database of empirical applications of ethical theories (especially not one with sufficient ecological validity) that can be leveraged by machine learning processes.
Law, on the other hand, has reams of data on empirical application with sufficient ecological validity (real-world situations, not disembodied hypotheticals).
Ethics, by its nature, lacks settled precedent across, and even within, theories. There are, justifiably, fundamental disagreements between reasonable people about which ethical theory would be best to implement.
Law, on the other hand, has settled precedent, which can be updated to evolve with human values changing over time.
Even if AGI designers (impossibly) agreed on one ethical theory (or ensemble of underlying theories) being “correct,” there is no mechanism to align the rest of the humans around that theory (or meta-theory).
Law, on the other hand, has legitimate authority imposed by government institutions.
Even if AI designers (impossibly) agreed on one ethical theory (or ensemble of underlying theories) being “correct,” it is unclear how any consensus update mechanism to that chosen ethical theory could be implemented to reflect evolving (usually, improving) ethical norms. Society is likely more ethical than it was in previous generations, and humans are certainly not at a theoretically achievable ethical peak now. Hopefully we continue on a positive trajectory. Therefore, we do not want to lock in today’s ethics without a clear, widely-agreed-upon, and trustworthy update mechanism.
Law, on the other hand, is formally revised to reflect the evolving will of citizens.
Thanks for an excellent reply! One possible crux is that I don’t think that synthesized human values are particularly useful; I’d expect that AGI systems can do their own synthesis from a much wider range of evidence (including law, fiction, direct observation, etc.). As to the specific points, I’d respond:
There is no unified legal theory precise enough to be practically useful for AI understanding human preferences and values; liberal and social democracies alike tend to embed constraints in law, with individuals and communities pursuing their values in the lacunae.
The rigorous tests of legal theories are carried out inside the system of law, and bent by systems of unjust power (e.g. disenfranchisement). We cannot validate laws or legal theories in any widely agreed-upon manner.
Law often lacks settled precedent, especially regarding new technologies, or disagreements between nations or different cultures.
I reject the assertion that imposition by a government necessarily makes a law legitimate. While I agree we don’t have a mechanism to ‘align the rest of the humans’ with a theory or meta-theory, I don’t think this is relevant (and in any case it’s equally applicable to law).
I agree that “moral lock-in” would be a disaster. However, I dispute that law accurately reflects the evolving will of citizens; or the proposition that so reflecting citizens’ will is consistently good (cf. reproductive rights, civil rights, impacts on foreign nationals or future generations...)
These points are about law as it exists as a widely-deployed technology, not idealized democratic law. However, only the former is available to would-be AGI developers!
Law does indeed provide useful evidence about human values, coordination problems, and legitimacy—but this alone does not distinguish it.
Thanks for the reply.

There does seem to be legal theory precise enough to be practically useful for AI understanding human preferences and values. To take just one example: the huge amount of legal theory on how to craft directives, for instance whether to make directives in contracts and legislation more rule-like or more standard-like. Rules (e.g., “do not drive more than 60 miles per hour”) are more targeted directives than standards. If comprehensive enough for the complexity of their application, rules give the rule-maker more clarity than standards over the outcomes that will be realized conditional on the specified states (and agents’ actions in those states, which are a function of any behavioral impact the rules might have had). Standards (e.g., “drive reasonably” for California highways) allow parties to contracts, judges, regulators, and citizens to develop shared understandings and adapt them to novel situations (i.e., to generalize expectations regarding actions taken to unspecified states of the world). If rules are not written with enough potential states of the world in mind, they can lead to unanticipated undesirable outcomes (e.g., a driver following the rule above is too slow to bring their passenger to the hospital in time to save their life), but enumerating all the potentially relevant state-action pairs is excessively costly outside of the simplest environments.

In practice, most legal provisions land somewhere on a spectrum between pure rule and pure standard, and legal theory can help us estimate the right location and combination of “rule-ness” and “standard-ness” when specifying new AI objectives (a toy sketch of this rule-versus-standard contrast appears at the end of this comment). There are other helpful dimensions of legal provision design related to the rule-ness versus standard-ness axis that could further elucidate AI design, e.g., “determinacy,” “privately adaptable” rules (“rules that allocate initial entitlements but do not specify end-states”), and “catalogs” (“a legal command comprising a specific enumeration of behaviors, prohibitions, or items that share a salient common denominator and a residual category—often denoted by the words ‘and the like’ or ‘such as’”).
Laws are validated in a widely agreed-upon manner: court opinions.
I agree that law lacks settled precedent across nations, but within a nation like the U.S. there is, at any given time, a settled precedent. New precedents are routinely set, but at any given time there is a body of law that represents the latest version.
It seems that a crux of our overall disagreement about the usefulness of law is whether imposition by a democratic government makes a law legitimate. My arguments depend on that being true.
In response to “I dispute that law accurately reflects the evolving will of citizens; or the proposition that so reflecting citizens’ will is consistently good”, I agree it does not represent the evolving will of citizens perfectly, but it does so better than any alternative. I think reflecting the latest version of citizens’ views is important because I hope we continue on a positive trajectory to having better views over time.
The bottom line is that democratic law is far from perfect, but, as a process, I don’t see any better alternative that would garner the buy-in needed to practically elicit human values in a scalable manner that could inform AGI about society-level choices.
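As the sketch promised above: here is a minimal, purely illustrative toy example of the rule-versus-standard contrast. The class, function names, thresholds, and scoring are my own assumptions (not drawn from any existing legal-informatics work); the point is only to show how the two styles of directive might look when specifying a constraint on an AI driving policy.

```python
from dataclasses import dataclass

@dataclass
class DrivingState:
    speed_mph: float
    visibility_m: float
    is_emergency: bool

# Rule-like directive: a bright-line test over an explicitly specified state variable.
# Cheap to check and predictable, but brittle in unanticipated states
# (e.g. a genuine emergency where exceeding the limit is the better outcome).
def violates_rule(state: DrivingState, limit_mph: float = 60.0) -> bool:
    return state.speed_mph > limit_mph

# Standard-like directive: "drive reasonably", operationalized here as a scored
# judgment over several features of the situation. It generalizes to states the
# drafter never enumerated, at the cost of requiring interpretation (here a
# hand-tuned score; in practice, a learned or adjudicated judgment).
def reasonableness_score(state: DrivingState) -> float:
    score = 1.0
    if state.visibility_m < 50:   # low visibility demands more caution
        score -= 0.4
    if state.speed_mph > 80:      # very high speed is rarely reasonable
        score -= 0.5
    if state.is_emergency:        # emergencies justify more latitude
        score += 0.3
    return score

def violates_standard(state: DrivingState, threshold: float = 0.5) -> bool:
    return reasonableness_score(state) < threshold

if __name__ == "__main__":
    emergency = DrivingState(speed_mph=75, visibility_m=200, is_emergency=True)
    # The rule flags the emergency driver; the standard, as scored here, does not.
    print(violates_rule(emergency), violates_standard(emergency))
```

In practice one would presumably want a hybrid: hard rules for well-understood states and a learned or adjudicated standard for the residual category, which is roughly the “rule-ness”/“standard-ness” spectrum described above.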
I agree that getting an AGI to care about goals that it understands is incredibly difficult and perhaps the most challenging part of alignment. But I think John’s original claim is true: “Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.” Here are some prominent articulations of the argument.
Unsolved Problems in ML Safety (Hendrycks et al., 2021) says that “Encoding human goals and intent is challenging.” Section 4.1 is therefore about value learning. (Section 4.2 is about the difficulty of getting a system to internalize those values.)
Rohin Shah’s sequence on Value Learning argues that “Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.”
Specification Gaming: The Flip Side of AI Ingenuity (Krakovna et al., 2020) says the following: “Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. Even for a slight misspecification, a very good RL algorithm might be able to find an intricate solution that is quite different from the intended solution, even if a poorer algorithm would not be able to find this solution and thus yield solutions that are closer to the intended outcome. This means that correctly specifying intent can become more important for achieving the desired outcome as RL algorithms improve. It will therefore be essential that the ability of researchers to correctly specify tasks keeps up with the ability of agents to find novel solutions.”
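To make that dynamic concrete, here is a minimal toy sketch (my own construction, not an example from Krakovna et al.; the environment, horizon, and reward are illustrative assumptions) in which a stronger optimizer exploits a slightly misspecified reward that a weaker, more literal policy does not:

```python
# Intended task: reach the goal cell quickly on a 1-D line.
# Misspecified reward: +1 for every timestep spent adjacent to the goal,
# and the episode only ends when the goal cell itself is entered.

HORIZON = 20
GOAL = 9

def run(policy) -> float:
    pos, total = 0, 0.0
    for _ in range(HORIZON):
        pos += policy(pos)
        if pos == GOAL:           # entering the goal ends the episode
            return total + 1.0
        if abs(pos - GOAL) == 1:  # proximity bonus: the misspecified part
            total += 1.0
    return total

# "Weak" policy: walk straight to the goal, as the designer intended.
intended = lambda pos: 1

# "Strong" policy: walk up to the goal, then hover next to it forever,
# farming the proximity bonus instead of finishing the task.
exploit = lambda pos: 1 if pos < GOAL - 1 else 0

print(run(intended))  # small return: finishes quickly
print(run(exploit))   # much larger return: never completes the intended task
```

The proximity bonus is the “slight misspecification”: it only matters once an optimizer is capable enough to notice that hovering pays better than finishing.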
John is suggesting one way of working on the outer alignment problem, while Zach is pointing out that inner alignment is arguably more dangerous. These are both fair points IMO. In my experience, people on this website often reject work on specifying human values in favor of problems that are more abstract but seen as more fundamentally difficult. Personally I’m glad there are people working on both.